Skip to content

Conversation

valosekj
Copy link
Member

@valosekj valosekj commented Dec 6, 2022

The PR adds details on which files should be included within each data.neuro.polymtl.ca/datasets/ BIDS dataset. Also, I proposed templates for these files.

So far, I have put everything within this repository's README. To view the README in a nice way, visit my_branch.
But maybe, we could move this information to the repo's Wiki section? And just put a link to Wiki to the README?

The PR is probably related also to one older PR.

Some useful links:

@valosekj valosekj changed the title Add BIDS files to README Add details on which files should be included in BIDS dataset Dec 6, 2022
README.md Outdated

## naming convention

sub-<site><pathology>XXX

example:
sub-zurichDCM001
sub-montrealDCM001
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "DCM" from "dicom" or is it the name of a pathology?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Degenerative Cervical Myelopathy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a point for discussion-- do we want to include site (e.g., montreal) and pathology (e.g., DCM) to the filename? Such filename is still BIDS compliant and is very explicit (which is good --> no need to open the participants.tsv file to check the pathology and site).
On the other hand, for single-site or single-pathology datasets, this information will be redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to systematically add this info in the file name (we do it for the INSPIRED study because this is how the data were originally organized, and it is 'easy' to inspect)-- however, this info always needs to be present in the participants.tsv file, because we will use it to filter our multiple DBs for creating deep learning models

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! Then, I'll upgrade the description as the following:

## naming convention

sub-<institution><pathology>XXX

example:
sub-montrealDCM001

Note: use <institution> and <pathology> only if relevant (e.g., multi-institution dataset or dataset with different pathologies). However, always keep <institution> and <pathology> in the participants.tsv file!!!

README.md Outdated
----------------------------
Dataset shared by: <NAME AND EMAIL>
<IF THERE WAS EMAIL COMM>Email communication: <DATE OF EMAIL AND SUBJECT>
<REPOSITORY OF PROJECT/MODEL, etc>Repository: https://github.com/<organization>/<repository_name>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wary of building in the assumption that datasets and analyses are 1:1. From that diagram I keep referencing on https://handbook.datalad.org/en/latest/basics/101-127-yoda.html, a key feature of reproducible science is that datasets can be reanalysed by other parties or reinterpreted, so that there is a tree of derived works.

If we write this line in, it strongly suggests there can only be a linear descent from each dataset. The clever lab member will see they can just delete this line from the README if they are intending to have their work reused widely, and the clever downstream user will see they can just ignore this line, but the confused and overworked will see this and assume each dataset can only be analysed once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Maybe we can also make it clearer that more than one repos can be listed (one per project).

README.md Outdated
"contact": "URL to forum/issue AND/OR name AND/OR anything relevant to trace back the image",
}
}
<IF THERE ARE UPDATES IN THE DATASET:><YYYY>-<MM>-<DD>: Added new data from <centre ABC>
Copy link
Contributor

@kousu kousu Dec 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a changelog part of the bids spec? git provides better changelogs.

If we do need to duplicate the changelog into the README, we should mandate that it's done via a script, the way we do e.g. for sct or like this common tool does it.

I would try to not put a changelog in the README at all. Let's encourage people to appreciate git. The work I am doing in https://github.com/neuropoly/computers/issues/167 will help! It will give us a nice accessible frontend to our datasets complete with their history a click away. You can see my demo at https://data.dev.neuropoly.org/neuropoly/spine-generic-single/:

Screenshot 2022-12-06 at 17-38-14 spine-generic-single

Screenshot 2022-12-06 at 17-40-15 spine-generic-single

I am planning to replace the command-line-only internal server ssh://data.neuro.polymtl.ca with this software, within a couple months. And I also want to put the same software on https://data.neuropoly.org for our public datasets.

Copy link
Member

@jcohenadad jcohenadad Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a changelog part of the bids spec?

Yes it is. Good catch. So we should mention in the README that any update should be documented in the CHANGES file. And yes, agreed we should also leverage git.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHANGES is part of the bids spec, but as an optional file (spec link). So I think it would be reasonable to omit the CHANGES file and include something like "please see git log for a history of changes" in README.md.

README.md Outdated

This is an MRI dataset acquired in the context of the <XYZ> project.
It also contains manual segmentation of <MS lesions/tumors/etc> from <one/two/or more> expert raters.
Segmentation are located under the derivatives folder.
Copy link
Contributor

@kousu kousu Dec 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, today we had a good long discussion where we contrasted BIDS's derivatives/ idea against Datalad's YODA idea. We realized they are sort of opposites: Datalad suggests derived data should be a separate repos with pointers back to their sources while BIDS says all derived data should live in the derivatives/ folder in a single repo.

@mguaypaq and I both lean towards the YODA model. BIDS's method is fine for a single researcher, but a researcher trying to collaborate, especially one trying to collaborate across borders, is going to struggle.

But I guess segmentations are a kind of data that's very specific and closely related to the source data. It would be awkward to keep them in a fully separate repo. So maybe segmentations get a pass. But in general I would discourage the use of the derivatives/ folder entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I guess segmentations are a kind of data that's very specific and closely related to the source data.

yes, precisely. In fact, in most cases, these are generated manually by radiologists, and as you rightly pointed out, are going in pair with the source data. I do agree though, that for derived (ie: product of source data going through some sort of analysis pipeline), the YODA model is more attractive.

README.md Outdated
Convention for derivatives JSON metadata:

{
"Author": "Firstname Lastname",
Copy link
Contributor

@kousu kousu Dec 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nitpicky but https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/

I'm not sure how to make this more clear though. "<commonly accepted academic name>" is too wordy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point! Maybe we can ask for the author's ORCID or Google Scholar username?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could but i wouldn't push this too much-- eg: if someone does not have an ORCID of Gscholar username they would be tempted to leave this empty, which would be worst than just putting their names.

README.md Outdated
| sub-001 | HC | 001 | 01 | montreal |

- `participant_id` - unique participant ID
- `pathology` - pathology name; can take values listed in the [pathology column](https://docs.google.com/spreadsheets/d/1yjcA8Z0COn4OZxusIDHjStH2DpeXvscsj-aWE2X-_sg/edit?usp=sharing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unfamiliar with this part of our work. Do most of our scans come with pathologies? If there's no pathology, do we leave it blank? Or write "n/a" (bids-validator prefers "n/a")?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have HC for Healthy Control

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, maybe I could make the line more explicit:

Suggested change
- `pathology` - pathology name; can take values listed in the [pathology column](https://docs.google.com/spreadsheets/d/1yjcA8Z0COn4OZxusIDHjStH2DpeXvscsj-aWE2X-_sg/edit?usp=sharing)
- `pathology` - pathology name; can take values listed in the [pathology column](https://docs.google.com/spreadsheets/d/1yjcA8Z0COn4OZxusIDHjStH2DpeXvscsj-aWE2X-_sg/edit?usp=sharing). If you are working with healthy control, use `HC`.

README.md Outdated
Comment on lines 110 to 111
- `data_id` - subject ID used to locate the unprocessed data as under `duke/mri/`
- `institution_id` - institution ID used to locate the unprocessed data as under `duke/mri/`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not refer to duke anywhere here. duke will eventually go away. These datasets should be self-contained.

smb://duke.neuro.polymtl.ca/, even the more restricted smb://duke.neuro.polymtl.ca/mri/ subset, is not organized well enough for this ID scheme to work in my opinion. Here's a random sampling of some of the subject IDs from it:

p115628@joplin:~$ find /mnt/duke/mri/ -mindepth 2 -maxdepth 2 -type d | grep -v ukbiobank | sort -R | head
/mnt/duke/mri/hcp/857263
/mnt/duke/mri/hcp/597869
/mnt/duke/mri/ucsf/Myelopathy_Data
/mnt/duke/mri/hcp/679770
/mnt/duke/mri/sct-users/20150723_SpinalCord_1
/mnt/duke/mri/hcp/107018
/mnt/duke/mri/hcp/114621
/mnt/duke/mri/hcp/163331
/mnt/duke/mri/ucl/spine-generic
/mnt/duke/mri/biospective/SpinalCord 3 2015-07-23

So would that be hcp-107018 might be a sensible enough ID, but ucsf-Myelopathy_Data would be silly.

Why do we need to keep this mapping at all? Isn't that kind of sketchy? In an anonymized dataset, doesn't keeping that risk breaking anonymization?

Where we must keep the mapping, the YODA plan would say to upload a separate git-annex repo to the same server and then use e.g. git submodule add rawdata ssh://data.neuro.polymtl.ca/raw/my-raw-dataset. Or, if not git submodule, then [at least add git clone -b v1.0.whatever ssh://data.neuro.polymtl.ca/raw/my-raw-dataset rawdata to the curation script as kept under code/.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not refer to duke anywhere here. duke will eventually go away. These datasets should be self-contained.

I think the idea here is to keep provenance. For example, if researchers share non-curated data in DICOM format, they would go under duke/mri. Then, we would BIDSify it, put it under data.neuro.polymtl.ca and in order not to miss the provenance of, let's say, sub-001 (which corresponds to e.g: MS032_FLASH_T2w-bottom-sag_FA23deg.dcm).

Copy link
Contributor

@kousu kousu Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duke is a bad place to keep provenance because it is editable. It offers a regular filesystem, without version control, with data named by a network URL which will not last forever. Who is going to take care of duke when Jean-Sebastien moves on? Or when the lab grows into a centre and you aren't paying attention to these details anymore?

We could keep DICOMs under git too. What if we kept git+ssh://data.neuro.polymtl.ca/raw/ for DICOMs, and tracked provenance by writing down in each BIDS repo the name of the source DICOM repo and its commit hash. Then if necessary these datasets can be moved around, archived, and unarchived into the future without breaking provenance.

README.md Outdated
Comment on lines 16 to 17
└── code
└── <script_you_used_for_dataset_curation>.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A suggestion: I'm learning that BIDS allows including the original data files in whatever format under a sourcedata/ directory. And since we use git-annex, a simple git clone wouldn't download these large files unless they're explicitly requested with a git annex get sourcedata/. Would this be a good workflow to consider?

Suggested change
└── code
└── <script_you_used_for_dataset_curation>.py
└── sourcedata
└── <DICOM files or whatever>
└── code
└── <script_you_used_for_dataset_curation>.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, if we want to use git submodule instead of git annex, this workflow could also be accomplished by storing the DICOMs in a different git repository on data.neuro.polymtl.ca, like data.neuro.polymtl.ca:raw/my_data_set, and making the sourcedata/ folder inside data.neuro.polymtl.ca:datasets/my_data_set be a git submodule. This would very nicely track provenance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some of our datasets, I agree this approach is attractive. For other datasets, there are additional complications (eg: source should not be accessible by all students, anonymization/defacing, one source being used for multiple derived BIDS datasets, etc.).

Copy link
Contributor

@kousu kousu Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote strongly for loose-coupling here!

Storing sourcedata/ in the same repo immediately runs afoul of anonymization rules. It also bloats downloads since while, yes, you can use git annex get sub-001 sub-002 sub-003 ..., most people will do git annex get .. People might reasonably do git annex get derivatives/ or git annex get sub-001, but listing every subject folder just to exclude sourcedata/ is not something we can ask anyone to do. I am pretty confident git-annex does not have any way of excluding folders like git annex get !sourcedata/. Also just in principle, sourcedata feels like a different concept than BIDS data. Storing source data in a separate repo with a pointer is tidier.

I am doubtful git submodule is a good choice, just because UI-wise it is difficult, but having a process that record a URL to upstream datasets (or code) is important. I have had this issue open for a while to discuss what actual technology we can standardize on with the least friction. And eventually we'll want to write/find a linter that periodically scans all the data we have to detect broken provenance links.

(btw, @mguaypaq, we can combine git-annex and git submodule -- that's exactly what Datalad recommends. They're not in opposition. Even if they both give ways to reduce downloads)

README.md Outdated
- `data_id` - subject ID used to locate the unprocessed data as under `duke/mri/`
- `institution_id` - institution ID used to locate the unprocessed data as under `duke/mri/`
- `institution` - human-friendly institution name
- others - if available, include also demographic characteristics such as `age`, `sex`, `height`, `weight`, `researcher`, and [additional columns](https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#participants-file).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several times now I've seen these extra columns cause bids-validator to fail. But that doesn't mean we should remove the columns! It just means that the participants.json file needs more entries. Could you add an example of adding one of these column descriptions to the participants.json section below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I added an explanation.

Suggested change
- others - if available, include also demographic characteristics such as `age`, `sex`, `height`, `weight`, `researcher`, and [additional columns](https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#participants-file).
- others - if available, include also demographic characteristics such as `age`, `sex`, `height`, `weight`, `researcher`, and [additional columns](https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#participants-file).
❗️If you use some extra column (such as `age`, `sex`, etc), you have to include an appropriate entry also to the `participants.json` file. Otherwise, `bids-validator` fails.

@kousu
Copy link
Contributor

kousu commented Dec 8, 2022 via email

@valosekj
Copy link
Member Author

We went through this PR together with @kousu today. We reviewed it, made some changes, and moved the content to the intranet wiki:
https://intranet.neuro.polymtl.ca/data/dataset-curation.html

This PR adds just a link to the intranet.

@valosekj valosekj marked this pull request as ready for review December 13, 2022 21:54
Copy link
Contributor

@kousu kousu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting your head together with me on this today.

@kousu kousu merged commit 39f274a into master Dec 13, 2022
@kousu kousu deleted the jv/add_BIDS_files_to_README branch December 13, 2022 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants