Add details on which files should be included in BIDS dataset #189

valosekj · 2022-12-06T19:42:17Z

The PR adds details on which files should be included within each data.neuro.polymtl.ca/datasets/ BIDS dataset. Also, I proposed templates for these files.

So far, I have put everything within this repository's README. To view the README in a nice way, visit my_branch.
But maybe, we could move this information to the repo's Wiki section? And just put a link to Wiki to the README?

The PR is probably related also to one older PR.

Some useful links:

bids-starter-kit/README_template - this is probably much general, though
bids-starter-kit/derivatives
Where in a BIDS dataset should I put notes about individual MRI acqusitions?

README.md

kousu · 2022-12-06T22:28:40Z

README.md


 ## naming convention

 sub-<site><pathology>XXX

 example:
-sub-zurichDCM001
+sub-montrealDCM001


Is "DCM" from "dicom" or is it the name of a pathology?

Degenerative Cervical Myelopathy

This is a point for discussion-- do we want to include site (e.g., montreal) and pathology (e.g., DCM) to the filename? Such filename is still BIDS compliant and is very explicit (which is good --> no need to open the participants.tsv file to check the pathology and site).
On the other hand, for single-site or single-pathology datasets, this information will be redundant.

no need to systematically add this info in the file name (we do it for the INSPIRED study because this is how the data were originally organized, and it is 'easy' to inspect)-- however, this info always needs to be present in the participants.tsv file, because we will use it to filter our multiple DBs for creating deep learning models

Okay! Then, I'll upgrade the description as the following:

## naming convention sub-<institution><pathology>XXX example: sub-montrealDCM001 Note: use <institution> and <pathology> only if relevant (e.g., multi-institution dataset or dataset with different pathologies). However, always keep <institution> and <pathology> in the participants.tsv file!!!

kousu · 2022-12-06T22:33:03Z

README.md

----------------------------
+Dataset shared by: <NAME AND EMAIL>
+<IF THERE WAS EMAIL COMM>Email communication: <DATE OF EMAIL AND SUBJECT>
+<REPOSITORY OF PROJECT/MODEL, etc>Repository: https://github.com/<organization>/<repository_name>


I am wary of building in the assumption that datasets and analyses are 1:1. From that diagram I keep referencing on https://handbook.datalad.org/en/latest/basics/101-127-yoda.html, a key feature of reproducible science is that datasets can be reanalysed by other parties or reinterpreted, so that there is a tree of derived works.

If we write this line in, it strongly suggests there can only be a linear descent from each dataset. The clever lab member will see they can just delete this line from the README if they are intending to have their work reused widely, and the clever downstream user will see they can just ignore this line, but the confused and overworked will see this and assume each dataset can only be analysed once.

Agreed. Maybe we can also make it clearer that more than one repos can be listed (one per project).

kousu · 2022-12-06T22:42:36Z

README.md

-  "contact": "URL to forum/issue AND/OR name AND/OR anything relevant to trace back the image",
-  }
-}
+<IF THERE ARE UPDATES IN THE DATASET:><YYYY>-<MM>-<DD>: Added new data from <centre ABC>


Is a changelog part of the bids spec? git provides better changelogs.

If we do need to duplicate the changelog into the README, we should mandate that it's done via a script, the way we do e.g. for sct or like this common tool does it.

I would try to not put a changelog in the README at all. Let's encourage people to appreciate git. The work I am doing in https://github.com/neuropoly/computers/issues/167 will help! It will give us a nice accessible frontend to our datasets complete with their history a click away. You can see my demo at https://data.dev.neuropoly.org/neuropoly/spine-generic-single/:

I am planning to replace the command-line-only internal server ssh://data.neuro.polymtl.ca with this software, within a couple months. And I also want to put the same software on https://data.neuropoly.org for our public datasets.

Is a changelog part of the bids spec?

Yes it is. Good catch. So we should mention in the README that any update should be documented in the CHANGES file. And yes, agreed we should also leverage git.

CHANGES is part of the bids spec, but as an optional file (spec link). So I think it would be reasonable to omit the CHANGES file and include something like "please see git log for a history of changes" in README.md.

kousu · 2022-12-06T22:47:39Z

README.md


+This is an MRI dataset acquired in the context of the <XYZ> project. 
+It also contains manual segmentation of <MS lesions/tumors/etc> from <one/two/or more> expert raters. 
+Segmentation are located under the derivatives folder.


For the record, today we had a good long discussion where we contrasted BIDS's derivatives/ idea against Datalad's YODA idea. We realized they are sort of opposites: Datalad suggests derived data should be a separate repos with pointers back to their sources while BIDS says all derived data should live in the derivatives/ folder in a single repo.

@mguaypaq and I both lean towards the YODA model. BIDS's method is fine for a single researcher, but a researcher trying to collaborate, especially one trying to collaborate across borders, is going to struggle.

But I guess segmentations are a kind of data that's very specific and closely related to the source data. It would be awkward to keep them in a fully separate repo. So maybe segmentations get a pass. But in general I would discourage the use of the derivatives/ folder entirely.

But I guess segmentations are a kind of data that's very specific and closely related to the source data.

yes, precisely. In fact, in most cases, these are generated manually by radiologists, and as you rightly pointed out, are going in pair with the source data. I do agree though, that for derived (ie: product of source data going through some sort of analysis pipeline), the YODA model is more attractive.

kousu · 2022-12-06T22:57:17Z

README.md

+Convention for derivatives JSON metadata:
+
+{
+  "Author": "Firstname Lastname",


This is nitpicky but https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/

I'm not sure how to make this more clear though. "<commonly accepted academic name>" is too wordy.

Interesting point! Maybe we can ask for the author's ORCID or Google Scholar username?

we could but i wouldn't push this too much-- eg: if someone does not have an ORCID of Gscholar username they would be tempted to leave this empty, which would be worst than just putting their names.

kousu · 2022-12-06T23:00:49Z

README.md

+| sub-001 | HC | 001 | 01 | montreal |
+
+- `participant_id` - unique participant ID
+- `pathology` - pathology name; can take values listed in the [pathology column](https://docs.google.com/spreadsheets/d/1yjcA8Z0COn4OZxusIDHjStH2DpeXvscsj-aWE2X-_sg/edit?usp=sharing)


I'm unfamiliar with this part of our work. Do most of our scans come with pathologies? If there's no pathology, do we leave it blank? Or write "n/a" (bids-validator prefers "n/a")?

we have HC for Healthy Control

Good point, maybe I could make the line more explicit:

Suggested change

- `pathology` - pathology name; can take values listed in the [pathology column](https://docs.google.com/spreadsheets/d/1yjcA8Z0COn4OZxusIDHjStH2DpeXvscsj-aWE2X-_sg/edit?usp=sharing)

- `pathology` - pathology name; can take values listed in the [pathology column](https://docs.google.com/spreadsheets/d/1yjcA8Z0COn4OZxusIDHjStH2DpeXvscsj-aWE2X-_sg/edit?usp=sharing). If you are working with healthy control, use `HC`.

README.md

kousu · 2022-12-06T23:10:37Z

README.md

+- `data_id` -  subject ID used to locate the unprocessed data as under `duke/mri/`
+- `institution_id` - institution ID used to locate the unprocessed data as under `duke/mri/`


We should not refer to duke anywhere here. duke will eventually go away. These datasets should be self-contained.

smb://duke.neuro.polymtl.ca/, even the more restricted smb://duke.neuro.polymtl.ca/mri/ subset, is not organized well enough for this ID scheme to work in my opinion. Here's a random sampling of some of the subject IDs from it:

p115628@joplin:~$ find /mnt/duke/mri/ -mindepth 2 -maxdepth 2 -type d | grep -v ukbiobank | sort -R | head /mnt/duke/mri/hcp/857263 /mnt/duke/mri/hcp/597869 /mnt/duke/mri/ucsf/Myelopathy_Data /mnt/duke/mri/hcp/679770 /mnt/duke/mri/sct-users/20150723_SpinalCord_1 /mnt/duke/mri/hcp/107018 /mnt/duke/mri/hcp/114621 /mnt/duke/mri/hcp/163331 /mnt/duke/mri/ucl/spine-generic /mnt/duke/mri/biospective/SpinalCord 3 2015-07-23

So would that be hcp-107018 might be a sensible enough ID, but ucsf-Myelopathy_Data would be silly.

Why do we need to keep this mapping at all? Isn't that kind of sketchy? In an anonymized dataset, doesn't keeping that risk breaking anonymization?

Where we must keep the mapping, the YODA plan would say to upload a separate git-annex repo to the same server and then use e.g. git submodule add rawdata ssh://data.neuro.polymtl.ca/raw/my-raw-dataset. Or, if not git submodule, then [at least add git clone -b v1.0.whatever ssh://data.neuro.polymtl.ca/raw/my-raw-dataset rawdata to the curation script as kept under code/.

We should not refer to duke anywhere here. duke will eventually go away. These datasets should be self-contained.

I think the idea here is to keep provenance. For example, if researchers share non-curated data in DICOM format, they would go under duke/mri. Then, we would BIDSify it, put it under data.neuro.polymtl.ca and in order not to miss the provenance of, let's say, sub-001 (which corresponds to e.g: MS032_FLASH_T2w-bottom-sag_FA23deg.dcm).

duke is a bad place to keep provenance because it is editable. It offers a regular filesystem, without version control, with data named by a network URL which will not last forever. Who is going to take care of duke when Jean-Sebastien moves on? Or when the lab grows into a centre and you aren't paying attention to these details anymore?

We could keep DICOMs under git too. What if we kept git+ssh://data.neuro.polymtl.ca/raw/ for DICOMs, and tracked provenance by writing down in each BIDS repo the name of the source DICOM repo and its commit hash. Then if necessary these datasets can be moved around, archived, and unarchived into the future without breaking provenance.

Co-authored-by: Nick <[email protected]>

mguaypaq · 2022-12-08T18:26:49Z

README.md

+└── code
+    └── <script_you_used_for_dataset_curation>.py


A suggestion: I'm learning that BIDS allows including the original data files in whatever format under a sourcedata/ directory. And since we use git-annex, a simple git clone wouldn't download these large files unless they're explicitly requested with a git annex get sourcedata/. Would this be a good workflow to consider?

Suggested change

└── code

└── <script_you_used_for_dataset_curation>.py

└── sourcedata

└── <DICOM files or whatever>

└── code

└── <script_you_used_for_dataset_curation>.py

Alternatively, if we want to use git submodule instead of git annex, this workflow could also be accomplished by storing the DICOMs in a different git repository on data.neuro.polymtl.ca, like data.neuro.polymtl.ca:raw/my_data_set, and making the sourcedata/ folder inside data.neuro.polymtl.ca:datasets/my_data_set be a git submodule. This would very nicely track provenance.

For some of our datasets, I agree this approach is attractive. For other datasets, there are additional complications (eg: source should not be accessible by all students, anonymization/defacing, one source being used for multiple derived BIDS datasets, etc.).

I vote strongly for loose-coupling here!

Storing sourcedata/ in the same repo immediately runs afoul of anonymization rules. It also bloats downloads since while, yes, you can use git annex get sub-001 sub-002 sub-003 ..., most people will do git annex get .. People might reasonably do git annex get derivatives/ or git annex get sub-001, but listing every subject folder just to exclude sourcedata/ is not something we can ask anyone to do. I am pretty confident git-annex does not have any way of excluding folders like git annex get !sourcedata/. Also just in principle, sourcedata feels like a different concept than BIDS data. Storing source data in a separate repo with a pointer is tidier.

I am doubtful git submodule is a good choice, just because UI-wise it is difficult, but having a process that record a URL to upstream datasets (or code) is important. I have had this issue open for a while to discuss what actual technology we can standardize on with the least friction. And eventually we'll want to write/find a linter that periodically scans all the data we have to detect broken provenance links.

(btw, @mguaypaq, we can combine git-annex and git submodule -- that's exactly what Datalad recommends. They're not in opposition. Even if they both give ways to reduce downloads)

mguaypaq · 2022-12-08T18:36:12Z

README.md

+- `data_id` -  subject ID used to locate the unprocessed data as under `duke/mri/`
+- `institution_id` - institution ID used to locate the unprocessed data as under `duke/mri/`
+- `institution` - human-friendly institution name
+- others - if available, include also demographic characteristics such as `age`, `sex`, `height`, `weight`, `researcher`, and [additional columns](https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#participants-file).


Several times now I've seen these extra columns cause bids-validator to fail. But that doesn't mean we should remove the columns! It just means that the participants.json file needs more entries. Could you add an example of adding one of these column descriptions to the participants.json section below?

Good point! I added an explanation.

Suggested change

- others - if available, include also demographic characteristics such as `age`, `sex`, `height`, `weight`, `researcher`, and [additional columns](https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#participants-file).

- others - if available, include also demographic characteristics such as `age`, `sex`, `height`, `weight`, `researcher`, and [additional columns](https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#participants-file).

❗️If you use some extra column (such as `age`, `sex`, etc), you have to include an appropriate entry also to the `participants.json` file. Otherwise, `bids-validator` fails.

kousu · 2022-12-08T18:38:05Z

I'm very in favour of this over putting data directly in sourcedata/. sourcedata/ is almost never anonymized, for one thing, and will bloat the repository with data people will almost never use. Yes, git annex can skip it, but `git annex get` defaults to getting everything and has no (obvious?) way to exclude a folder so the only option becomes laboriously typing out each subject name one by one. But submodules -- or something equivalent -- would default the other way.

README.md

valosekj · 2022-12-13T21:53:51Z

We went through this PR together with @kousu today. We reviewed it, made some changes, and moved the content to the intranet wiki:
https://intranet.neuro.polymtl.ca/data/dataset-curation.html

This PR adds just a link to the intranet.

kousu

Thank you for putting your head together with me on this today.

valosekj added 2 commits December 6, 2022 14:41

Add BIDS files to README

5f8537a

Add additional column examples for participants.tsv

db43532

valosekj changed the title ~~Add BIDS files to README~~ Add details on which files should be included in BIDS dataset Dec 6, 2022

kousu reviewed Dec 6, 2022

View reviewed changes