[ENH][BEP028] Specification update for BEP028 BIDS-Prov #2099
for more information, see https://pre-commit.ci
some initial notes
!!! bug
TODO: Environment not currently defined in the context
what is to be done about that?
should be added.
Is there an IRI from PROV-O or from any other ontology that we should map `Environment` to?
`Environment` is a `prov:Entity`
I think we may want to subtype prov:Entity or at least give an easy solution to differentiate Environment from Input files (which are also represented as entities)...
if we do this, i think we go back to the ontology development work that i think we did not end up finishing. perhaps reuse the nidm classes and properties?
I can look into NIDM, but unfortunately I don't think we had a term for that. If we stick to `prov:Entity` for Environment, then maybe input files and environments should have different sets of REQUIRED fields. I really think it should be possible for a machine to automatically know if something is an input file versus an environment...
Just realized now that actually in BIDS-Prov, it's easy to differentiate as "Environments" and "Entities" arrays as they are defined using different keywords (although in RDF they both map to prov:Entity) so I think we can keep as is.
Sorry for the noise!
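To illustrate the point, here is a minimal sketch of how a machine could tell the two apart from the array keyword alone. Only the `Environments` and `Entities` keywords come from this thread; the record contents, the `urn:env-1` identifier and the helper name `kind_by_id` are hypothetical.

```python
import json

# Hypothetical BIDS-Prov fragment; only the two top-level keywords
# ("Environments", "Entities") are taken from the discussion above.
doc = json.loads("""
{
  "Environments": [{"Id": "urn:env-1", "Label": "container"}],
  "Entities": [{"Id": "bids::sub-01/func/sub-01_task-tonecounting_bold.nii"}]
}
""")


def kind_by_id(prov_doc):
    """Map each record Id to the keyword of the array it was declared in."""
    kinds = {}
    for keyword in ("Environments", "Entities"):
        for record in prov_doc.get(keyword, []):
            kinds[record["Id"]] = keyword
    return kinds


kinds = kind_by_id(doc)
```

The distinction is only available at the JSON level; once both arrays are mapped to `prov:Entity` in RDF, it would have to be carried some other way.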
}
) }}

The `dataset_description.json` file of a BIDS-Derivatives dataset MUST include the following key to describe provenance:
this is from "common-principles", so would it be better to refer to it?
Do you suggest removing all metadata tables from the Provenance at dataset level section because these metadata are already listed in `src/modality-agnostic-files/dataset-description.json`?
This is not duplication, the suggestion is to move all the provenance-related info (including how to write out GeneratedBy) from common-principles into Provenance. See #2099 (comment)
looks very nice @bclenet - left some comments which should be very lightweight to address.
{
"Id": "bids::sub-01/func/sub-01_task-tonecounting_bold.nii",
"Label": "sub-01_task-tonecounting_bold.nii",
"AtLocation": "sub-01/func/sub-01_task-tonecounting_bold.nii",
any reason not to use the same uri as `Id` here?
For the `Label`, I guess we want to use the name of the file (without directory path) so that it is more human readable.
About `AtLocation`, we describe this metadata as:
"For input files, this is the relative path to the file on disk."
Should we be more specific in this definition?
i meant to use the bids uri here as well.
Discussed w/ Boris today. We could use the uri and change the spec to say "this is the path to the file on disk." (instead of "relative path").
But... Now that the id is in fact the path to the file in bids -- maybe we should consider removing `AtLocation` from BIDS-Prov as this only creates duplication of information. What do you think?
the unfortunate bit here is that the bids uri is not persistent. we do need to add something to the id for uniqueness. otherwise two datasets loaded into a graph could have the same id for a same named file (e.g. sub-01/fmri/sub-01_task-rest_bold.nii.gz could be in many datasets). i don't think their id's should be the same. so i would suggest keeping the bids uri for atlocation and then consider how to get uniqueness to the ids here and elsewhere.
for a different project we have been constructing the id using a function of the checksum of the metadata associated with the node. this allows us to generate a graph without creating new nodes each time we run the process. and if the metadata changes (whether in keys and/or their values) a new node id is generated.
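A minimal sketch of that idea, not the actual function used in the other project: it assumes the id is a SHA-256 digest of a canonical JSON serialization of the node's metadata. The `urn:sha256:` prefix, the `Dataset` key and the helper name `node_id` are all hypothetical.

```python
import hashlib
import json


def node_id(metadata, prefix="urn:sha256:"):
    """Derive a deterministic id from a node's metadata (hypothetical scheme)."""
    # Sorting keys and fixing the separators makes the serialization canonical,
    # so the same keys and values always hash to the same id.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return prefix + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same metadata in a different key order: same id, so no new node is created.
a = node_id({"Label": "sub-01_task-rest_bold.nii.gz", "Dataset": "ds-A"})
b = node_id({"Dataset": "ds-A", "Label": "sub-01_task-rest_bold.nii.gz"})
# Same file name, different metadata value: a different id.
c = node_id({"Label": "sub-01_task-rest_bold.nii.gz", "Dataset": "ds-B"})
```

With such a scheme, two identically named files from different datasets only collide if all of their hashed metadata is identical, so the choice of which fields go into the hash matters.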
@satra - could we get uniqueness using the dataset name in the BIDS URI (e.g. `bids:ds109:sub-01/fmri/sub-01_task-rest_bold.nii.gz` is unique)? I think this may be a BIDS issue more than a BIDS-Prov one.
Note: we might need another online meeting to discuss those remaining bits. Let me know how you'd like to proceed.
unfortunately, the dataset name is technically a `key` from the dataset_description (see: https://bids-specification.readthedocs.io/en/stable/common-principles.html#examples_1), so it is not guaranteed to be unique across datasets. bids intentionally decoupled uniqueness and persistence from the `key` name, and that section on bids uris notes that while certain aspects of a scheme uri are not being used, they could be used in the future (see: https://bids-specification.readthedocs.io/en/stable/common-principles.html#future-statement).
my suggestion would be to perhaps do what is being done for uniqueness of other ids.
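To make the non-uniqueness concrete, here is a small sketch that assumes the `bids:<dataset-name>:<relative-path>` form and resolves the name through a `DatasetLinks`-style mapping. The helper names and the two `file://` locations are hypothetical.

```python
def parse_bids_uri(uri):
    """Split 'bids:<dataset-name>:<relative-path>' into (name, path)."""
    scheme, dataset_name, relative_path = uri.split(":", 2)
    if scheme != "bids":
        raise ValueError(f"not a BIDS URI: {uri!r}")
    return dataset_name, relative_path


def resolve(uri, dataset_links):
    """Resolve a BIDS URI against one dataset's DatasetLinks-style mapping."""
    name, path = parse_bids_uri(uri)
    if not name:  # an empty name refers to the current dataset
        return path
    return dataset_links[name].rstrip("/") + "/" + path


uri = "bids:ds109:sub-01/fmri/sub-01_task-rest_bold.nii.gz"
# Two hypothetical datasets that both happen to use the key "ds109".
links_a = {"ds109": "file:///data/study-a/ds109"}
links_b = {"ds109": "file:///data/study-b/other"}
target_a = resolve(uri, links_a)
target_b = resolve(uri, links_b)
```

The same URI string resolves to two different files here, which is why the name alone cannot serve as a globally unique id.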
## Minimal example

Here is a comprehensive example that considers the following dataset:
for these examples, do consider the changes about study-level organization and what that means for prov.
Hi @satra! I discussed this with Boris. Can you give us some more insight into what you mean by "study-level organization"? Is this a change in the BIDS spec? Or do you mean a re-organization of the files within a BIDS folder for a given study?
here is the relevant doc: https://bids-specification.readthedocs.io/en/stable/common-principles.html#study-dataset this was merged with the recent release.
This PR will make it possible to describe the provenance of any BIDS dataset (whether in the `dataset_description.json`, in the `prov` directory, or in a sidecar). This applies to a study dataset as well as to its nested BIDS datasets. Therefore, @cmaumet and I think there is nothing specific to add about study-level organization inside the Provenance section of the specification.
just to clarify (and perhaps to make clear):
- so one could add any provenance to the study level `dataset_description.json`.
- any use of bids uris would of course have to reflect cross-dataset links (https://bids-specification.readthedocs.io/en/stable/common-principles.html#examples_1)
- the cross-dataset link for bids-uri could come up in individual datasets as well, where a dataset is used as input and the outputs are some derivatives in a different dataset. in this scenario, would we recommend that a study level dataset_description is required?
Next batch of comments and recommendations
https://github.com/bids-standard/bids-specification/blob/master/macros_doc.md
-->
{{ MACROS___make_subobject_table("metadata.GeneratedBy.items") }}
All these removals of text and headings leave examples "hanging". If all of this is to move to the provenance section, the surgery might need to be more thorough and include an explicit reference to that section.
There is a link to the Provenance section in the description of `GeneratedBy` to cover the fact that this table was moved to the Provenance section.
Discussed w/ @bclenet today. The description of `GeneratedBy` was moved from "Modality-agnostic files/Dataset description" into "Modality-agnostic files/Provenance/Provenance at dataset level". The example itself is a `dataset_description.json` file and as such it makes sense to keep it in "Modality-agnostic files/Dataset description", where the overall structure of that file is described. So we are not sure why you say the examples are hanging. Let us know if that answers your comment or if further updates are needed.
### Principles for encoding provenance in BIDS

- Provenance information SHOULD be included in a BIDS dataset when possible.
I believe that `GeneratedBy` is just MAY, so this should also be MAY, to be consistent.
Suggested change, from:
- Provenance information SHOULD be included in a BIDS dataset when possible.
to:
- Provenance information MAY be included in a BIDS dataset when possible.
`GeneratedBy` is RECOMMENDED in BIDS datasets and REQUIRED in BIDS derivative datasets.
Adding the sources in BIDS-Spec:
- for BIDS datasets: https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files/dataset-description.html#dataset_descriptionjson
- for derivatives: https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files/dataset-description.html#derived-dataset-and-pipeline-description
So this seems like we can keep SHOULD?
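For reference, the two requirement levels quoted above could be checked along these lines. This is only a sketch, not the validator's logic: the helper name `check_generated_by` and the returned messages are hypothetical, and it assumes `DatasetType` is treated as `"raw"` when absent.

```python
def check_generated_by(description):
    """Return (level, message) for a parsed dataset_description.json, or None."""
    if "GeneratedBy" in description:
        return None
    # Assumption: a missing DatasetType is treated as "raw".
    if description.get("DatasetType", "raw") == "derivative":
        return ("error", "GeneratedBy is REQUIRED in derivative datasets")
    return ("warning", "GeneratedBy is RECOMMENDED in BIDS datasets")


raw_result = check_generated_by({"Name": "example"})
deriv_result = check_generated_by({"Name": "example", "DatasetType": "derivative"})
ok_result = check_generated_by(
    {"Name": "example", "GeneratedBy": [{"Name": "some-pipeline"}]}
)
```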
## Provenance at dataset level

Provenance metadata MAY be stored inside the `dataset_description.json` of any BIDS dataset (or BIDS-Derivatives dataset) it applies to.
This metadata describes the provenance of the whole dataset.
Similarly to `SidecarGeneratedBy` vs `GeneratedBy`, there is ambiguity about the scope, i.e. does it apply to all files? What if a file had its own records? We could instruct the same need for a copy but it might be too hard to keep consent etc...
We should spell it out one way or another here
I'm not sure I fully understand what you mean.
But we can leave Provenance at dataset level as a (minimal) option to describe the provenance of the dataset as an Entity.
Of course this most probably contains partial information, but this does not prevent more precise descriptions from being included inside sidecars.
Hi @effigies, here are quick questions about the schema modifications in this PR.
I think this is due to the fact that the Also, files inside derivative datasets lead to NOT_INCLUDED / ALL_FILENAME_RULES_HAVE_ISSUES / FILENAME_MISMATCH / ENTITY_WITH_NO_LABEL errors because their names do not conform to BIDS naming. It seems to be contradictory to this part of the spec Could you please give your thoughts about these ? Let me know if I should post these anywhere else. Thanks,
@satra @yarikoptic: a point on which we'd love your input w/ @bclenet. It seems to us a bit problematic to use the word "Entity" in the Provenance spec for BIDS since entity in BIDS is already defined (and is different from the meaning we give in BIDS-Prov). We could easily use a different word (that would still be mapped to prov:Entity in the context file), but which word... Just found out about "Transput" which seems to be a blanket term for both Inputs and Outputs, would that work for you both? Or any other suggestions? Thanks!
This is a work-in-progress PR proposing a specification update for BEP028 BIDS-Prov.
- [ ] being proofread
- [ ] validator error: `/prov/*` `NOT_INCLUDED`
- [ ] validator error: `/prov/*.json` `SIDECAR_WITHOUT_DATAFILE`
- [ ] validator error: derivative files are listed as `NOT_INCLUDED` / `ALL_FILENAME_RULES_HAVE_ISSUES` / `FILENAME_MISMATCH` / `ENTITY_WITH_NO_LABEL`