Contributor

@bclenet bclenet commented Apr 10, 2025

This is a work in progress PR proposing a specification update for BEP028 BIDS-Prov.

bclenet and others added 30 commits March 18, 2025 11:06
@effigies effigies added this to the 1.11.0 milestone Aug 27, 2025
Collaborator

@yarikoptic yarikoptic left a comment

some initial notes

Comment on lines 379 to 380
!!! bug
TODO: Environment not currently defined in the context
Collaborator

what is to be done about that?

Collaborator

should be added.

Contributor Author

Is there an IRI from PROV-O, or from any other ontology, that we should map Environment to?

Collaborator

Environment is a prov:Entity

Collaborator

I think we may want to subtype prov:Entity or at least give an easy solution to differentiate Environment from Input files (which are also represented as entities)...

Collaborator

if we do this, i think we go back to the ontology development work that i think we did not end up finishing. perhaps reuse the nidm classes and properties?

Collaborator

@cmaumet cmaumet Sep 16, 2025

I can look into NIDM, but unfortunately I don't think we had a term for that. If we stick to prov:Entity for Environment, then maybe input files and environments should have different sets of REQUIRED fields. I really think it should be possible for a machine to automatically know whether something is an input file or an environment...

Collaborator

Just realized that in BIDS-Prov it's actually easy to differentiate them: the "Environments" and "Entities" arrays are defined using different keywords (although in RDF they both map to prov:Entity), so I think we can keep this as is.

Sorry for the noise!
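
To illustrate the point above, here is a minimal Python sketch of how a machine could tell environments from files using only the array keyword. The document layout, the `classify` helper, and all field values other than the `Id` and `Label` taken from the example in this PR are assumptions made for illustration, not the BIDS-Prov schema itself.

```python
from typing import Iterator

# Hypothetical, simplified provenance document. Only the two array keywords
# named in the comment above are assumed; the environment record is invented.
prov_doc = {
    "Entities": [
        {"Id": "bids::sub-01/func/sub-01_task-tonecounting_bold.nii",
         "Label": "sub-01_task-tonecounting_bold.nii"},
    ],
    "Environments": [
        {"Id": "urn:example:env-1", "Label": "example container image"},
    ],
}

def classify(doc: dict) -> Iterator[tuple[str, str]]:
    """Yield (kind, Id) pairs, using the array keyword to tell kinds apart."""
    for keyword, kind in (("Entities", "file"), ("Environments", "environment")):
        for record in doc.get(keyword, []):
            yield kind, record["Id"]

# Map each record Id to the kind inferred from the array it was found in.
kinds = {record_id: kind for kind, record_id in classify(prov_doc)}
```

Both arrays would still map to prov:Entity once converted to RDF, so this distinction is only available at the JSON level.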

}
) }}

The `dataset_description.json` file of a BIDS-Derivatives dataset MUST include the following key to describe provenance:
Collaborator

this is from "common-principles", so would it be better to refer to it?

Contributor Author

Are you suggesting that we remove all metadata tables from the "Provenance at dataset level" section, since these metadata are already listed in src/modality-agnostic-files/dataset-description.json?

Collaborator

@cmaumet cmaumet Sep 16, 2025

This is not duplication, the suggestion is to move all the provenance-related info (including how to write out GeneratedBy) from common-principles into Provenance. See #2099 (comment)

Collaborator

@satra satra left a comment

looks very nice @bclenet - left some comments which should be very lightweight to address.

{
"Id": "bids::sub-01/func/sub-01_task-tonecounting_bold.nii",
"Label": "sub-01_task-tonecounting_bold.nii",
"AtLocation": "sub-01/func/sub-01_task-tonecounting_bold.nii",
Collaborator

any reason not to use the same uri as Id here?

Contributor Author

@bclenet bclenet Sep 3, 2025

For the Label, I guess we want to use the name of the file (without the directory path) so that it is more human-readable.
As for AtLocation, we describe this metadata as:

For input files, this is the relative path to the file on disk.

Should we be more specific in this definition?

Collaborator

i meant to use the bids uri here as well.

Collaborator

Discussed w/ Boris today. We could use uri and change the spec to say "this is the path to the file on disk." (instead of "relative path").

But... Now that the id is in fact the path to the file in bids -- maybe we should consider removing AtLocation from BIDS-Prov as this only creates duplication of information. What do you think?

Collaborator

the unfortunate bit here is that the bids uri is not persistent. we do need to add something to the id for uniqueness. otherwise two datasets loaded into a graph could have the same id for an identically named file (e.g. sub-01/fmri/sub-01_task-rest_bold.nii.gz could be in many datasets). i don't think their ids should be the same. so i would suggest keeping the bids uri for atlocation and then considering how to make the ids unique here and elsewhere.

for a different project we have been constructing the id using a function of the checksum of the metadata associated with the node. this allows us to generate a graph without creating new nodes each time we run the process. and if the metadata changes (whether in keys and/or their values) a new node id is generated.
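
The checksum-based approach described above could look roughly like the following Python sketch. It is not the implementation used in that other project: the `node_id` helper, the `urn:sha256:` prefix, the choice of SHA-256, and the `Dataset` key are all assumptions made for illustration.

```python
import hashlib
import json

def node_id(metadata: dict, prefix: str = "urn:sha256:") -> str:
    """Derive a deterministic identifier from a node's metadata."""
    # Canonical serialization: sorted keys and fixed separators, so that the
    # order in which keys were written does not change the digest.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return prefix + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same metadata in a different key order gives the same id ...
a = node_id({"Label": "sub-01_task-rest_bold.nii.gz", "Dataset": "ds-A"})
b = node_id({"Dataset": "ds-A", "Label": "sub-01_task-rest_bold.nii.gz"})
# ... while any change in keys or values gives a new id.
c = node_id({"Label": "sub-01_task-rest_bold.nii.gz", "Dataset": "ds-B"})
```

Re-running the process on unchanged metadata regenerates the same node, which is the property mentioned in the comment; what goes into the hashed metadata is the design decision that determines uniqueness across datasets.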

Collaborator

@satra - could we get uniqueness using the dataset name in the BIDS URI (e.g. bids:ds109:sub-01/fmri/sub-01_task-rest_bold.nii.gz is unique)? I think this may be a BIDS issue more than a BIDS-Prov one.

Note: we might need another online meeting to discuss those remaining bits? Let me know how you'd like to proceed.

Collaborator

unfortunately, the dataset name is technically a key from the dataset_description (see: https://bids-specification.readthedocs.io/en/stable/common-principles.html#examples_1), so not guaranteed to be unique across datasets. bids intentionally decoupled uniqueness or persistence from the key name and in that section on bids uri notes that while certain aspects of a scheme uri are not being used, it could be used in the future (see: https://bids-specification.readthedocs.io/en/stable/common-principles.html#future-statement).

my suggestion would be to perhaps do what is being done for uniqueness of other ids.
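
The collision discussed in this thread can be reproduced with a short Python sketch. The `bids_uri` helper is hypothetical; it only assembles the `bids:[<dataset-name>]:<relative-path>` form described in the common principles, with an empty name referring to the current dataset.

```python
def bids_uri(relative_path: str, dataset_name: str = "") -> str:
    """Build a BIDS URI of the form bids:[<dataset-name>]:<relative-path>."""
    return f"bids:{dataset_name}:{relative_path}"

path = "sub-01/func/sub-01_task-tonecounting_bold.nii"

# Two unrelated datasets that each describe their own file at the same
# relative path end up with the same identifier.
id_in_dataset_a = bids_uri(path)
id_in_dataset_b = bids_uri(path)

# Loading both into one graph merges the two nodes into a single one.
merged_graph_ids = {id_in_dataset_a, id_in_dataset_b}
```

Adding a dataset name does not remove the problem by itself, because that name is only a key chosen locally in a `dataset_description.json`.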


## Minimal example

Here is a comprehensive example that considers the following dataset:
Collaborator

for these examples, do consider the changes about study-level organization and what that means for prov.

Collaborator

@cmaumet cmaumet Sep 9, 2025

Hi @satra! I discussed this with Boris. Can you give us some more insight into what you mean by "study-level organization"? Is this a change in the BIDS spec? Or do you mean a reorganization of the files within a BIDS folder for a given study?

Collaborator

here is the relevant doc: https://bids-specification.readthedocs.io/en/stable/common-principles.html#study-dataset (this was merged with the recent release).

Contributor Author

This PR will make it possible to describe the provenance of any BIDS dataset (whether in the dataset_description.json, in the prov directory, or in a sidecar). This applies to a study dataset as well as to its nested BIDS datasets. Therefore, @cmaumet and I think there is nothing specific to add about the study level inside the Provenance section of the specification.

Collaborator

just to clarify (and perhaps to make clear):

  1. so one could add any provenance to the study level dataset_description.json.
  2. any use of bids uris would of course have to reflect cross-dataset links (https://bids-specification.readthedocs.io/en/stable/common-principles.html#examples_1)
  3. the cross-dataset link for bids-uri could come up in individual datasets as well, where a dataset is used as input and the outputs are some derivatives in a different dataset. in this scenario, would we recommend that a study level dataset_description is required?
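
The cross-dataset case in points 2 and 3 can be sketched in Python as a lookup through `DatasetLinks`, the `dataset_description.json` field that maps dataset names used in BIDS URIs to locations. The `resolve_bids_uri` helper, the `raw` link name, and its target are hypothetical values chosen for illustration.

```python
def resolve_bids_uri(uri: str, dataset_links: dict[str, str]) -> tuple[str, str]:
    """Split a BIDS URI and look up the base of the dataset it points to.

    Returns (base, relative_path); base is "" for the current dataset.
    """
    scheme, name, relative_path = uri.split(":", 2)
    if scheme != "bids":
        raise ValueError(f"not a BIDS URI: {uri}")
    if name == "":
        # Empty dataset name: the path is relative to the current dataset.
        return "", relative_path
    if name not in dataset_links:
        raise KeyError(f"dataset name {name!r} missing from DatasetLinks")
    return dataset_links[name], relative_path

# Hypothetical DatasetLinks content of a derivative dataset_description.json.
dataset_links = {"raw": "file:///data/raw-dataset"}

external = resolve_bids_uri(
    "bids:raw:sub-01/func/sub-01_task-tonecounting_bold.nii", dataset_links
)
local = resolve_bids_uri(
    "bids::sub-01/func/sub-01_task-tonecounting_bold.nii", dataset_links
)
```

In this sketch a URI naming a dataset that is absent from `DatasetLinks` fails loudly, which is one way to surface the question raised in point 3 about where such links must be declared.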

Collaborator

@yarikoptic yarikoptic left a comment

Next batch of comments and recommendations

https://github.com/bids-standard/bids-specification/blob/master/macros_doc.md
-->
{{ MACROS___make_subobject_table("metadata.GeneratedBy.items") }}

Collaborator

All these removals of text and headings leave the examples "hanging". If all of this is to move to the provenance section, the surgery might need to be more thorough and include an explicit reference to that section.

Contributor Author

There is a link to the Provenance section in the description of GeneratedBy to cover the fact that this table was moved to the Provenance section.

Collaborator

@cmaumet cmaumet Sep 16, 2025

Discussed w/ @bclenet today. The description of GeneratedBy was moved from "Modality-agnostic files/Dataset description" into "Modality-agnostic files/Provenance/Provenance at dataset level". The example itself is a dataset_description.json file, and as such it makes sense to keep it in "Modality-agnostic files/Dataset description", where the overall structure of that file is described. So we are not sure why you say the examples are left hanging. Let us know if that answers your comment or if further updates are needed.


### Principles for encoding provenance in BIDS

- Provenance information SHOULD be included in a BIDS dataset when possible.
Collaborator

I believe that GeneratedBy is just MAY, so this should be MAY as well, to be consistent

Suggested change
Before: - Provenance information SHOULD be included in a BIDS dataset when possible.
After: - Provenance information MAY be included in a BIDS dataset when possible.

Contributor Author

GeneratedBy is RECOMMENDED in BIDS datasets and REQUIRED in BIDS derivative datasets.
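
As a rough illustration of the requirement levels stated in this reply, the following Python sketch checks a `dataset_description.json` dictionary. The `check_generated_by` helper and its message strings are hypothetical and are not part of the BIDS validator; the sketch assumes that `DatasetType` defaults to `raw` and that each `GeneratedBy` item needs a `Name`.

```python
def check_generated_by(description: dict) -> list[str]:
    """Return human-readable issues about GeneratedBy for one dataset."""
    issues = []
    # Assumption: a missing DatasetType means a raw dataset.
    is_derivative = description.get("DatasetType", "raw") == "derivative"
    generated_by = description.get("GeneratedBy")
    if generated_by is None:
        if is_derivative:
            issues.append("error: GeneratedBy is REQUIRED in derivative datasets")
        else:
            issues.append("warning: GeneratedBy is RECOMMENDED")
        return issues
    # Assumption: every GeneratedBy item must carry a Name.
    for index, item in enumerate(generated_by):
        if "Name" not in item:
            issues.append(f"error: GeneratedBy[{index}] has no Name")
    return issues

raw_issues = check_generated_by({"Name": "example", "BIDSVersion": "1.10.0"})
derivative_issues = check_generated_by(
    {"Name": "example", "DatasetType": "derivative"}
)
ok_issues = check_generated_by(
    {"DatasetType": "derivative", "GeneratedBy": [{"Name": "example-pipeline"}]}
)
```

Under this reading, a missing `GeneratedBy` is only a warning for a raw dataset but an error for a derivative one, which is the distinction the wording of the principle has to stay consistent with.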


## Provenance at dataset level

Provenance metadata MAY be stored inside the `dataset_description.json` of any BIDS dataset (or BIDS-Derivatives dataset) it applies to.
This metadata describes the provenance of the whole dataset.
Collaborator

Similarly to SidecarGeneratedBy vs GeneratedBy, the scope is ambiguous here, i.e. does it apply to all files? What if a file has its own records? We could require a copy in that case as well, but it might be too hard to keep things consistent, etc.

We should spell it out one way or another here

Contributor Author

I'm not sure I fully understand what you mean.

But we can leave Provenance at dataset level as a (minimal) option to describe the provenance of the dataset as an Entity.
Of course this most probably contains partial information, but this does not prevent more precise descriptions from being included inside sidecars.

@effigies effigies added the BEP label Sep 4, 2025
Contributor Author

bclenet commented Sep 11, 2025

Hi @effigies,

Here are a few quick questions about the schema modifications in this PR.
I tested the BIDS examples listed in the description of this PR against a BIDS validator using the modified schema, but I still encounter errors.

  • Provenance files (.json) inside the prov directory are considered sidecars, leading to SIDECAR_WITHOUT_DATAFILE errors
  • Provenance files (.json) inside the prov directory are NOT_INCLUDED according to the validator, although

I think this is due to the fact that the src/schema/rules/files/common/modality_agnostic.yaml file I've added has no impact for now.

Also, files inside derivative datasets lead to NOT_INCLUDED / ALL_FILENAME_RULES_HAVE_ISSUES / FILENAME_MISMATCH / ENTITY_WITH_NO_LABEL errors because their names do not conform to BIDS naming. This seems to contradict this part of the spec.

Could you please share your thoughts on these? Let me know if I should post them anywhere else.

Thanks,

Collaborator

cmaumet commented Sep 16, 2025

@satra @yarikoptic: a point on which we'd love your input w/ @bclenet. It seems to us a bit problematic to use the word "Entity" in the Provenance spec for BIDS since entity in BIDS is already defined (and is different from the meaning we give in BIDS-Prov). We could easily use a different word (that would still be mapped to prov:Entity in the context file), but which word...

Just found out about "Transput" which seems to be a blanket term for both Inputs and Outputs, would that work for you both? Or any other suggestions?

Thanks!
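
To make the renaming question concrete, here is a small Python sketch showing that the JSON keyword can change without changing the RDF mapping. The context fragment and the `expand` helper are hypothetical, and `expand` is a simplified stand-in for real JSON-LD term expansion; only the PROV namespace IRI is taken from PROV-O.

```python
PROV = "http://www.w3.org/ns/prov#"

# Hypothetical context fragment: the current keyword and the proposed one
# both point at the same PROV-O class.
context = {
    "prov": PROV,
    "Entity": "prov:Entity",
    "Transput": "prov:Entity",
}

def expand(term: str, context: dict[str, str]) -> str:
    """Expand a term to a full IRI (simplified, not a JSON-LD processor)."""
    target = context[term]
    prefix, _, local = target.partition(":")
    # Replace a known prefix such as "prov" by its namespace IRI.
    return context[prefix] + local if prefix in context else target

entity_iri = expand("Entity", context)
transput_iri = expand("Transput", context)
```

Whatever word is chosen, only the left-hand key in the context would change; the graph produced from the provenance files would be the same.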
