Make `original subject`/`original object` plural, multivalued #1574
base: master
I've noticed another ambiguity with these properties: it's not clear to me if the values should be CURIEs with Translator/bioregistry prefixes or not. What do you think? There are plenty of cases where the original resource's subject/object actually weren't CURIEs or used different prefixes/delimiters. It'd be more "true to original source" to keep them this way, but it may be harder to interpret (ex: the resource only uses the numeric ID, but you'd need to dig to find the actual prefix/namespace).
@colleenXu - to clarify my understanding in the case of node normalization: a resource has two edges with different subject IDs, but Translator NodeNorm maps both IDs to the same, different (third) ID?
The primary ID of the NodeNorm clique/entity can be a third ID, but I don't think it has to be. It just has to be the same ID for both. Example where it wouldn't be a "third": a resource has two edges with different subject IDs. The ingester thinks it makes sense/is easy to merge these two edges - but they want to record the original subject IDs. I think it makes sense to include both (even if one of them ends up being the primary ID). (sorry for the late reply)
Two concerns are bumping around in my head about this change:
Do you think we could clarify in the description of these fields what a multivalued original_subject|object should mean (hopefully just one definition)? Or, could you suggest additional slots or metadata fields that would help us?
9f31b4d to 3adaf64
I'm sorry for the length of this post - my thoughts went down several rabbit-holes, and I tried to organize and shorten them as much as possible.
On concern 1
My thinking has been that these properties are only used by ingesters creating single knowledge-source ingests, when edges come from the same source and have the same KL/AT values (and subject/predicate/qualifier set/object). From my POV, we don't need to worry over how these properties behave during "merging edges from two different sources" because we (especially ingesters working on single-knowledge-source ingests) wouldn't make these kinds of merges:
I also don't have any examples off the top of my head of "merging edges from two different sources". As for merging "two edges within the same source that have different id spaces", I'd like to note that the equivalent source IDs can also be from the same namespace. In fact, I don't have any recent real source-data examples for "different ID spaces" because I've only worked with single namespaces in my DISEASES and EBI gene2pheno work.
On concern 2
Based on my reply to concern 1 above, I'd say "the multiple values come from the source ingest". However, the source ingest isn't always the primary_knowledge_source (even though I think we're also trying to ingest from primary sources as much as possible rather than get the data from aggregators). I think this concern isn't unique to these properties and points to a deeper issue: did an edge property's values come from Translator curation (ex: KL/AT) or from the ingested source? And which source in the chain of provenance was ingested (sometimes primary, aggregator, or maybe even supporting)? I imagine this is what TRAPI's
On the last paragraph
Yes, I think once we're on the same page, we could add more detail/guidance on usage to the field descriptions. I'm not sure about additional slots/metadata. The ideas I have are:
And a follow-up to my earlier comment on value format: I'm now coming around to "the values should be Translator CURIEs, even if the source's ID format was different". You mentioned "different id spaces" in your comment, and I realized that I wasn't taking this into account. If equivalent IDs are from different ID namespaces, you'd want to know what the namespaces are for each ID (especially if the IDs are the same format, e.g. numeric).
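To make the namespace concern concrete, here is a minimal sketch, assuming a simplified `prefix:local_id` split rather than a full CURIE validator. The function name `split_curie` and the bare value `12345` are hypothetical; `MONDO:0000774` is an ID that appears elsewhere in this thread.

```python
def split_curie(value):
    """Split a value into (prefix, local_id).

    Simplified assumption: a CURIE is 'prefix:local_id' with a non-empty
    prefix and local ID. Returns (None, value) when no prefix is present.
    """
    prefix, sep, local_id = value.partition(":")
    if not sep or not prefix or not local_id:
        # No usable prefix: the namespace cannot be recovered from the value.
        return None, value
    return prefix, local_id


# A prefixed value carries its namespace with it.
print(split_curie("MONDO:0000774"))

# A bare numeric ID (hypothetical) does not; the reader would have to dig
# through the source's documentation to learn which namespace it belongs to.
print(split_curie("12345"))
```

With prefixed values, two equivalent IDs from different namespaces stay distinguishable even when their local parts look alike.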
This isn't a complete response, apologies, but it would really help me to see two edges with different subjects/objects from the same source that need to be merged as a result of node normalization. I know this is a true use case, but I am getting a bit tangled in your subsequent clarifications. I think this will be a helpful use case for spec'ing out the "location of the NN step" in the new data ingest pipeline as well.
I'd like to understand the difference between "original_subject|object" and the collection of alternative ids in a node norm response.
Sorry for the late response! When I looked for examples in my DISEASES re-ingest, I discovered some NodeNorm issues. So I had to extract and review all the cases first (and share my findings with NodeNorm - internal Slack link). Here are some real, valid examples from my DISEASES re-ingest:
1: different objects map to the same NodeNorm entity ("Autoimmune neuropathy" and "Autoimmune peripheral neuropathy")
The original rows/edges from the DISEASES text-mining file have different disease IDs/names and edge properties:
NodeNorm CI maps both IDs to the same disease MONDO:0000774 (autoimmune neuropathy). This matches Monarch's mappings/"Also known as".
2: another example of different objects ("Cannabis dependence" and "Cannabis abuse")
The original rows/edges from the DISEASES text-mining file have different disease IDs/names and edge properties:
NodeNorm CI maps both IDs to the same disease MONDO:0005689 (cannabis abuse). This matches Monarch's mappings/"Also known as".
3: different subjects (ENSP IDs) map to the same NodeNorm entity
The original rows/edges from the DISEASES knowledge file only differ by the "gene" ID and name (these are actually ENSEMBL protein IDs, ENSP):
NodeNorm CI maps both IDs to the same protein UniProtKB:Q16637-1 (survival motor neuron protein) because UniProt treats the protein products of genes SMN1 and SMN2 as the same protein (it's a case of gene copies).
And on…
I'm assuming the "collection of alternative ids" are node properties? My team (BTE/Service Provider) used the node property (Note: looks like different ARAs format/use different attribute_type_ids for these)
Currently, `original subject` and `original object` are singular - which implies only a single string value is allowed (but there are no schema elements to enforce this).

However, our team has encountered cases where edges with different original subjects/objects should maybe be merged into 1 Translator edge. This most commonly happens during node normalization (ex: a resource has two edges with different subject IDs, but Translator NodeNorm puts both IDs in the same entity). We often think it'd be appropriate to merge the original edges, but we'd like to keep all the original subject/object info. This would likely be relevant for many resource ingests.

This change makes it explicitly clear that multiple values are allowed for `original subject` and `original object` in these edge-merging cases.

(Note: there's also an automated job that ran in my fork's master branch that's included in this PR)
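The edge-merging case described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how any Translator ingest pipeline actually does it: the `NORMALIZED` lookup is a hypothetical stand-in for a NodeNorm call, `merge_edges` is a hypothetical helper, and every `EX:` identifier and the predicate are made up.

```python
# Hypothetical stand-in for a NodeNorm lookup: source ID -> normalized ID.
# All identifiers here are invented for illustration.
NORMALIZED = {
    "EX:0001": "EX:0001",  # already the primary ID of its entity
    "EX:0002": "EX:0001",  # equivalent ID in the same entity
    "EX:0100": "EX:0100",
}


def merge_edges(edges):
    """Merge edges whose normalized subject/predicate/object are identical,
    keeping every distinct original subject and object as a list."""
    merged = {}
    for edge in edges:
        subj = NORMALIZED.get(edge["subject"], edge["subject"])
        obj = NORMALIZED.get(edge["object"], edge["object"])
        key = (subj, edge["predicate"], obj)
        record = merged.setdefault(key, {
            "subject": subj,
            "predicate": edge["predicate"],
            "object": obj,
            "original_subject": [],  # multivalued
            "original_object": [],   # multivalued
        })
        # Record each original ID once, in first-seen order.
        if edge["subject"] not in record["original_subject"]:
            record["original_subject"].append(edge["subject"])
        if edge["object"] not in record["original_object"]:
            record["original_object"].append(edge["object"])
    return list(merged.values())


source_edges = [
    {"subject": "EX:0001", "predicate": "ex:related_to", "object": "EX:0100"},
    {"subject": "EX:0002", "predicate": "ex:related_to", "object": "EX:0100"},
]
result = merge_edges(source_edges)
print(result)
```

In this sketch both source edges collapse into one edge, and both original subject IDs are kept, including the one that happens to be the primary ID.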