colleenXu
Contributor

@colleenXu colleenXu commented Jun 11, 2025

Currently, original subject and original object are singular, which implies that only a single string value is allowed (but there are no schema elements to enforce this).

However, our team has encountered cases where edges with different original subjects/objects should arguably be merged into one Translator edge. This most commonly happens during node normalization (e.g., a resource has two edges with different subject IDs, but Translator NodeNorm puts both IDs in the same entity). We often think it'd be appropriate to merge the original edges, but we'd like to keep all the original subject/object info. This would likely be relevant for many resource ingests.

This change makes it explicit that multiple values are allowed for original subjects and original objects in these edge-merging cases.

(Note: there's also an automated job that ran in my fork's master branch that's included in this PR)

@colleenXu
Contributor Author

colleenXu commented Jun 11, 2025

@sierra-moxon

I've noticed another ambiguity with these properties: it's not clear to me if the values should be CURIEs with Translator/bioregistry prefixes or not. What do you think?

There are plenty of cases where the original resource's subject/object actually weren't CURIEs or used different prefixes/delimiters. It'd be more "true to original source" to keep them this way, but it may be harder to interpret (ex: the resource only uses the numeric ID, but you'd need to dig to find the actual prefix/namespace).

@colleenXu changed the title from "Make original subject/object plural, multivalued" to "Make original subject/original object plural, multivalued" on Jun 11, 2025
@sierra-moxon
Member

@colleenXu - to clarify my understanding in the case of node normalization:

A resource has two edges with different subject IDs, but Translator NodeNorm maps both IDs to a different, third ID?

@colleenXu
Contributor Author

The primary ID of the NodeNorm clique/entity can be a third ID, but I don't think it has to be. The two source IDs just have to map to the same primary ID.

Example where it wouldn't be a "third":

A resource has two edges with different subject IDs, CHEBI:10093 (Yohimbine) and PUBCHEM.COMPOUND:6169 (Yohimbine Hydrochloride). NodeNorm CI maps both to the same clique/entity, with the primary ID CHEBI:10093 (the first edge's).

The ingester thinks it makes sense/is easy to merge these two edges - but they want to record the original subject IDs. I think it makes sense to include both (even if one of them ends up being the primary ID).
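
To make the merge concrete, here is a minimal Python sketch of what I have in mind. Everything in it is illustrative: `PRIMARY_ID` is a hard-coded stand-in for a NodeNorm lookup, `merge_edges` and the `original_subjects` key are hypothetical names, and the predicate and object are placeholders because the example above doesn't specify them.

```python
# Hard-coded stand-in for a NodeNorm lookup (source ID -> primary ID of its clique).
# Only the two mappings from the example above are included.
PRIMARY_ID = {
    "CHEBI:10093": "CHEBI:10093",
    "PUBCHEM.COMPOUND:6169": "CHEBI:10093",
}


def merge_edges(edges):
    """Merge edges that are identical after subject normalization.

    Every source subject ID is kept, in first-seen order, in a multivalued
    'original_subjects' list (a hypothetical key name for this sketch).
    """
    merged = {}
    for edge in edges:
        # Fall back to the source ID when the lookup has no mapping for it.
        subject = PRIMARY_ID.get(edge["subject"], edge["subject"])
        key = (subject, edge["predicate"], edge["object"])
        record = merged.setdefault(
            key,
            {
                "subject": subject,
                "predicate": edge["predicate"],
                "object": edge["object"],
                "original_subjects": [],
            },
        )
        if edge["subject"] not in record["original_subjects"]:
            record["original_subjects"].append(edge["subject"])
    return list(merged.values())


# Placeholder predicate and object: the example above doesn't name them.
source_edges = [
    {"subject": "CHEBI:10093", "predicate": "biolink:related_to", "object": "EXAMPLE:1"},
    {"subject": "PUBCHEM.COMPOUND:6169", "predicate": "biolink:related_to", "object": "EXAMPLE:1"},
]
merged_edges = merge_edges(source_edges)
```

In this sketch the two source edges collapse into one edge whose subject is CHEBI:10093, and both source IDs are kept, including the one that happens to be the primary ID.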

(sorry for the late reply)

@sierra-moxon
Member

Two concerns are bumping around in my head about this change:

  1. A multivalued original_subject|object means that we could be obscuring a two-step merge, either of two sources feeding one modular ingest, or of two edges within the same source that use different ID spaces. Can you give me specific examples of sources like this so I can better understand whether this concern is founded in reality?

  2. From the perspective of someone other than the author of the modular ingest, seeing more than one value there would lead me to ask the question above: what happened here to give me more than one original subject? I would then look at the metadata on the edge to see if there was a trail of provenance for the merge, but this might always have to be a manual examination, because at that point I might be comparing a simple list with another list (e.g. multiple aggregator knowledge sources?), or a simple list with a couple of other properties (one aggregator KS and one primary KS).

Do you think we could clarify in the description of these fields what a multivalued original_subject|object should mean (hopefully just one definition)? Or, could you suggest additional slots or metadata fields that would help us?

@colleenXu force-pushed the plural-original-subject-object branch from 9f31b4d to 3adaf64 on July 1, 2025 18:31
@colleenXu
Contributor Author

colleenXu commented Jul 2, 2025

I'm sorry for the length of this post - my thoughts went down several rabbit-holes, and I tried to organize and shorten them as much as possible.

On concern 1

My thinking has been that these properties are only used by ingesters creating single knowledge-source ingests, when edges come from the same source and have the same KL/AT values (and subject/predicate/qualifier set/object).

From my POV, we don't need to worry about how these properties behave when "merging edges from two different sources", because we (especially ingesters working on single-knowledge-source ingests) wouldn't make these kinds of merges:

  • there can only be 1 primary_knowledge_source, 1 KL, and 1 AT value on an edge, which prevents merging of the "same edge" from different sources
  • in this effort, we are trying not to ingest "duplicate info" from the same underlying source as much as possible
  • I don't imagine later non-DINGO work (putting data into Tiers 1/0, use by ARAs / O&O / ARS) would involve merging edges from different sources and mutating edge properties.

I also don't have any examples off the top of my head of "merging edges from two different sources".

As for merging "two edges within the same source that have different id spaces", I'd like to note: the equivalent source IDs can also be from the same namespace. In fact, I don't have any recent real source-data examples for "different ID spaces" because I've only worked with single namespaces in DISEASES and EBI gene2pheno work.

On concern 2

Based on my reply to concern 1 above, I'd say "the multiple values come from the source ingest". However, the source ingest isn't always the primary_knowledge_source (even though I think we're also trying to ingest primary as much as possible rather than get the data from aggregators?).

I think this concern isn't unique to these properties and points to a deeper issue: did an edge property's values come from Translator curation (ex: KL/AT) or the source ingested? And what is the source ingested in the chain of provenance (sometimes primary, aggregator, or maybe even supporting)? I imagine this is what TRAPI's attribute_source was meant to address.

On the last paragraph

Yes, I think once we're on the same page, we could add more detail/guidance on usage to the field descriptions.

I'm not sure about additional slots/metadata. The ideas I have are:

  • adding a slot to RetrievalSource to flag the actual source ingested
  • in the reference ingest guide or other file, point out where each edge property comes from (I think this has been discussed previously, but isn't in the current reference ingest guide proposal?)
  • rehashing the discussion of TRAPI AMF attribute_source. I'm not enthusiastic about doing this.

@colleenXu
Contributor Author

And a follow-up to my earlier comment on value format:

I'm now coming around to "the values should be Translator CURIEs, even if the source's ID format was different". You mentioned "different id spaces" in your comment, and I realized that I wasn't taking this into account. If equivalent IDs are from different ID namespaces, you'd want to know what the namespaces are for each ID (especially if the IDs are the same format, e.g. numeric).
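
As a rough illustration of why the prefix matters, here's a small Python sketch. `to_curie` is a hypothetical helper, and the bare ID `6169` is just the numeric part of the PubChem ID from my earlier example, imagined as what a source file might contain.

```python
def to_curie(prefix, local_id):
    """Build a CURIE from a namespace prefix and a source-local ID.

    Hypothetical helper for this sketch: it strips whitespace and leaves an ID
    alone if it already starts with the given prefix.
    """
    local_id = str(local_id).strip()
    if local_id.startswith(prefix + ":"):
        return local_id
    return f"{prefix}:{local_id}"


# A bare numeric ID says nothing about its namespace.
bare_id = "6169"
curie = to_curie("PUBCHEM.COMPOUND", bare_id)
already_prefixed = to_curie("CHEBI", "CHEBI:10093")
```

Recording `curie` rather than `bare_id` keeps the namespace next to the value, so a reader doesn't have to dig through the source's documentation to interpret it.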

@sierra-moxon
Member

This isn't a complete response, apologies, but it would really help me to see two edges with different subjects/objects from the same source that need to be merged as a result of node normalization. I know this is a true use case, but I am getting a bit tangled in your subsequent clarifications. I think this will be a helpful use case for spec'ing out the "location of the NN step" in the new data ingest pipeline as well.

@sierra-moxon
Member

I'd like to understand the difference between "original_subject|object" and the collection of alternative ids in a node norm response.

@colleenXu
Contributor Author

Sorry for the late response! When I looked for examples in my DISEASES re-ingest, I discovered some NodeNorm issues. So I had to extract and review all the cases first (and share my findings with NodeNorm - internal Slack link).

Here are some real, valid examples from my DISEASES re-ingest:

1: different objects map to the same NodeNorm entity ("Autoimmune neuropathy" and "Autoimmune peripheral neuropathy")

The original rows/edges from the DISEASES text-mining file have different disease IDs/names and edge properties:

| gene_id | gene_name | disease_id | disease_name | z_score | confidence_score | url |
| --- | --- | --- | --- | --- | --- | --- |
| ENSP00000376048 | MAG | DOID:0060499 | Autoimmune neuropathy | 4.998 | 2.499 | https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=ENSP00000376048&type2=-26&id2=DOID:0060499 |
| ENSP00000376048 | MAG | DOID:0040087 | Autoimmune peripheral neuropathy | 4.177 | 2.089 | https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=ENSP00000376048&type2=-26&id2=DOID:0040087 |

NodeNorm CI maps both IDs to the same disease MONDO:0000774 (autoimmune neuropathy). This matches Monarch's mappings/"Also known as".

2: another example of different objects ("Cannabis dependence" and "Cannabis abuse")

The original rows/edges from the DISEASES text-mining file have different disease IDs/names and edge properties:

| gene_id | gene_name | disease_id | disease_name | z_score | confidence_score | url |
| --- | --- | --- | --- | --- | --- | --- |
| ENSP00000427603 | GABRA2 | DOID:1849 | Cannabis dependence | 5.000 | 2.500 | https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=ENSP00000427603&type2=-26&id2=DOID:1849 |
| ENSP00000427603 | GABRA2 | DOID:9505 | Cannabis abuse | 3.537 | 1.769 | https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=ENSP00000427603&type2=-26&id2=DOID:9505 |

NodeNorm CI maps both IDs to the same disease MONDO:0005689 (cannabis abuse). This matches Monarch's mappings/"Also known as".

3: different subjects (ENSP IDs) map to the same NodeNorm entity

The original rows/edges from the DISEASES knowledge file only differ by the "gene" ID and name (these are actually ENSEMBL protein IDs, ENSP):

| gene_id | gene_name | disease_id | disease_name | source_db | evidence_type | confidence_score |
| --- | --- | --- | --- | --- | --- | --- |
| ENSP00000370083 | SMN1 | DOID:12377 | Spinal muscular atrophy | MedlinePlus | CURATED | 5 |
| ENSP00000370119 | SMN2 | DOID:12377 | Spinal muscular atrophy | MedlinePlus | CURATED | 5 |

NodeNorm CI maps both IDs to the same protein UniProtKB:Q16637-1 (survival motor neuron protein) because UniProt treats the protein products of genes SMN1 and SMN2 as the same protein (it's a case of gene copies).

@colleenXu
Contributor Author

And on…

> I'd like to understand the difference between "original_subject|object" and the collection of alternative ids in a node norm response.

I'm assuming the "collection of alternative ids" are node properties?

My team (BTE/Service Provider) used the node property biolink:xref to record ALL the equivalent_identifiers of the NodeNorm entity. I think edge properties are more appropriate than node properties for recording the actual IDs used in the original data that makes up the edge. Nodes, in contrast, can be used by various edges and by many sources (in an aggregator resource or in Tier 0), so a node property would leave it ambiguous which "original IDs" were used by a specific edge's original data.

(Note: looks like different ARAs format/use different attribute_type_ids for these)
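
Here's a small Python sketch of the difference, using the SMN1/SMN2 example from my previous comment. The dict layouts, the `original_subjects` key, the predicate, and the second edge are assumptions made for illustration, not an actual NodeNorm or TRAPI response. The xref list is limited to the two IDs mentioned in this thread, whereas a real clique would hold more.

```python
# Node-level record: xref lists the clique's equivalent IDs, regardless of
# which edge or source used them.
node = {
    "id": "UniProtKB:Q16637-1",
    "xref": ["ENSP00000370083", "ENSP00000370119"],
}

# Edge-level records: each edge lists only the IDs found in its own source rows.
merged_edge = {
    "subject": node["id"],
    "predicate": "biolink:related_to",  # placeholder predicate
    "object": "DOID:12377",  # left as in the source file for brevity
    "original_subjects": ["ENSP00000370083", "ENSP00000370119"],
}
other_edge = {  # hypothetical edge from a row that used only one of the IDs
    "subject": node["id"],
    "predicate": "biolink:related_to",
    "object": "EXAMPLE:1",  # placeholder object
    "original_subjects": ["ENSP00000370083"],
}

# Both edges point at the same node, so the node's xref is identical for both...
xref_seen_from_each_edge = [node["xref"] for _ in (merged_edge, other_edge)]
# ...while the edge property still distinguishes what each edge's source data used.
ids_used_per_edge = [e["original_subjects"] for e in (merged_edge, other_edge)]
```

The node's xref can't tell you that the second edge's source data only used one of the IDs, but the edge property can.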
