Make `original subject`/`original object` plural, multivalued #1574

colleenXu · 2025-06-11T18:10:51Z

Currently, original subject and original object are singular - which implies only a single string value is allowed (but there are no schema elements to reinforce this).

However, our team have encountered cases where edges with different original subjects/objects should maybe be merged into 1 Translator edge. This most commonly happens during node normalization (ex: resource has two edges with different subject IDs, but Translator NodeNorm puts both IDs in the same entity). We often think it'd be appropriate to merge the original edges, but we'd like to keep all the original subject/object info. This would likely be relevant for many resource ingests.

This change makes it explicitly clear that multiple values are allowed for original subjects and original objects in these edge-merging cases.

(Note: there's also an automated job that ran in my fork's master branch that's included in this PR)

colleenXu · 2025-06-11T18:14:26Z

@sierra-moxon

I've noticed another ambiguity with these properties: it's not clear to me if the values should be CURIEs with Translator/bioregistry prefixes or not. What do you think?

There are plenty of cases where the original resource's subject/object actually weren't CURIEs or used different prefixes/delimiters. It'd be more "true to original source" to keep them this way, but it may be harder to interpret (ex: the resource only uses the numeric ID, but you'd need to dig to find the actual prefix/namespace).

sierra-moxon · 2025-06-12T16:20:22Z

@colleenXu - to clarify my understanding in the case of node normalization:

A resource has two edges with different subject IDs, but Translator NodeNorm maps both IDs to the same, different/third ID?

colleenXu · 2025-06-20T07:16:21Z

The primary ID of the NodeNorm clique/entity can be a third ID, but I don't think it has to be. It just has to be the same ID.

Example where it wouldn't be a "third":

A resource has two edges with different subject IDs, CHEBI:10093 (Yohimbine) and PUBCHEM.COMPOUND:6169 (Yohimbine Hydrochloride). NodeNorm CI maps both to the same clique/entity, with the primary ID CHEBI:10093 (the first edge's).

The ingester thinks it makes sense/is easy to merge these two edges - but they want to record the original subject IDs. I think it makes sense to include both (even if one of them ends up being the primary ID).
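A minimal sketch of the merge described above, assuming a plain dictionary stands in for the NodeNorm lookup. The property name `original_subjects`, the predicate, and the object ID are placeholders for illustration; only the two subject IDs and their shared primary ID come from this example.

```python
# Stand-in for a NodeNorm lookup: source ID -> primary ID of its clique.
# These two mappings are the ones described in the example above.
NORMALIZED = {
    "CHEBI:10093": "CHEBI:10093",
    "PUBCHEM.COMPOUND:6169": "CHEBI:10093",
}

# Hypothetical edges: the predicate and object are placeholders,
# since the example only specifies the subject IDs.
source_edges = [
    {"subject": "CHEBI:10093", "predicate": "biolink:related_to", "object": "EXAMPLE:1"},
    {"subject": "PUBCHEM.COMPOUND:6169", "predicate": "biolink:related_to", "object": "EXAMPLE:1"},
]


def merge_after_normalization(edges, normalized):
    """Merge edges whose subjects normalize to the same ID, keeping every original subject."""
    merged = {}
    for edge in edges:
        subject = normalized[edge["subject"]]
        key = (subject, edge["predicate"], edge["object"])
        record = merged.setdefault(
            key,
            {
                "subject": subject,
                "predicate": edge["predicate"],
                "object": edge["object"],
                "original_subjects": [],  # assumed name for the multivalued property
            },
        )
        # Record the source ID even when it equals the primary ID.
        if edge["subject"] not in record["original_subjects"]:
            record["original_subjects"].append(edge["subject"])
    return list(merged.values())


merged_edges = merge_after_normalization(source_edges, NORMALIZED)
```

Here both source edges collapse into one edge with subject CHEBI:10093, and both source IDs are retained, including the one that became the primary ID.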

(sorry for the late reply)

sierra-moxon · 2025-07-01T16:21:04Z

Two concerns are bumping around in my head about this change:

1. A multivalued original_subject|object means that we could be obfuscating a two-step merge, either from two sources making one modular ingest, or from two edges within the same source that have different id spaces. Can you give me specific examples of sources like this so I can understand better if this concern is founded in reality or not?
2. From the perspective of someone other than the author of the modular ingest, seeing more than one value there would lead me to ask the question above: what happened here to give me more than one original subject? I would then look at the metadata on the edge to see if there was a trail of provenance for the merge, but this might have to be a manual examination every time, because at that point I might be comparing a simple list with another list (e.g. multiple aggregator knowledge sources?), or a simple list with a couple of other properties (one aggregator KS and one primary KS).

Do you think we could clarify in the description of these fields what a multivalued original_subject|object should mean (hopefully just one definition)? Or, could you suggest additional slots or metadata fields that would help us?

colleenXu · 2025-07-02T03:15:33Z

I'm sorry for the length of this post - my thoughts went down several rabbit-holes, and I tried to organize and shorten them as much as possible.

On concern 1

My thinking has been that these properties are only used by ingesters creating single knowledge-source ingests, when edges come from the same source and have the same KL/AT values (and subject/predicate/qualifier set/object).

From my POV, we don't need to worry over how these properties behave during "merging edges from two different sources" because we (especially ingesters working on single-knowledge-source ingests) wouldn't make these kinds of merges:

- there can only be 1 primary_knowledge_source, 1 KL, and 1 AT value on an edge, which prevents merging of the "same edge" from different sources
- in this effort, we are trying not to ingest "duplicate info" from the same underlying source as much as possible
- I don't imagine later non-DINGO work (putting data into Tiers 1/0, use by ARAs / O&O / ARS) would involve merging edges from different sources and mutating edge properties.
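One way to picture the constraint in the list above is as a merge key: two edges are merge candidates only if the key matches in full. The following is a sketch under assumptions, not an existing Translator function. The field names, the `infores:example-*` sources, and the KL/AT values are illustrative.

```python
from typing import NamedTuple


class MergeKey(NamedTuple):
    """Fields that must all match before two edges may be merged (assumed set)."""
    subject: str
    predicate: str
    object: str
    qualifiers: tuple
    primary_knowledge_source: str
    knowledge_level: str
    agent_type: str


def merge_key(edge: dict) -> MergeKey:
    # Qualifiers are sorted so that dictionary ordering does not affect equality.
    return MergeKey(
        edge["subject"],
        edge["predicate"],
        edge["object"],
        tuple(sorted(edge.get("qualifiers", {}).items())),
        edge["primary_knowledge_source"],
        edge["knowledge_level"],
        edge["agent_type"],
    )


def can_merge(a: dict, b: dict) -> bool:
    return merge_key(a) == merge_key(b)


# Hypothetical edges, already normalized; all values are placeholders.
edge_a = {
    "subject": "EXAMPLE:1",
    "predicate": "biolink:related_to",
    "object": "EXAMPLE:2",
    "primary_knowledge_source": "infores:example-a",
    "knowledge_level": "knowledge_assertion",
    "agent_type": "manual_agent",
}
edge_same_source = dict(edge_a)
edge_other_source = dict(edge_a, primary_knowledge_source="infores:example-b")

same_source_mergeable = can_merge(edge_a, edge_same_source)
cross_source_mergeable = can_merge(edge_a, edge_other_source)
```

Because primary_knowledge_source is part of the key, the "same edge" from a different source never matches, which is the behavior the first bullet describes.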

I also don't have any examples off the top of my head of "merging edges from two different sources".

As for merging "two edges within the same source that have different id spaces", I'd like to note: the equivalent source IDs can also be from the same namespace. In fact, I don't have any recent real source-data examples for "different ID spaces" because I've only worked with single namespaces in DISEASES and EBI gene2pheno work.

On concern 2

Based on my reply to concern 1 above, I'd say "the multiple values come from the source ingest". However, the source ingest isn't always the primary_knowledge_source (even though I think we're also trying to ingest primary as much as possible rather than get the data from aggregators?).

I think this concern isn't unique to these properties and points to a deeper issue: did an edge property's values come from Translator curation (ex: KL/AT) or the source ingested? And what is the source ingested in the chain of provenance (sometimes primary, aggregator, or maybe even supporting)? I imagine this is what TRAPI's attribute_source was meant to address.

On the last paragraph

Yes, I think once we're on the same page, we could add more detail/guidance on usage to the field descriptions.

I'm not sure about additional slots/metadata. The ideas I have are:

- adding a slot to RetrievalSource to flag the actual source ingested
- in the reference ingest guide or other file, pointing out where each edge property comes from (I think this has been discussed previously, but isn't in the current reference ingest guide proposal?)
- rehashing the discussion of TRAPI AMF attribute_source. I'm not enthusiastic about doing this

colleenXu · 2025-07-02T03:20:57Z

And a follow-up to my earlier comment on value format:

I'm now coming around to "the values should be Translator CURIEs, even if the source's ID format was different". You mentioned "different id spaces" in your comment, and I realized that I wasn't taking this into account. If equivalent IDs are from different ID namespaces, you'd want to know what the namespaces are for each ID (especially if the IDs are the same format, e.g. numeric).
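A small illustration of the ambiguity, assuming a source that publishes bare numeric IDs. `PUBCHEM.COMPOUND:6169` is taken from the earlier example in this thread; `EXAMPLE.NAMESPACE` and the helper `to_curie` are hypothetical.

```python
def to_curie(prefix: str, local_id: str) -> str:
    """Build a CURIE from a namespace prefix and a source-local identifier."""
    return f"{prefix}:{local_id}"


# Two IDs that share the same numeric local part but live in different namespaces.
# EXAMPLE.NAMESPACE is a made-up prefix used only for this illustration.
source_ids = [
    ("PUBCHEM.COMPOUND", "6169"),
    ("EXAMPLE.NAMESPACE", "6169"),
]

# Keeping only the value as the source wrote it loses the namespace.
bare_values = {local_id for _, local_id in source_ids}

# Recording CURIEs keeps the two identifiers distinguishable.
curie_values = {to_curie(prefix, local_id) for prefix, local_id in source_ids}
```

With bare values the two identifiers are indistinguishable, while the CURIE form keeps them apart without needing to dig for the namespace.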

sierra-moxon · 2025-07-03T17:48:29Z

This isn't a complete response, apologies, but it would really help me to see two edges with different subjects/objects from the same source that need to be merged as a result of node normalization. I know this is a true use case, but I am getting a bit tangled in your subsequent clarifications. I think this will be a helpful use case for spec'ing out the "location of the NN step" in the new data ingest pipeline as well.

sierra-moxon · 2025-07-03T17:49:11Z

I'd like to understand the difference between "original_subject|object" and the collection of alternative ids in a node norm response.

colleenXu · 2025-07-17T17:54:08Z

Sorry for the late response! When I looked for examples in my DISEASES re-ingest, I discovered some NodeNorm issues. So I had to extract and review all the cases first (and share my findings with NodeNorm - internal Slack link).

Here are some real, valid examples from my DISEASES re-ingest:

1: different objects map to the same NodeNorm entity ("Autoimmune neuropathy" and "Autoimmune peripheral neuropathy")

The original rows/edges from the DISEASES text-mining file have different disease IDs/names and edge properties:

gene_id	gene_name	disease_id	disease_name	z_score	confidence_score	url
ENSP00000376048	MAG	DOID:0060499	Autoimmune neuropathy	4.998	2.499	https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=ENSP00000376048&type2=-26&id2=DOID:0060499
ENSP00000376048	MAG	DOID:0040087	Autoimmune peripheral neuropathy	4.177	2.089	https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=ENSP00000376048&type2=-26&id2=DOID:0040087

NodeNorm CI maps both IDs to the same disease MONDO:0000774 (autoimmune neuropathy). This matches Monarch's mappings/"Also known as".

2: another example of different objects ("Cannabis dependence" and "Cannabis abuse")

The original rows/edges from the DISEASES text-mining file have different disease IDs/names and edge properties:

gene_id	gene_name	disease_id	disease_name	z_score	confidence_score	url
ENSP00000427603	GABRA2	DOID:1849	Cannabis dependence	5.000	2.500	https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=ENSP00000427603&type2=-26&id2=DOID:1849
ENSP00000427603	GABRA2	DOID:9505	Cannabis abuse	3.537	1.769	https://diseases.jensenlab.org/Entity?documents=10&type1=9606&id1=ENSP00000427603&type2=-26&id2=DOID:9505

NodeNorm CI maps both IDs to the same disease MONDO:0005689 (cannabis abuse). This matches Monarch's mappings/"Also known as".

3: different subjects (ENSP IDs) map to the same NodeNorm entity

The original rows/edges from the DISEASES knowledge file only differ by the "gene" ID and name (these are actually ENSEMBL protein IDs, ENSP):

gene_id	gene_name	disease_id	disease_name	source_db	evidence_type	confidence_score
ENSP00000370083	SMN1	DOID:12377	Spinal muscular atrophy	MedlinePlus	CURATED	5
ENSP00000370119	SMN2	DOID:12377	Spinal muscular atrophy	MedlinePlus	CURATED	5

NodeNorm CI maps both IDs to the same protein UniProtKB:Q16637-1 (survival motor neuron protein) because UniProt treats the protein products of genes SMN1 and SMN2 as the same protein (it's a case of gene copies).
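Example 3 can be sketched as follows, assuming the two rows are read as tab-separated text and that a hardcoded dictionary stands in for the NodeNorm response. The `ENSEMBL:` prefix on the recorded IDs and the property name `original_subjects` are assumptions. The disease ID is left as it appears in the file because its normalized form is not given here.

```python
import csv
import io
from collections import defaultdict

# The two rows from example 3, as tab-separated text.
ROWS = (
    "gene_id\tgene_name\tdisease_id\tdisease_name\tsource_db\tevidence_type\tconfidence_score\n"
    "ENSP00000370083\tSMN1\tDOID:12377\tSpinal muscular atrophy\tMedlinePlus\tCURATED\t5\n"
    "ENSP00000370119\tSMN2\tDOID:12377\tSpinal muscular atrophy\tMedlinePlus\tCURATED\t5\n"
)

# Stand-in for NodeNorm: both ENSP IDs map to the same protein, per the example.
NORMALIZED_SUBJECT = {
    "ENSP00000370083": "UniProtKB:Q16637-1",
    "ENSP00000370119": "UniProtKB:Q16637-1",
}


def merge_rows(tsv_text: str) -> list:
    """Group rows that are identical after subject normalization."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        key = (
            NORMALIZED_SUBJECT[row["gene_id"]],
            row["disease_id"],  # not normalized in this sketch
            row["source_db"],
            row["evidence_type"],
            row["confidence_score"],
        )
        # Assumed CURIE form for the source's ENSP identifiers.
        grouped[key].append(f"ENSEMBL:{row['gene_id']}")
    return [
        {"subject": key[0], "object": key[1], "original_subjects": ids}
        for key, ids in grouped.items()
    ]


merged = merge_rows(ROWS)
```

Examples 1 and 2 would need an extra decision that this sketch does not make: their rows carry different z_score and confidence_score values, so the merged edge would have to keep or combine them in some way.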

colleenXu · 2025-07-17T17:54:39Z

And on…

> I'd like to understand the difference between "original_subject|object" and the collection of alternative ids in a node norm response.

I'm assuming the "collection of alternative ids" are node properties?

My team (BTE/Service Provider) used the node property biolink:xref to record ALL the equivalent_identifiers of the NodeNorm entity. I think edge properties are more appropriate than node properties for recording the actual IDs used in the original data that makes up the edge. In contrast, nodes can be used by various edges and by many sources (in an aggregator resource or in Tier 0), so a node property would leave ambiguity about which "original IDs" were used by a specific edge's original data.

(Note: looks like different ARAs format/use different attribute_type_ids for these)

colleenXu changed the title ~~Make original subject/object plural, multivalued~~ Make original subject/original object plural, multivalued Jun 11, 2025

colleenXu added 6 commits July 1, 2025 11:30

change original subject/object to plural, multivalued

39738c7

add sources: edge property used in TRAPI

916d7b0

minor edits to existing properties of retrieval source

4e58714

add source record urls to RetrievalSource: to align with TRAPI

600cb4b

add sources to Association slots (aka edge properties)

aa48308

shorten lines with 'line too long' warnings

3adaf64

colleenXu force-pushed the plural-original-subject-object branch from 9f31b4d to 3adaf64 Compare July 1, 2025 18:31

Merge branch 'master' into plural-original-subject-object

320127c

Merge branch 'master' into plural-original-subject-object

22daf99

sierra-moxon added 4 commits July 8, 2025 14:11

Merge branch 'master' into plural-original-subject-object

2257680

Merge branch 'master' into plural-original-subject-object

97e3278

Merge branch 'master' into plural-original-subject-object

82c94b6

Merge branch 'master' into plural-original-subject-object

f755bdb

colleenXu mentioned this pull request Sep 19, 2025

Normalization step failure-collection/review ideas NCATSTranslator/translator-ingests#49

Open
