@EreboPSilva EreboPSilva commented Sep 23, 2025

Complex structures in the attributes of RefSeq GFF files (shown in the examples below) caused the range to be misassigned, which then failed because the end coordinate was smaller than the start coordinate.

After some looking around, it turns out the split that originally handled the attribute was not designed to manage complex, nested attributes. After changing that, while keeping a condition for the normal non-nested cases, all instances in my test were handled as expected.

Simple case:
[[rest of the line]]transl_except=(pos:16975745..16975747%2Caa:Other)

Nested case:
[[rest of the line]]transl_except=(pos:complement(join(11655574%2C11655751..11655752))%2Caa:Other),(pos:complement(11655454..11655456)%2Caa:Other)
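
A minimal sketch of the kind of split that copes with both cases, assuming the value is split first on literal commas that sit outside any parentheses and only then percent-decoded (so that `%2C` never acts as a separator). The function name `split_top_level` is hypothetical and is not taken from the code in this PR.

```python
from urllib.parse import unquote


def split_top_level(value: str, sep: str = ",") -> list:
    """Split `value` on `sep`, but only where the parenthesis depth is zero."""
    parts, current, depth = [], [], 0
    for char in value:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == sep and depth == 0:
            # Separator between two complete "(pos:...,aa:...)" entries.
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    if current:
        parts.append("".join(current))
    return parts


simple = "(pos:16975745..16975747%2Caa:Other)"
nested = (
    "(pos:complement(join(11655574%2C11655751..11655752))%2Caa:Other),"
    "(pos:complement(11655454..11655456)%2Caa:Other)"
)

# Decode %2C only after splitting, so encoded commas stay inside their entry.
simple_entries = [unquote(entry) for entry in split_top_level(simple)]
nested_entries = [unquote(entry) for entry in split_top_level(nested)]
```

With this approach the simple case yields one entry and the nested case yields two, each still wrapped in its own outer parentheses.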

Requirements

When creating your Pull request, please fill out the template below:

PR details

Fix for a very specific (and probably not very widespread) edge case.

Include a short description
(See above for details.) Amends a split function that was not covering all possible cases.

Include links to JIRA tickets
No Jira ticket.

Testing

Have you tested it?
Tested it on a small subset of the problematic GFF (the scaffold containing the entity that originally flagged the issue).

Assign to the weekly GitHub reviewer

@JackCurragh

To avoid issues with complex structures in the attributes that caused
issues down-the-line with the expected structure of subsequent range
assignment.
@EreboPSilva
Member Author

Looking into improving this further.

@EreboPSilva
Member Author

The problem here is transl_except (or translation exceptions, such as selenocysteines, stop codons to be ignored, substitutions, ...). This info is in the attributes column in the GFF file.

Essentially, what we want from this are 3 values: start of exception, end of exception (it will tend to be a codon in most cases, so a length of 3), and type (which tells us the nature of the exception: seleno, substitution, whatnot, ...).

This follows a more or less stable structure in the file I've analysed:

(pos:complement(N..N),aa:TYPE)
(pos:N..N,aa:TYPE)
(pos:N,aa:TYPE) -> For a single nucleotide

Also, a single CDS can have several of these, presented as a comma-separated list of the above structures.

The split function that parses the attribute info to get these values now accepts all these structures. Worth mentioning: due to the format that the core db expects, a single nucleotide is recorded using the position as both start and end. The validation "rule" in the script requires that start be less than end + 1, so this still fulfils it.
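
As an illustration of the three accepted structures, here is a rough sketch that pulls start, end and type out of a single, already percent-decoded entry. The regular expression and the name `parse_entry` are assumptions made for this example, not the code in this PR; anything that does not match one of the three shapes returns `None`.

```python
import re
from typing import Optional, Tuple

# Matches (pos:N..N,aa:TYPE), (pos:N,aa:TYPE) and (pos:complement(N..N),aa:TYPE).
# The conditional group (?(comp)\)) requires the closing parenthesis only when
# "complement(" was matched.
ENTRY_RE = re.compile(
    r"^\(pos:(?P<comp>complement\()?(?P<start>\d+)(?:\.\.(?P<end>\d+))?(?(comp)\))"
    r",aa:(?P<aa>[^)]+)\)$"
)


def parse_entry(entry: str) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, type) for a supported entry, else None."""
    match = ENTRY_RE.match(entry)
    if match is None:
        return None
    start = int(match["start"])
    # Single nucleotide: reuse the position as both start and end.
    end = int(match["end"]) if match["end"] else start
    # Validation rule described above: start must be less than end + 1.
    if not start < end + 1:
        raise ValueError(f"start {start} is after end {end} in {entry}")
    return start, end, match["aa"]
```

For example, `parse_entry("(pos:42,aa:Sec)")` gives `(42, 42, "Sec")`, which passes the start < end + 1 check.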

Now, the problem is that there is another possibility: the exception can take place across different coordinates. Speaking biologically, if a codon is spread between 2 exons, the attribute will include something along the lines of:

(pos:join(N..N,N),aa:TYPE)

And different variations of the above, such as N,N..N, or complement versions of it.

The way we hand this information to the DB can't support this (from what I have seen; I could be wrong). So the current version skips and ignores these. I've thought of skipping just the translation exception, but I fear that if we do, we'll cause issues down the line when our own pipeline tries to translate something and comes up with a stop codon, or something...

The point being, I need to investigate further how this info is handled down the line after it's handed to the DB. If it's fine, we simply add a new format for these cases; if it messes something up, we keep skipping these. They represent a very small percentage of the GFF, so the loss is minimal.
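
To make the join structure concrete, a small sketch of how such a position could be broken into its segments. This only shows the parsing; it is not part of this PR and says nothing about whether the DB can store the result. The name `join_segments` is hypothetical.

```python
from typing import List, Tuple


def join_segments(pos: str) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for a position such as 'join(10..11,20)'."""
    # Drop an optional complement(...) wrapper first.
    if pos.startswith("complement(") and pos.endswith(")"):
        pos = pos[len("complement("):-1]
    if not (pos.startswith("join(") and pos.endswith(")")):
        raise ValueError(f"not a join position: {pos}")
    segments = []
    for part in pos[len("join("):-1].split(","):
        start, _, end = part.partition("..")
        # A bare N is a single nucleotide: start and end are the same.
        segments.append((int(start), int(end) if end else int(start)))
    return segments


segments = join_segments("complement(join(11655574,11655751..11655752))")
# Total number of nucleotides covered by all segments together.
covered = sum(end - start + 1 for start, end in segments)
```

In the nested example from the PR description the two segments cover 1 + 2 nucleotides, i.e. one codon split across two exons.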

to accommodate more cases and exclude the joins that cause issues
@EreboPSilva
Member Author

Pushed last changes, added a long explanation of the problem, what I've solved and what I [sadly] haven't.
It's not pretty but I don't want to invest more time on this.
