@lareinahu-2023 commented Sep 8, 2025

Final Report: GSoC ’25

Student Name: Jiahui Hu (Lareina)
Organization: National Resource for Network Biology (NRNB)
Mentors: Nantia Leonidou, Prof. Dr. Andreas Dräger
Project: Enhancing SBOannotator with LLM Integration & Dynamic Term Retrieval


Overview

This project transforms SBOannotator from a static, hard-coded tool into a dynamic, intelligent system for annotating Systems Biology Ontology (SBO) terms in SBML models. The enhanced system integrates:

  • Real-time SBO term retrieval
  • Multiple enzymology data sources (BiGG, KEGG, Reactome, SEED)
  • Fine-tuned LLM-assisted annotation
  • A Python runtime GUI and a packaged desktop GUI with interactive visualization

These improvements significantly boost accuracy and usability while preserving the core rule-based strengths.


Methods

1) Automated SBO File Management

  • Auto-update detection: compare the upstream commit SHA at startup.
  • Versioning: keep timestamped local copies and automatically delete the oldest SBO file once more than two versions accumulate (see the sketch after this list).
  • Formats: support .obo and .json with a 4-step validation pipeline.
  • User interaction: apply updates, download SBO files, or upload custom SBO files.
  • Integrity: round-trip conversion tests to ensure lossless persistence.
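
The update check and version pruning could look roughly like the sketch below. The repository endpoint, file name, and cache layout are placeholders for illustration, not SBOannotator's actual configuration.

```python
from pathlib import Path

import requests

# Placeholder locations; the real repo path and cache dir may differ.
COMMITS_API = "https://api.github.com/repos/EBI-BioModels/SBO/commits"
CACHE_DIR = Path.home() / ".sboannotator" / "sbo_cache"
SHA_FILE = CACHE_DIR / "last_sha.txt"

def latest_remote_sha() -> str:
    """SHA of the newest upstream commit touching the SBO file."""
    resp = requests.get(
        COMMITS_API, params={"path": "SBO_OBO.obo", "per_page": 1}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()[0]["sha"]

def update_needed() -> bool:
    """Compare the remote commit SHA with the one recorded locally."""
    local = SHA_FILE.read_text().strip() if SHA_FILE.exists() else ""
    return latest_remote_sha() != local

def prune_old_versions(keep: int = 2) -> None:
    """Keep only the `keep` newest timestamped SBO files; delete the rest."""
    versions = sorted(CACHE_DIR.glob("sbo_*.obo"), reverse=True)  # timestamped names sort chronologically
    for old in versions[keep:]:
        old.unlink()
```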

2) Three-Layer Rule-Based Annotation Workflow

Layer 1 — Configuration / Strategy
Lets users configure which databases to query, in what order, and how many to use.
Layer 2 — Adapter Execution
Unified multi-database adapters for identifier extraction and EC-number lookup:

  • BiGG (direct API), KEGG (regex + REST), SEED (Solr), Reactome (web parsing + QuickGO)
  • Early termination: stop querying once a precise SBO term is found
  • EC-number truncation: normalize each EC by truncating at its first non-numeric component for consistency (sketch after this list)
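
To make the truncation rule concrete, here is a minimal sketch (the function name is hypothetical):

```python
def truncate_ec(ec: str) -> str:
    """Keep only the leading all-numeric components of an EC number.

    '1.1.1.-'  -> '1.1.1'
    '2.7.1.n2' -> '2.7.1'
    """
    parts = []
    for part in ec.split("."):
        if not part.isdigit():
            break
        parts.append(part)
    return ".".join(parts)
```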

Layer 3 — LLM Filter
Targets only the reactions that need disambiguation (a routing sketch follows this list):

  • Distinct handling for single vs multiple ECs
  • Filter conflicting ECs to avoid ambiguity
  • Log the selected ECs and fetch reaction context to build the LLM input
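
A rough sketch of how such a filter might route reactions, reusing the truncate_ec helper above; the exact conflict rule here is an assumption, not SBOannotator's documented behavior:

```python
def ecs_for_llm(ec_numbers: list[str]) -> list[str]:
    """Select which ECs (if any) to hand to the LLM for disambiguation."""
    ecs = sorted({truncate_ec(ec) for ec in ec_numbers if ec})
    if len(ecs) <= 1:
        return ecs                     # zero or one EC: no ambiguity to resolve
    top_classes = {ec.split(".", 1)[0] for ec in ecs}
    if len(top_classes) > 1:
        return []                      # conflicting top-level classes: filter out
    return ecs                         # related ECs: let the LLM disambiguate
```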

3) Fine-tuned LLM for EC → SBO

  • Base model: BioBERT (dmis-lab/biobert-base-cased-v1.1)
  • Architecture: 768-dimensional encoder → FC head (768→384→42) with Focal Loss (~111M params); see the sketch after this list
  • Two-stage training:
    • Stage-1: 331 expert samples, 80 epochs
    • Stage-2: 6,966 GPT-generated samples, lower LR, 10 epochs (noise adaptation)
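
A sketch of the classifier in PyTorch/transformers terms. The layer sizes (768→384→42) and base checkpoint follow the report; pooling, dropout, and the focal-loss gamma are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ECtoSBOClassifier(nn.Module):
    """BioBERT encoder with a 768 -> 384 -> 42 classification head."""

    def __init__(self, num_classes: int = 42):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
        self.head = nn.Sequential(
            nn.Linear(768, 384),
            nn.ReLU(),
            nn.Dropout(0.1),                # assumed regularization
            nn.Linear(384, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token embedding
        return self.head(cls)

class FocalLoss(nn.Module):
    """Focal loss (Lin et al., 2017); gamma = 2.0 is an assumed default."""

    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        ce = nn.functional.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)                 # probability of the true class
        return ((1 - pt) ** self.gamma * ce).mean()
```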

4) GUI Application

  • PyQt5-based Python runtime GUI with side-by-side pre/post annotation tables and file upload/download (a layout skeleton follows this list)
  • Packaged via PyInstaller as a macOS DMG (ships the rule-based pipeline)
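
A minimal skeleton of the side-by-side layout; the real GUI adds upload/download controls and interactive visualization on top of this:

```python
import sys

from PyQt5.QtWidgets import (
    QApplication, QHBoxLayout, QMainWindow, QTableWidget, QWidget,
)

class AnnotationWindow(QMainWindow):
    """Side-by-side pre/post annotation tables (illustrative skeleton)."""

    def __init__(self):
        super().__init__()
        self.setWindowTitle("SBOannotator")
        central = QWidget()
        layout = QHBoxLayout(central)
        self.before = QTableWidget(0, 2)
        self.after = QTableWidget(0, 2)
        self.before.setHorizontalHeaderLabels(["Reaction", "SBO (before)"])
        self.after.setHorizontalHeaderLabels(["Reaction", "SBO (after)"])
        layout.addWidget(self.before)
        layout.addWidget(self.after)
        self.setCentralWidget(central)

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = AnnotationWindow()
    window.show()
    sys.exit(app.exec_())
```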

Results

  • SBO updates: switched to direct GitHub download (~1 min per update)
  • Coverage: across 108 models, 3,317 reactions upgraded from generic SBO:0000176 to specific terms via multi-database integration
  • Efficiency: the initial naive multi-database flow took ~14 h per model; early termination cut end-to-end time to a mean of 432.99 s (~7.2 min) per model
  • LLM classification: Top-1 accuracy 94.00% (42 classes); Macro-F1 0.4563, Weighted-F1 0.9184; mean confidence 78.83% (median 81.41%, max 88.21%); automatic fallback to Stage-1 when Stage-2 degrades
  • Deliverables: Python runtime GUI and packaged macOS DMG app

Constraints & Future Improvements

  • Packaging: DMG includes rule-based workflow; LLM features provided via CLI to avoid heavy runtime dependencies
  • Data quality control: intelligent filtering of GPT-generated training data
  • More expert labels: expand high-quality human annotations for better generalization
  • Fine-tune the LLM to predict SBO terms for reactions without EC numbers

Thank You

Thanks to the SBOannotator community and Google Summer of Code for this opportunity. Special thanks to mentors Nantia Leonidou and Andreas Dräger for guidance and support. I will continue to monitor issues and PRs and look forward to future collaborations.


Quick Links

…the link list in the README, because they are too large to upload to GitHub.