1 change: 1 addition & 0 deletions .gitignore
@@ -146,4 +146,5 @@ testSQL.py
/dist/
/static/
/templates/

.idea/
149 changes: 149 additions & 0 deletions src/llm/README.md
@@ -0,0 +1,149 @@

# 📘 EC & SBO Comment Vectorization - README

## 🧩 Overview

This project provides scripts for vectorizing enzyme (EC) and SBO comment data using a pre-trained sentence embedding model from the [Sentence-Transformers](https://www.sbert.net/) library.

- **`ec_vector.py`**: Vectorizes EC (Enzyme Commission) comment data
- **`sbo_vector.py`**: Vectorizes SBO (Systems Biology Ontology) comment data

Both scripts process only records with non-empty comments, convert those comments into dense vector representations, and save the results in multiple formats for further analysis, visualization, or machine-learning tasks.

---

## 📊 Data Sources

### EC Data
- **Source**: Downloaded from the [ExplorEnz](https://www.enzyme-database.org/) database
- **Processing**: Filtered in MySQL to include only EC entries with non-empty comments
- **Input File**: `entry_with_comments_202507250622.csv`

### SBO Data
- **Source**: Fetched from the local SBO JSON file `SBO_OBO_20230516_110919.json`
- **Processing Scripts**:
  - `sbo_insert.py`: extracts SBO ontology terms from the JSON file via a depth-first traversal from the root node, generates SQL `INSERT` statements for database storage, and verifies the results against ground-truth data (see the sketch after this list)
  - `insert_sbo.json.sql`: SQL file containing the generated `INSERT` statements for SBO terms with comments
- **Input File**: `sbo_terms_202507292305.csv`
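
For orientation, the traversal and SQL generation in `sbo_insert.py` can be pictured roughly as follows. This is a minimal sketch, not the shipped script: the JSON layout (an OBO-Graphs-style `nodes`/`edges` structure), the `sbo_terms` table and column names, and the root identifier are all assumptions.

```python
import json

# Assumed layout: OBO-Graphs JSON keeps terms under "nodes" and is_a links
# under "edges" (sub = child, obj = parent).
with open("SBO_OBO_20230516_110919.json") as f:
    graph = json.load(f)["graphs"][0]

nodes = {n["id"]: n for n in graph.get("nodes", [])}
children = {}
for edge in graph.get("edges", []):
    children.setdefault(edge["obj"], []).append(edge["sub"])

def walk(node_id, out):
    """Depth-first traversal emitting one INSERT per term that carries a comment."""
    node = nodes.get(node_id, {})
    comment = (node.get("meta", {}).get("comments") or [""])[0]
    if comment.strip():
        # Real code should use parameterized queries; string formatting is illustration only.
        out.append(
            "INSERT INTO sbo_terms (id, name, comment, is_leaf) VALUES "
            f"('{node_id}', '{node.get('lbl', '')}', '{comment}', {int(node_id not in children)});"
        )
    for child in children.get(node_id, []):
        walk(child, out)

statements = []
walk("SBO:0000000", statements)  # assumed root term
print(f"Generated {len(statements)} INSERT statements")
```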

---

## 📂 Input Format

### EC Data Input
The EC input CSV file must contain the following columns:
- `ec_num`: Enzyme Commission number (e.g., `"1.1.1.1"`)
- `comments`: Descriptive comment text associated with the EC number

### SBO Data Input
The SBO input CSV file must contain the following columns (a quick loading check is sketched after this list):
- `id`: SBO identifier (e.g., `"SBO:0000176"`)
- `name`: SBO term name
- `comment`: Descriptive comment text associated with the SBO term
- `is_leaf`: Boolean indicating if the term is a leaf node
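
A quick check that both inputs look as expected before vectorizing (filenames from the sections above; only the documented columns are assumed):

```python
import pandas as pd

ec_df = pd.read_csv("entry_with_comments_202507250622.csv")
sbo_df = pd.read_csv("sbo_terms_202507292305.csv")

# Verify the columns each script relies on are present.
assert {"ec_num", "comments"} <= set(ec_df.columns)
assert {"id", "name", "comment", "is_leaf"} <= set(sbo_df.columns)

# Both scripts expect non-empty comment text.
print("EC rows with comments:", ec_df["comments"].fillna("").str.strip().ne("").sum())
print("SBO rows with comments:", sbo_df["comment"].fillna("").str.strip().ne("").sum())
```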

---

## ⚙️ What the Scripts Do

### `ec_vector.py`
1. Loads the EC CSV using `pandas`.
2. Uses `SentenceTransformer('all-MiniLM-L6-v2')` to vectorize the `comments` column.
3. Constructs a result dictionary and a browsable DataFrame with embedding vectors.
4. Saves the outputs to multiple files:
   - `.pkl`: Full Python dictionary with vectors and metadata
   - `.npy`: Raw NumPy matrix for fast loading
   - `.csv`: Full table with all metadata and embedding values per dimension

### `sbo_vector.py`
1. Loads the SBO CSV using `pandas`.
2. Uses `SentenceTransformer('all-MiniLM-L6-v2')` to vectorize the `comment` column.
3. Constructs a result dictionary and a browsable DataFrame with embedding vectors.
4. Saves the outputs in the same formats as the EC vectorization (a minimal end-to-end sketch follows).
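
For reference, the SBO steps reduce to roughly the following minimal sketch (column names taken from the input section above; the actual `sbo_vector.py` may differ in details):

```python
import pickle
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("sbo_terms_202507292305.csv")
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode every SBO comment into a 384-dimensional vector.
embeddings = model.encode(df["comment"].tolist(), show_progress_bar=True)

results = {
    "sbo_ids": df["id"].tolist(),
    "names": df["name"].tolist(),
    "comments": df["comment"].tolist(),
    "is_leaf": df["is_leaf"].tolist(),
    "embeddings": embeddings,
}
with open("sbo_comments_vectors.pkl", "wb") as f:
    pickle.dump(results, f)
np.save("sbo_embeddings.npy", embeddings)
```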

---

## 📄 Output Files

### EC Output Files

#### ✅ `ec_comments_vectors.pkl`
- A Python `dict` serialized via `pickle`, containing:
  - `'ec_numbers'`: List of EC numbers
  - `'comments'`: List of corresponding comments
  - `'embeddings'`: A NumPy array of shape `(N, 384)` representing each comment vector

#### ✅ `ec_embeddings.npy`
- A NumPy `.npy` file containing only the embeddings: shape `(N, 384)`

#### ✅ `ec_vectorization_results.csv`
Tabular file with the following columns (a loading example covering all three EC output files follows this list):
- `ec_num`: EC number
- `comments`: Functional description
- `embedding_dim`: Embedding dimensionality (384)
- `vector_norm`: L2 norm of the embedding vector
- `dim_0`...`dim_383`: The actual embedding vector dimensions
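
A short example of loading the three EC output files (keys, shapes, and filenames as documented above):

```python
import pickle
import numpy as np
import pandas as pd

# Fast path: just the (N, 384) embedding matrix.
embeddings = np.load("ec_embeddings.npy")

# Full record: EC numbers, comments, and the same embeddings.
with open("ec_comments_vectors.pkl", "rb") as f:
    ec_data = pickle.load(f)

# Browsable table, one row per EC entry.
table = pd.read_csv("ec_vectorization_results.csv")

print(embeddings.shape, len(ec_data["ec_numbers"]), table.shape)
```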

### SBO Output Files

#### ✅ `sbo_comments_vectors.pkl`
- A Python `dict` serialized via `pickle`, containing:
  - `'sbo_ids'`: List of SBO identifiers
  - `'names'`: List of SBO term names
  - `'comments'`: List of corresponding comments
  - `'is_leaf'`: List of boolean values indicating leaf nodes
  - `'embeddings'`: A NumPy array of shape `(N, 384)` representing each comment vector

#### ✅ `sbo_embeddings.npy`
- A NumPy `.npy` file containing only the embeddings: shape `(N, 384)`

#### ✅ `sbo_vectorization_results.csv`
Tabular file with the following columns (the embedding matrix can be rebuilt from the `dim_*` columns, as sketched after this list):
- `id`: SBO identifier
- `comments`: Descriptive comment
- `embedding_dim`: Embedding dimensionality (384)
- `vector_norm`: L2 norm of the embedding vector
- `dim_0`...`dim_383`: The actual embedding vector dimensions
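
The wide CSVs are convenient for inspection, and the embedding matrix can be rebuilt from the `dim_*` columns if only the CSV is at hand (shown for the SBO file; the EC file works the same way):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("sbo_vectorization_results.csv")

# Collect dim_0 ... dim_383 back into an (N, 384) matrix.
dim_cols = [c for c in df.columns if c.startswith("dim_")]
embeddings = df[dim_cols].to_numpy()

# vector_norm in the CSV should match the recomputed L2 norms (up to float rounding).
print(np.allclose(np.linalg.norm(embeddings, axis=1), df["vector_norm"], atol=1e-4))
```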

---

## 🧠 Model Details

- Model used: [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Embedding dimension: `384`
- Encoding type: Mean pooling over transformer token embeddings (a quick sanity check is sketched below)
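
A quick sanity check of the model and its output dimensionality (standard sentence-transformers usage; the example sentence is arbitrary):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384

# encode() on a list returns a (1, 384) NumPy array here.
vec = model.encode(["Catalyses the oxidation of primary alcohols."])
print(vec.shape)
```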

---

## 🛠️ How to Use

### For EC Vectorization:
```bash
python ec_vector.py
```

Make sure the input CSV path is set correctly inside the script:
```python
csv_file = "entry_with_comments_202507250622.csv"
```

### For SBO Vectorization:
```bash
python sbo_vector.py
```

Make sure the input CSV path is set correctly inside the script:
```python
csv_file = "sbo_terms_202507292305.csv"
```

Output files will be saved in the same directory as the scripts.

---

## 📌 Notes

- Only entries **with non-empty comments** are processed for both EC and SBO data.
- Both scripts use the same embedding model, so EC and SBO embeddings live in the same vector space and can be compared directly (see the sketch after these notes).
- The `vector_norm` column can be used to analyze how "informative" each comment is in embedding space.
- SBO data includes additional metadata like `name` and `is_leaf` status compared to EC data.
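
As a hypothetical downstream use, EC comments can be matched to SBO terms by cosine similarity over the saved embeddings. The pairing logic below is illustrative only and not part of either script:

```python
import pickle
import numpy as np

with open("ec_comments_vectors.pkl", "rb") as f:
    ec = pickle.load(f)
with open("sbo_comments_vectors.pkl", "rb") as f:
    sbo = pickle.load(f)

# Normalize rows so the dot product is cosine similarity.
ec_vecs = ec["embeddings"] / np.linalg.norm(ec["embeddings"], axis=1, keepdims=True)
sbo_vecs = sbo["embeddings"] / np.linalg.norm(sbo["embeddings"], axis=1, keepdims=True)

# Cosine similarity of every EC comment against every SBO comment.
sims = ec_vecs @ sbo_vecs.T
best = sims.argmax(axis=1)
for i in range(min(3, len(best))):
    print(ec["ec_numbers"][i], "->", sbo["sbo_ids"][best[i]], f"(cos={sims[i, best[i]]:.3f})")
```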
Empty file added src/llm/__init__.py
Empty file.
Binary file added src/llm/ec_comments_vectors.pkl
Binary file not shown.
Binary file added src/llm/ec_embeddings.npy
Binary file not shown.
127 changes: 127 additions & 0 deletions src/llm/ec_vector.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
# import pandas as pd
# import numpy as np
# from sentence_transformers import SentenceTransformer
# import pickle
# import os
#
#
# def vectorize_ec_comments(csv_path, output_dir='vectors'):
# """
# Vectorize EC records that have comments.
# """
# # Create output directory
# os.makedirs(output_dir, exist_ok=True)
#
# # Read the CSV file
# print(f"Reading CSV file: {csv_path}")
# df = pd.read_csv(csv_path)
#
# # Check data structure
# print(f"Data shape: {df.shape}")
# print(f"Columns: {df.columns.tolist()}")
#
# # Filter records with non-empty comments
# df_with_comments = df[df['comments'].notna() & (df['comments'].str.strip() != '')]
# print(f"Number of records with comments: {len(df_with_comments)}")
#
# if len(df_with_comments) == 0:
# print("No records with comments found.")
# return
#
# # Load pretrained SentenceTransformer model
# print("Loading SentenceTransformer model...")
# model = SentenceTransformer('all-MiniLM-L6-v2')
#
# # Vectorize the comments
# print("Vectorizing comments...")
# comments = df_with_comments['comments'].tolist()
# embeddings = model.encode(comments, show_progress_bar=True)
#
# # Save results
# results = {
# 'ec_numbers': df_with_comments['ec_num'].tolist(),
# 'accepted_names': df_with_comments['accepted_name'].tolist(),
# 'reactions': df_with_comments['reaction'].tolist(),
# 'comments': comments,
# 'embeddings': embeddings
# }
#
# # Save the full result as pickle
# with open(os.path.join(output_dir, 'ec_comments_vectors.pkl'), 'wb') as f:
# pickle.dump(results, f)
#
# # Save only the embeddings as .npy
# np.save(os.path.join(output_dir, 'embeddings.npy'), embeddings)
#
# # Save index information as CSV
# index_df = df_with_comments[['ec_num', 'accepted_name']].copy()
# index_df.to_csv(os.path.join(output_dir, 'ec_index.csv'), index=False)
#
# # Done
# print("Vectorization complete!")
# print(f"- Embedding shape: {embeddings.shape}")
# print(f"- Saved in: {output_dir}/")
# print(f"- Files: ec_comments_vectors.pkl, embeddings.npy, ec_index.csv")
#
#
# if __name__ == "__main__":
# # Set your CSV file path here
# csv_file = "enzyme_data.csv" # Replace with your actual file path
#
# if os.path.exists(csv_file):
# vectorize_ec_comments(csv_file)
# else:
# print(f"CSV file does not exist: {csv_file}")
# print("Please generate a CSV file containing EC data first.")

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import pickle

# Load the CSV file (assumed to contain 'ec_num' and 'comments' columns)
df = pd.read_csv('entry_with_comments_202507250622.csv') # Replace with your actual filename

# Load the pre-trained SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Vectorize the comments
print(f"Vectorizing comments for {len(df)} EC records...")
embeddings = model.encode(df['comments'].tolist(), show_progress_bar=True)

# Save the results
results = {
    'ec_numbers': df['ec_num'].tolist(),
    'comments': df['comments'].tolist(),
    'embeddings': embeddings
}

with open('ec_comments_vectors.pkl', 'wb') as f:
    pickle.dump(results, f)

np.save('ec_embeddings.npy', embeddings)

# Build a browsable DataFrame: metadata, vector norm, and one column per embedding dimension
vector_cols = {f'dim_{i}': embeddings[:, i] for i in range(embeddings.shape[1])}
results_df = pd.DataFrame({
    'ec_num': df['ec_num'],
    'comments': df['comments'],
    'embedding_dim': [embeddings.shape[1]] * len(df),
    'vector_norm': np.linalg.norm(embeddings, axis=1),
    **vector_cols
})

# Save to CSV
results_df.to_csv('ec_vectorization_results.csv', index=False)
print("CSV file saved: ec_vectorization_results.csv")


print(f"Vectorization complete! Embedding shape: {embeddings.shape}")
print("Saved files: ec_comments_vectors.pkl, ec_embeddings.npy")