1 change: 1 addition & 0 deletions .gitignore
@@ -146,4 +146,5 @@ testSQL.py
/dist/
/static/
/templates/

.idea/
149 changes: 149 additions & 0 deletions src/llm/README.md
@@ -0,0 +1,149 @@

# 📘 EC & SBO Comment Vectorization - README

## 🧩 Overview

This project provides scripts for vectorizing enzyme (EC) and SBO comment data using a pre-trained sentence embedding model from the [Sentence-Transformers](https://www.sbert.net/) library.

- **`ec_vector.py`**: Vectorizes EC (Enzyme Commission) comment data
- **`sbo_vector.py`**: Vectorizes SBO (Systems Biology Ontology) comment data

Both scripts process only records with non-empty comments, convert those comments into dense vector representations, and save the results in multiple formats for further analysis, visualization, or machine-learning tasks.

---

## 📊 Data Sources

### EC Data
- **Source**: Downloaded from the [ExplorEnz](https://www.enzyme-database.org/) database
- **Processing**: Filtered in MySQL to include only EC entries with non-empty comments
- **Input File**: `entry_with_comments_202507250622.csv`

### SBO Data
- **Source**: Fetched from the local SBO JSON file `SBO_OBO_20230516_110919.json`
- **Processing Scripts**:
  - `sbo_insert.py`: extracts SBO ontology terms from the JSON file via a depth-first traversal from the root node, generates SQL `INSERT` statements for database storage, and verifies the results against ground-truth data (see the sketch after this list)
  - `insert_sbo.json.sql`: SQL file containing the generated `INSERT` statements for SBO terms with comments
- **Input File**: `sbo_terms_202507292305.csv`
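
For orientation, the traversal and SQL generation in `sbo_insert.py` can be pictured roughly as follows. This is a minimal sketch, not the shipped script: the JSON layout (an OBO-Graphs-style `nodes`/`edges` structure), the `sbo_terms` table and column names, and the root identifier are all assumptions.

```python
import json

# Assumed layout: OBO-Graphs JSON keeps terms under "nodes" and is_a links
# under "edges" (sub = child, obj = parent).
with open("SBO_OBO_20230516_110919.json") as f:
    graph = json.load(f)["graphs"][0]

nodes = {n["id"]: n for n in graph.get("nodes", [])}
children = {}
for edge in graph.get("edges", []):
    children.setdefault(edge["obj"], []).append(edge["sub"])

def walk(node_id, out):
    """Depth-first traversal emitting one INSERT per term that carries a comment."""
    node = nodes.get(node_id, {})
    comment = (node.get("meta", {}).get("comments") or [""])[0]
    if comment.strip():
        # Real code should use parameterized queries; string formatting is illustration only.
        out.append(
            "INSERT INTO sbo_terms (id, name, comment, is_leaf) VALUES "
            f"('{node_id}', '{node.get('lbl', '')}', '{comment}', {int(node_id not in children)});"
        )
    for child in children.get(node_id, []):
        walk(child, out)

statements = []
walk("SBO:0000000", statements)  # assumed root term
print(f"Generated {len(statements)} INSERT statements")
```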

---

## 📂 Input Format

### EC Data Input
The EC input CSV file must contain the following columns:
- `ec_num`: Enzyme Commission number (e.g., `"1.1.1.1"`)
- `comments`: Descriptive comment text associated with the EC number

### SBO Data Input
The SBO input CSV file must contain the following columns (a quick loading check is sketched after this list):
- `id`: SBO identifier (e.g., `"SBO:0000176"`)
- `name`: SBO term name
- `comment`: Descriptive comment text associated with the SBO term
- `is_leaf`: Boolean indicating if the term is a leaf node
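
A quick check that both inputs look as expected before vectorizing (filenames from the sections above; only the documented columns are assumed):

```python
import pandas as pd

ec_df = pd.read_csv("entry_with_comments_202507250622.csv")
sbo_df = pd.read_csv("sbo_terms_202507292305.csv")

# Verify the columns each script relies on are present.
assert {"ec_num", "comments"} <= set(ec_df.columns)
assert {"id", "name", "comment", "is_leaf"} <= set(sbo_df.columns)

# Both scripts expect non-empty comment text.
print("EC rows with comments:", ec_df["comments"].fillna("").str.strip().ne("").sum())
print("SBO rows with comments:", sbo_df["comment"].fillna("").str.strip().ne("").sum())
```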

---

## ⚙️ What the Scripts Do

### `ec_vector.py`
1. Loads the EC CSV using `pandas`.
2. Uses `SentenceTransformer('all-MiniLM-L6-v2')` to vectorize the `comments` column.
3. Constructs a result dictionary and a browsable DataFrame with embedding vectors.
4. Saves the outputs to multiple files:
   - `.pkl`: Full Python dictionary with vectors and metadata
   - `.npy`: Raw NumPy matrix for fast loading
   - `.csv`: Full table with all metadata and embedding values per dimension

### `sbo_vector.py`
1. Loads the SBO CSV using `pandas`.
2. Uses `SentenceTransformer('all-MiniLM-L6-v2')` to vectorize the `comment` column.
3. Constructs a result dictionary and a browsable DataFrame with embedding vectors.
4. Saves the outputs in the same formats as the EC vectorization (a minimal end-to-end sketch follows).
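
For reference, the SBO steps reduce to roughly the following minimal sketch (column names taken from the input section above; the actual `sbo_vector.py` may differ in details):

```python
import pickle
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("sbo_terms_202507292305.csv")
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode every SBO comment into a 384-dimensional vector.
embeddings = model.encode(df["comment"].tolist(), show_progress_bar=True)

results = {
    "sbo_ids": df["id"].tolist(),
    "names": df["name"].tolist(),
    "comments": df["comment"].tolist(),
    "is_leaf": df["is_leaf"].tolist(),
    "embeddings": embeddings,
}
with open("sbo_comments_vectors.pkl", "wb") as f:
    pickle.dump(results, f)
np.save("sbo_embeddings.npy", embeddings)
```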

---

## 📄 Output Files

### EC Output Files

#### ✅ `ec_comments_vectors.pkl`
- A Python `dict` serialized via `pickle`, containing:
  - `'ec_numbers'`: List of EC numbers
  - `'comments'`: List of corresponding comments
  - `'embeddings'`: A NumPy array of shape `(N, 384)` representing each comment vector

#### ✅ `ec_embeddings.npy`
- A NumPy `.npy` file containing only the embeddings: shape `(N, 384)`

#### ✅ `ec_vectorization_results.csv`
Tabular file with the following columns (a loading example covering all three EC output files follows this list):
- `ec_num`: EC number
- `comments`: Functional description
- `embedding_dim`: Embedding dimensionality (384)
- `vector_norm`: L2 norm of the embedding vector
- `dim_0`...`dim_383`: The actual embedding vector dimensions
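
A short example of loading the three EC output files (keys, shapes, and filenames as documented above):

```python
import pickle
import numpy as np
import pandas as pd

# Fast path: just the (N, 384) embedding matrix.
embeddings = np.load("ec_embeddings.npy")

# Full record: EC numbers, comments, and the same embeddings.
with open("ec_comments_vectors.pkl", "rb") as f:
    ec_data = pickle.load(f)

# Browsable table, one row per EC entry.
table = pd.read_csv("ec_vectorization_results.csv")

print(embeddings.shape, len(ec_data["ec_numbers"]), table.shape)
```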

### SBO Output Files

#### ✅ `sbo_comments_vectors.pkl`
- A Python `dict` serialized via `pickle`, containing:
  - `'sbo_ids'`: List of SBO identifiers
  - `'names'`: List of SBO term names
  - `'comments'`: List of corresponding comments
  - `'is_leaf'`: List of boolean values indicating leaf nodes
  - `'embeddings'`: A NumPy array of shape `(N, 384)` representing each comment vector

#### ✅ `sbo_embeddings.npy`
- A NumPy `.npy` file containing only the embeddings: shape `(N, 384)`

#### ✅ `sbo_vectorization_results.csv`
Tabular file with the following columns (the embedding matrix can be rebuilt from the `dim_*` columns, as sketched after this list):
- `id`: SBO identifier
- `comments`: Descriptive comment
- `embedding_dim`: Embedding dimensionality (384)
- `vector_norm`: L2 norm of the embedding vector
- `dim_0`...`dim_383`: The actual embedding vector dimensions
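
The wide CSVs are convenient for inspection, and the embedding matrix can be rebuilt from the `dim_*` columns if only the CSV is at hand (shown for the SBO file; the EC file works the same way):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("sbo_vectorization_results.csv")

# Collect dim_0 ... dim_383 back into an (N, 384) matrix.
dim_cols = [c for c in df.columns if c.startswith("dim_")]
embeddings = df[dim_cols].to_numpy()

# vector_norm in the CSV should match the recomputed L2 norms (up to float rounding).
print(np.allclose(np.linalg.norm(embeddings, axis=1), df["vector_norm"], atol=1e-4))
```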

---

## 🧠 Model Details

- Model used: [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Embedding dimension: `384`
- Encoding type: Mean pooling over transformer token embeddings (a quick sanity check is sketched below)
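
A quick sanity check of the model and its output dimensionality (standard sentence-transformers usage; the example sentence is arbitrary):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384

# encode() on a list returns a (1, 384) NumPy array here.
vec = model.encode(["Catalyses the oxidation of primary alcohols."])
print(vec.shape)
```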

---

## 🛠️ How to Use

### For EC Vectorization:
```bash
python ec_vector.py
```

Make sure the input CSV path is set correctly inside the script:
```python
csv_file = "entry_with_comments_202507250622.csv"
```

### For SBO Vectorization:
```bash
python sbo_vector.py
```

Make sure the input CSV path is set correctly inside the script:
```python
csv_file = "sbo_terms_202507292305.csv"
```

Output files will be saved in the same directory as the scripts.

---

## 📌 Notes

- Only entries **with non-empty comments** are processed for both EC and SBO data.
- Both scripts use the same embedding model, so EC and SBO embeddings live in the same vector space and can be compared directly (see the sketch after these notes).
- The `vector_norm` column can be used to analyze how "informative" each comment is in embedding space.
- SBO data includes additional metadata like `name` and `is_leaf` status compared to EC data.
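
As a hypothetical downstream use, EC comments can be matched to SBO terms by cosine similarity over the saved embeddings. The pairing logic below is illustrative only and not part of either script:

```python
import pickle
import numpy as np

with open("ec_comments_vectors.pkl", "rb") as f:
    ec = pickle.load(f)
with open("sbo_comments_vectors.pkl", "rb") as f:
    sbo = pickle.load(f)

# Normalize rows so the dot product is cosine similarity.
ec_vecs = ec["embeddings"] / np.linalg.norm(ec["embeddings"], axis=1, keepdims=True)
sbo_vecs = sbo["embeddings"] / np.linalg.norm(sbo["embeddings"], axis=1, keepdims=True)

# Cosine similarity of every EC comment against every SBO comment.
sims = ec_vecs @ sbo_vecs.T
best = sims.argmax(axis=1)
for i in range(min(3, len(best))):
    print(ec["ec_numbers"][i], "->", sbo["sbo_ids"][best[i]], f"(cos={sims[i, best[i]]:.3f})")
```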
Empty file added src/llm/__init__.py
Empty file.
Binary file added src/llm/ec_comments_vectors.pkl
Binary file not shown.
Binary file added src/llm/ec_embeddings.npy
Binary file not shown.
127 changes: 127 additions & 0 deletions src/llm/ec_vector.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
# import pandas as pd
# import numpy as np
# from sentence_transformers import SentenceTransformer
# import pickle
# import os
#
#
# def vectorize_ec_comments(csv_path, output_dir='vectors'):
# """
# Vectorize EC records that have comments.
# """
# # Create output directory
# os.makedirs(output_dir, exist_ok=True)
#
# # Read the CSV file
# print(f"Reading CSV file: {csv_path}")
# df = pd.read_csv(csv_path)
#
# # Check data structure
# print(f"Data shape: {df.shape}")
# print(f"Columns: {df.columns.tolist()}")
#
# # Filter records with non-empty comments
# df_with_comments = df[df['comments'].notna() & (df['comments'].str.strip() != '')]
# print(f"Number of records with comments: {len(df_with_comments)}")
#
# if len(df_with_comments) == 0:
# print("No records with comments found.")
# return
#
# # Load pretrained SentenceTransformer model
# print("Loading SentenceTransformer model...")
# model = SentenceTransformer('all-MiniLM-L6-v2')
#
# # Vectorize the comments
# print("Vectorizing comments...")
# comments = df_with_comments['comments'].tolist()
# embeddings = model.encode(comments, show_progress_bar=True)
#
# # Save results
# results = {
# 'ec_numbers': df_with_comments['ec_num'].tolist(),
# 'accepted_names': df_with_comments['accepted_name'].tolist(),
# 'reactions': df_with_comments['reaction'].tolist(),
# 'comments': comments,
# 'embeddings': embeddings
# }
#
# # Save the full result as pickle
# with open(os.path.join(output_dir, 'ec_comments_vectors.pkl'), 'wb') as f:
# pickle.dump(results, f)
#
# # Save only the embeddings as .npy
# np.save(os.path.join(output_dir, 'embeddings.npy'), embeddings)
#
# # Save index information as CSV
# index_df = df_with_comments[['ec_num', 'accepted_name']].copy()
# index_df.to_csv(os.path.join(output_dir, 'ec_index.csv'), index=False)
#
# # Done
# print("Vectorization complete!")
# print(f"- Embedding shape: {embeddings.shape}")
# print(f"- Saved in: {output_dir}/")
# print(f"- Files: ec_comments_vectors.pkl, embeddings.npy, ec_index.csv")
#
#
# if __name__ == "__main__":
# # Set your CSV file path here
# csv_file = "enzyme_data.csv" # Replace with your actual file path
#
# if os.path.exists(csv_file):
# vectorize_ec_comments(csv_file)
# else:
# print(f"CSV file does not exist: {csv_file}")
# print("Please generate a CSV file containing EC data first.")

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import pickle

# Load the CSV file (assumed to contain 'ec_num' and 'comments' columns)
df = pd.read_csv('entry_with_comments_202507250622.csv') # Replace with your actual filename

# Load the pre-trained SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Vectorize the comments
print(f"Vectorizing comments for {len(df)} EC records...")
embeddings = model.encode(df['comments'].tolist(), show_progress_bar=True)

# Save the results
results = {
    'ec_numbers': df['ec_num'].tolist(),
    'comments': df['comments'].tolist(),
    'embeddings': embeddings
}

with open('ec_comments_vectors.pkl', 'wb') as f:
    pickle.dump(results, f)

np.save('ec_embeddings.npy', embeddings)

# Build a browsable DataFrame: metadata, vector norm, and one column per embedding dimension
vector_cols = {f'dim_{i}': embeddings[:, i] for i in range(embeddings.shape[1])}
results_df = pd.DataFrame({
    'ec_num': df['ec_num'],
    'comments': df['comments'],
    'embedding_dim': [embeddings.shape[1]] * len(df),
    'vector_norm': np.linalg.norm(embeddings, axis=1),
    **vector_cols
})

# Save to CSV
results_df.to_csv('ec_vectorization_results.csv', index=False)
print("CSV file saved: ec_vectorization_results.csv")


print(f"Vectorization complete! Embedding shape: {embeddings.shape}")
print("Saved files: ec_comments_vectors.pkl, ec_embeddings.npy")