
Semantic IDs: Training an LLM-Recommender Hybrid

Teaching language models to speak in product IDs and natural language
Writeup | Video Demo | Notebook Demo | Models on HF

An experimental implementation of an LLM-recommender hybrid that can make recommendations via conversation. Unlike traditional approaches that use retrieval or tools, this model natively understands items as part of its vocabulary—it's "bilingual" in English and item IDs.

# The model can seamlessly mix natural language and recommendations
INPUT = "I like animal and cute games. <|rec|>"
>>> "Animal Crossing: New Leaf", "DISNEY INFINITY Starter Pack", "Nintendogs + Cats"

# It can explain its recommendations
INPUT = "I just finished Dragon Quest Heroes II. Suggest another <|rec|> and explain why:"
>>> "Nights of Azure - PlayStation 4"
>>> "Both are action RPGs for PS4 with combat focus and character progression..."

Quick Start

Try the Demo

# Clone the repository
git clone https://github.com/eugeneyan/semantic-ids-llm.git
cd semantic-ids-llm

# Install dependencies with uv
uv sync

# Run the demo notebook
jupyter lab demo.ipynb

Use Pre-trained Models

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the finetuned model
model = AutoModelForCausalLM.from_pretrained(
    "eugeneyan/semantic-id-qwen3-8b-video-games",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "eugeneyan/semantic-id-qwen3-8b-video-games"
)

# The model understands semantic IDs like <|sid_start|><|sid_64|><|sid_313|>...
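
To generate with the loaded model, continuing from the snippet above, here is a minimal sketch (the prompt format follows the examples in this README; decoding settings are illustrative, not the repo's exact configuration):

prompt = "I like animal and cute games. <|rec|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, keeping semantic ID tokens visible
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))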

What Are Semantic IDs?

Traditional recommender systems use arbitrary, opaque IDs (B0040JHNQG) for items. Semantic IDs (<|sid_0|><|sid_256|><|sid_512|><|sid_768|>) are hierarchical tokens that encode item information, so similar items share common prefixes.
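
To make the shared-prefix idea concrete, here is an illustrative sketch of how an ID string could be assembled from per-level codebook indices (the function and the offset scheme are assumptions consistent with the token IDs shown above, not the repo's exact code):

def to_semantic_id(codes, codebook_size=256):
    # Offset each level into its own token range: level 0 -> <|sid_0..255|>,
    # level 1 -> <|sid_256..511|>, and so on
    tokens = [f"<|sid_{level * codebook_size + c}|>" for level, c in enumerate(codes)]
    return "<|sid_start|>" + "".join(tokens) + "<|sid_end|>"

print(to_semantic_id([0, 0, 0, 0]))  # <|sid_start|><|sid_0|><|sid_256|><|sid_512|><|sid_768|><|sid_end|>
print(to_semantic_id([0, 0, 9, 3]))  # a similar item: shares the <|sid_0|><|sid_256|> prefix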

This project trains a language model to:

  1. Understand items natively - Items are tokens in the vocabulary, not retrieved entities
  2. Make recommendations - Predict next items based on user history
  3. Converse naturally - Steer recommendations through chat
  4. Explain choices - Reason about why items are similar

The key innovation: One unified model instead of separate search/recommendation/chat systems.

Project Structure

semantic-ids-llm/
├── notebooks/
│   ├── 01-prep-items-and-sequences.ipynb
│   ├── 02-clean-descriptions.ipynb
│   ├── 03-clean-titles.ipynb
│   ├── 04-augment-metadata.ipynb
│   ├── 05-update-items-and-sequences.ipynb
│   ├── 06-get-semantic-ids-per-asin.ipynb
│   ├── 07-get-semantic-ids-to-asin-sequences.ipynb
│   ├── 08-prep-finetuning-data.ipynb
│   ├── 09-evaluate-sasrec-baseline.ipynb
│   └── 10-evaluate-sasrec-semantic.ipynb
├── src/
│   ├── embed_items.py           # Item embedding with Qwen3-0.6B
│   ├── train_rqvae.py           # RQ-VAE for semantic IDs
│   ├── train_sasrec.py          # Baseline recommender
│   ├── train_sasrec_semantic_id.py  # Semantic ID recommender
│   ├── finetune_qwen3_8b_vocab.py   # Vocabulary extension
│   └── finetune_qwen3_8b_full.py    # Full model finetuning
├── demo.ipynb          # Interactive demo
├── pyproject.toml      # Dependencies
└── setup.sh            # GPU instance setup script

Installation

Local Setup

# Using uv
pip install uv
uv sync

# Or using pip
pip install -r requirements.txt

GPU Setup

To set up a GPU instance for training:

chmod +x setup.sh
./setup.sh

Requirements

  • Python 3.12+
  • CUDA-capable GPU (8GB+ VRAM for inference, 48GB+ for training)
  • ~50GB disk space for models and data

Examples

Basic Recommendation

# Given user history, recommend next items
INPUT = """<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>,
           <|sid_start|><|sid_64|><|sid_447|><|sid_706|><|sid_768|><|sid_end|>
           <|rec|>"""

>>> "Mass Effect - Xbox 360"

Natural Language Steering

# Combine preferences with platform constraints
INPUT = "I like scifi and action games for Xbox. <|rec|>"

>>> "Star Wars Knights of the Old Republic - Xbox"
>>> "Halo 4 - Xbox 360"
>>> "Fallout: New Vegas - Ultimate Edition"

Multi-turn Conversation

# Turn 1
USER: "I'm looking for games similar to Mario Kart. <|rec|>"
ASSISTANT: "Sonic & All-Stars Racing", "Need for Speed"

# Turn 2
USER: "How about something similar but for Xbox? <|rec|>"
ASSISTANT: "Forza Motorsport 4", "SSX - Xbox 360"

# Turn 3
USER: "Suggest a bundle name for these"
ASSISTANT: "Ultimate Racing & Arcade Fun Bundle"

Training Pipeline

1. Data Preparation

Using the Video Games category of the Amazon Reviews 2023 dataset:

  • 66k products with rich metadata
  • 79k user purchase sequences (avg length: 6.5 items)
  • Cleaned with Gemini 2.5 Flash for quality

2. Semantic ID Generation

Train RQ-VAE to convert item embeddings → semantic IDs:

# RQ-VAE with 3 quantization levels + 1 uniqueness level
uv run -m src.train_rqvae
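
Under the hood, residual quantization assigns one code per level by repeatedly snapping the remaining residual to the nearest codebook entry. A bare-bones sketch of that step (shapes and names are illustrative; src/train_rqvae.py has the real training code):

import torch

def residual_quantize(z, codebooks):
    # z: (dim,) item embedding; codebooks: list of (codebook_size, dim) tensors
    codes, residual = [], z
    for codebook in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), codebook).squeeze(0)
        idx = int(dists.argmin())
        codes.append(idx)
        residual = residual - codebook[idx]  # quantize what remains
    return codes  # one index per level

codebooks = [torch.randn(256, 64) for _ in range(3)]
print(residual_quantize(torch.randn(64), codebooks))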

3. Baseline Comparison

Evaluate semantic IDs vs regular IDs:

# Train baseline SASRec
uv run -m src.train_sasrec

# Train semantic ID variant
uv run -m src.train_sasrec_semantic_id
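
For orientation, SASRec scores the next item with causal self-attention over the user's item sequence. A bare-bones sketch of that idea (hyperparameters are placeholders; the real implementations are in src/train_sasrec.py and src/train_sasrec_semantic_id.py):

import torch
import torch.nn as nn

class TinySASRec(nn.Module):
    def __init__(self, n_items, dim=64, max_len=50):
        super().__init__()
        self.item_emb = nn.Embedding(n_items + 1, dim, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, seq):  # seq: (batch, seq_len) item ids
        pos = torch.arange(seq.size(1), device=seq.device)
        h = self.item_emb(seq) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.encoder(h, mask=mask)  # causal: no peeking at future items
        return h[:, -1] @ self.item_emb.weight.T  # scores over all items

scores = TinySASRec(n_items=66_000)(torch.randint(1, 66_000, (2, 10)))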

4. Language Model Finetuning

Two-stage finetuning of Qwen3-8B:

# Stage 1: Vocabulary extension (add semantic ID tokens)
uv run -m src.finetune_qwen3_8b_vocab

# Stage 2: Full finetuning
uv run -m src.finetune_qwen3_8b_full
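
Stage 1 amounts to adding the semantic ID tokens to the tokenizer and resizing the embedding matrix. A minimal illustration (token counts follow the 4-level, 256-codes-per-level scheme above; this is not the repo's exact script):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")

# Delimiters, the recommendation trigger, and 4 x 256 semantic ID tokens
new_tokens = ["<|sid_start|>", "<|sid_end|>", "<|rec|>"]
new_tokens += [f"<|sid_{i}|>" for i in range(1024)]
tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})
model.resize_token_embeddings(len(tokenizer))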

Results

Recommendation Performance

Model                 Hit@10   NDCG@10   MRR
Baseline SASRec       0.281    0.154     0.130
Semantic ID SASRec    0.202    0.114     0.101
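
The metrics follow their standard definitions. For a single user with one held-out target item (a sketch; the repo's evaluation code may differ in details):

import math

def metrics_at_k(ranked_items, target, k=10):
    topk = ranked_items[:k]
    rank = topk.index(target) + 1 if target in topk else None
    return {
        "hit@k": 1.0 if rank else 0.0,
        "ndcg@k": 1.0 / math.log2(rank + 1) if rank else 0.0,
        "mrr": 1.0 / rank if rank else 0.0,
    }

print(metrics_at_k(["a", "b", "target", "c"], "target"))
# {'hit@k': 1.0, 'ndcg@k': 0.5, 'mrr': 0.333...}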

The semantic ID model trades some accuracy for:

  • Cold-start handling via shared prefixes
  • Natural language steerability
  • Explainability of recommendations

Model Capabilities

What it can do:

  • Recommend items based on history
  • Explain recommendations
  • Handle platform/genre constraints
  • Name and describe bundles
  • Multi-turn refinement

Limitations:

  • Lower precision than specialized recommenders
  • 4x inference cost (4 tokens per item)
  • Requires careful RQ-VAE tuning

Trained Models

Available on HuggingFace:

  • eugeneyan/semantic-id-qwen3-8b-video-games (the finetuned Qwen3-8B used in the examples above)

Citation

@article{yan2025semantic,
  title={How to Train an LLM-recommender Hybrid that Speaks English & Item IDs},
  author={Yan, Eugene},
  journal={eugeneyan.com},
  year={2025},
  url={https://eugeneyan.com/writing/semantic-ids/}
}


Contributing

Contributions welcome! Areas of interest:

  • Multi-modal semantic IDs (images, audio)
  • Larger-scale experiments (millions of items)
  • Alternative quantization methods
  • Production deployment strategies

License

MIT License - see LICENSE file for details.

Built with ♥️ by Eugene Yan | Compute credits courtesy of RunPod
