Conversation

@ikawrakow (Owner) commented on Mar 9, 2025

This PR adds special purpose matrix-vector multiplications for MoE models.

For DeepSeek-Lite this results in a ~25% speedup for token generation.

For now it is only available with the -fmoe option, and only for quantized experts.

@ikawrakow force-pushed the ik/cuda_faster_moe_tg branch from cb1636b to 90ab066 on March 9, 2025 at 14:56
@ikawrakow merged commit 699c9cb into main on Mar 10, 2025