## What's Changed
### 🚀 Features
- support offloading weights & kv_cache for turbomind by @irexyc in #3798
- Add PPU backend support by @guozixu2001 in #3807
- Add turbomind metrics by @lvhan028 in #3811
- PytorchEngine supports gpt-oss bf16 by @grimoire in #3820 (see the sketch after this list)
- support sleep/wakeup for pt engine by @irexyc in #3687
- [ascend] run intern-s1 on A3 by @yao-fengchen in #3831
- Initial gpt-oss support for turbomind by @lzhangzz in #3839
- Support GLM-4-0414 and GLM-4.1V by @CUHKSZzxy in #3846
- support internvl3.5 by @lvhan028 in #3886
- Update turbomind communication library by @lzhangzz in #3736
- MXFP4 support for turbomind GEMM library by @lzhangzz in #3927
- Dispatch MXFP4 weight conversion for sm70 & sm75 by @lzhangzz in #3937
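Several of these additions surface through the Python pipeline API. Below is a minimal sketch of loading one of the newly supported models; the model id is a placeholder, and `dtype='bfloat16'` mirrors the gpt-oss bf16 support from #3820, so treat it as an illustration rather than the canonical setup.

```python
# Minimal sketch: running a newly supported model with the PyTorch engine.
# The model id is a placeholder; any supported HF-style checkpoint works.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    'openai/gpt-oss-20b',  # placeholder model id
    backend_config=PytorchEngineConfig(dtype='bfloat16'),
)
print(pipe('Hello, world!'))
```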
### 💥 Improvements
- fix: turbomind backend config in cli serve by @PeymanRM in #3784
- remove deprecated code by @lvhan028 in #3759
- Refactor FP8 MoE GEMM by @lzhangzz in #3795
- Fix build rope params by @grimoire in #3760
- Optimize rmsnorm with head_dim=128 by @grimoire in #3814
- Simplify GEMM interface by @lzhangzz in #3818
- Optimize create_model_inputs and schedule_decoding by @grimoire in #3766
- add remote logs; optimize forward lock by @grimoire in #3737
- support deepgemm new api by @grimoire in #3827
- remove serving with gradio by @lvhan028 in #3829
- Deprecate interactive mode from api_server by @lvhan028 in #3830
- build(docker): Try to optimize docker by @windreamer in #3779
- Make a common chat.py to replace each engine's by @lvhan028 in #3836
- Ray mp engine backend by @grimoire in #3790
- [Feat] support using external ray pg with bundles by @CyCle1024 in #3850
- Remove unused code in PT Engine by @grimoire in #3858
- support logprobs by @grimoire in #3852 (exercised in the sketch after this list)
- optimize prefill preprocess by @grimoire in #3869
- fix flash-attn bc by @grimoire in #3873
- Graph warmup by @grimoire in #3851
- Improve turbomind's prefix cache by @lvhan028 in #3835
- Support the OpenAI-compatible parameter max_completion_tokens by @Huarong in #3876 (see the sketch after this list)
- [ascend] add env to set rt visible by ray and disable warmup by @tangzhiyi11 in #3894
- support cache_max_entry_count >= 1 for Turbomind backend by @lh9171338 in #3913
- adjust default values by @lvhan028 in #3921
- [refactor][chat_template][1/N] adopt tokenizer's apply_chat_template by @lvhan028 in #3845
- use FA 2.8.3, which is compatible with torch 2.8.0, by @lvhan028 in #3936
- refactor ascend Dockerfile by @yao-fengchen in #3926
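A minimal sketch of the two serving-side additions above, `max_completion_tokens` (#3876) and `logprobs` (#3852). It assumes an OpenAI-compatible api_server is already running locally and that the served model name matches the placeholder below.

```python
# Minimal sketch, assuming a server was started beforehand, e.g.:
#   lmdeploy serve api_server <model> --server-port 23333
from openai import OpenAI

client = OpenAI(base_url='http://localhost:23333/v1', api_key='none')
resp = client.chat.completions.create(
    model='gpt-oss-20b',            # placeholder; use the served model's name
    messages=[{'role': 'user', 'content': 'Say hi in one sentence.'}],
    max_completion_tokens=64,       # OpenAI-compatible cap on generated tokens (#3876)
    logprobs=True,                  # per-token log probabilities (#3852)
)
print(resp.choices[0].message.content)
```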
### 🐞 Bug fixes
- fix gemma3 by @grimoire in #3772
- fix head_dim=None by @grimoire in #3793
- fix user-specified max_session_len by @grimoire in #3785
- remove 'lmdeploy convert' from CLI by @lvhan028 in #3813
- Fix EP with large batch size by @grimoire in #3808
- fix internvl disable_vision_encoder by @grimoire in #3800
- Align response behavior across both engines by @lvhan028 in #3821
- fix: set text_config.tie_word_embedding = False in qwen2vl by @zenosai in #3824
- Fix v1 comp protocol by @CUHKSZzxy in #3828
- [dlinfer] fix get_backend err by @yao-fengchen in #3847
- Update internvl.py to fix #3528 by @zodiacg in #3837
- fix partial rotary factor by @CUHKSZzxy in #3861
- fix: duplicated token usage in /chat/completions stream mode by @Huarong in #3859
- fix chatting with VLM model via CLI by @lvhan028 in #3862
- fix inference on windows platform by @irexyc in #3865
- fix prebuild on cuda12.8 by @lvhan028 in #3857
- Fix uninitialized members in cuBLAS wrapper by @lzhangzz in #3874
- fix flashmla build for cuda12.4 by @CUHKSZzxy in #3872
- [Fix] ray mp engine on ascend platform by @CyCle1024 in #3877
- fix bug: leaves empty by @Tsundoku958 in #3868
- Fix side effect brought by gpt-oss support by @lvhan028 in #3880
- fix pytorch metrics in mp engine by @CUHKSZzxy in #3882
- Fix stream assert error when waking up 30+ times by @CyCle1024 in #3883
- fix batched prefill by @grimoire in #3887
- fix side effect brought by #3821 by @lvhan028 in #3888
- check_env in multiprocess by @grimoire in #3879
- fix cli serve --help by @RunningLeon in #3895
- Resolve a crash in the `sleep` endpoint by casting the `level` parameter from string to int by @irexyc in #3897 (see the sketch after this list)
- Fix nccl for docker cu11 by @RunningLeon in #3896
- disable check_env in multiprocess on dlinfer devices by @tangzhiyi11 in #3914
- [dlinfer] fix nn layout typo and scale t by @yuchiwang in #3915
- fix chat and warmup of lora adapter by @grimoire in #3911
- build(ascend): try to fix ascend CI docker build by @windreamer in #3906
- fix internvl3 hf by @CUHKSZzxy in #3932
- build(docker): fix ascend tag name by @windreamer in #3939
- put eot_token to stop_words by @lvhan028 in #3941
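For reference, a minimal sketch of exercising the `sleep`/`wakeup` endpoints touched by #3897. The routes and the integer `level` parameter are assumptions inferred from the fix description, not a documented API.

```python
# Minimal sketch; endpoint paths and parameter shape are assumptions.
import requests

base = 'http://localhost:23333'
requests.post(f'{base}/sleep', params={'level': 1})  # `level` is now cast to int server-side
requests.post(f'{base}/wakeup')
```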
### 📚 Documentations
- update proxy docs by @CUHKSZzxy in #3796
- add missing docs by @CUHKSZzxy in #3871
- fix docs by @CUHKSZzxy in #3885
- update news and citation by @lvhan028 in #3889
### 🌐 Other
- add prometheus client by @CUHKSZzxy in #3792
- fix: add dummy_prefill guard for PD connection operations by @FirwoodLin in #3803
- minor fix about the log level and logs by @lvhan028 in #3758
- assert PytorchEngineConfig block size by @Tsundoku958 in #3826
- [ci] change restful api into openai and add more testcase by @zhulinJulia24 in #3866
- remove ppu backend by @yao-fengchen in #3904
- [ci] remove flash attn installation in ete test workflow by @zhulinJulia24 in #3908
- dlinfer backend support ray by @yao-fengchen in #3903
- style(types): fix return type annotation for get_all_requests by @xiaoajie738 in #3919
- upgrade torch to 2.8.0 and triton 3.4.0 by @lvhan028 in #3930
- Dlinfer readme by @jinminxi104 in #3938
- bump version to v0.10.0 by @lvhan028 in #3933
## New Contributors
- @PeymanRM made their first contribution in #3784
- @FirwoodLin made their first contribution in #3803
- @zenosai made their first contribution in #3824
- @Tsundoku958 made their first contribution in #3826
- @guozixu2001 made their first contribution in #3807
- @zodiacg made their first contribution in #3837
- @Huarong made their first contribution in #3859
- @yuchiwang made their first contribution in #3915
- @lh9171338 made their first contribution in #3913
**Full Changelog**: v0.9.2...v0.10.0