Releases: xorbitsai/inference
v1.10.0
What's new in 1.10.0 (2025-09-13)
These are the changes in inference v1.10.0.
New features
- FEAT: [model] Support Kokoro-82M-v1.1-zh by @JavisPeng in #4042
- FEAT: IP restriction by env: XINFERENCE_ALLOWED_IPS by @qxo in #4047
- FEAT: add support for the Anthropic API format by @OliverBryant in #4037
- FEAT: OpenAI API support for vLLM JSON schema output by @OliverBryant in #4061
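With #4061, the OpenAI-compatible endpoint can pass a JSON schema through to vLLM for schema-constrained output. A minimal sketch of the request body, assuming it follows OpenAI's `response_format` / `json_schema` convention; the model name and schema are placeholders, not part of these release notes:

```python
import json

# Hypothetical OpenAI-style chat request asking for schema-constrained output.
payload = {
    "model": "qwen2.5-instruct",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Extract city and temperature from: 'It is 21C in Paris.'"}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "weather_report",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "temperature_c": {"type": "number"},
                },
                "required": ["city", "temperature_c"],
            },
        },
    },
}

# This body would be POSTed to the server's /v1/chat/completions route.
body = json.dumps(payload)
```

With a schema attached, the vLLM engine constrains generation so the returned `content` parses as JSON matching the schema.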
Enhancements
- ENH: Update the environment dependencies for GOT-OCR2 by @Gmgge in #4031
- ENH: Clean up memory while running MLX LLM models by @OliverBryant in #4026
- BLD: bump funasr to 1.2.7 by @leslie2046 in #4039
- BLD: cu128 version Dockerfile fix by @zwt-1234 in #4056
- BLD: Update Dockerfile.cu128 by @amumu96 in #4059
- REF: refactor tool calls functionality by @amumu96 in #4025
Bug fixes
- BUG: Fix Kokoro-82M can't run on GPU by @OliverBryant in #4034
- BUG: [embeddings] fix parsing str type hf_overrides for vllm engine by @llyycchhee in #4052
- BUG: missing usage info in jina-embedding-v4 model response by @amumu96 in #4054
- BUG: distributed registration bug by @llyycchhee in #4046
New Contributors
- @JavisPeng made their first contribution in #4042
- @qxo made their first contribution in #4047
Full Changelog: v1.9.1...v1.10.0
v1.9.1
What's new in 1.9.1 (2025-08-30)
These are the changes in inference v1.9.1.
New features
- FEAT: Qwen-Image-Edit by @qinxuye in #3989
- FEAT: Wan 2.2 by @qinxuye in #3996
- FEAT: Update CosyVoice2 to support both streaming and non-streaming speech generation by @Gmgge in #3994
- FEAT: support qwen-image-lightning by @qinxuye in #3995
- FEAT: [UI] support gpu_count configuration in image model. by @yiboyasss in #4016
- FEAT: image2image and inpainting for qwen-image by @qinxuye in #4014
- FEAT: Support Custom vllm embedding dim by @zhcn000000 in #4000
- FEAT: [embedding] support `dimensions` for embedding by @llyycchhee in #3965
- FEAT: [Model] Support DeepSeek-V3.1 Quantization and tool by @Jun-Howie in #4022
- FEAT: Seed-OSS-36B by @Jun-Howie in #4020
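The new `dimensions` support (#3965) lets clients request embeddings truncated to a chosen size. A sketch of the request body, assuming it mirrors OpenAI's embeddings API `dimensions` field; the model name and target size are placeholders:

```python
import json

# Hypothetical OpenAI-style embeddings request with a reduced output dimension.
payload = {
    "model": "jina-embeddings-v4",  # placeholder model name
    "input": ["Xinference makes model serving easy."],
    "dimensions": 512,  # assumed Matryoshka-style truncation target
}

# This body would be POSTed to the server's /v1/embeddings route.
body = json.dumps(payload)
```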
Enhancements
- ENH: added zero-shot and voice-cloning abilities for audio models by @qianduoduo0904 in #3968
- ENH: Add Template for Qwen3 Reranker when model_engine = vllm by @zhcn000000 in #3983
- ENH: Update the environment dependencies for cosyvoice2 by @Gmgge in #4015
- ENH: Compat with xllamacpp 0.2.0 by @codingl2k1 in #4004
- ENH: support chat_template_kwargs for llama.cpp by @qinxuye in #3988
- BLD: Clean up Docker's last legacy cache and images before executing each step by @zwt-1234 in #3963
- BLD: fix CI failures by @qinxuye in #4002
Bug fixes
- BUG: disable flash_attention when GPU compute capability < 8.0 by @amumu96 in #3973
- BUG: fix rerank model creation by @qinxuye in #3977
Documentation
- DOC: update models by @qinxuye in #3958
- DOC: add setting limitation of images for multi modal doc by @amumu96 in #4003
- DOC: Update docs about custom models by @OliverBryant in #4019
- DOC: update models & README by @qinxuye in #4023
Others
- FEAT: KAT-V1 by @Jun-Howie in #3998
New Contributors
- @qianduoduo0904 made their first contribution in #3968
- @OliverBryant made their first contribution in #4019
Full Changelog: v1.9.0...v1.9.1
v1.9.0
What's new in 1.9.0 (2025-08-16)
These are the changes in inference v1.9.0.
New features
- FEAT: [UI] running models data display replica. by @yiboyasss in #3897
- FEAT: [model] Qwen-Image by @qinxuye in #3916
- FEAT: [model] gpt-oss by @qinxuye in #3924
- FEAT: function calling support for deepseek-r1-0528 by @qinxuye in #3931
- FEAT: Support for GLM 4.5 quantized models by @Jun-Howie in #3945
- FEAT: sglang support streaming function call by @aniya105 in #3939
- FEAT: parsing harmony format for gpt-oss by @qinxuye in #3948
- FEAT: Add support for switching rerank model engines and support for rerank of vllm engine by @zhcn000000 in #3881
- FEAT: Support GLM-4.5v by @Jun-Howie in #3957
Enhancements
- ENH: Add qwen3 new model to tool call list by @zhcn000000 in #3900
- ENH: Update chat_template for Qwen3-Coder by @Jun-Howie in #3944
- ENH: add flash_attention control params attn_implementation by @amumu96 in #3951
- ENH: support qwen-image gguf by @qinxuye in #3954
- ENH: clean embedding model cache when using vllm engine by @amumu96 in #3956
- BLD: Downgrade flash-attn to version 2.7.4 by @zwt-1234 in #3953
- BLD: Add Openfst source by @zwt-1234 in #3959
Documentation
- DOC: add doc about cu128 docker by @qinxuye in #3899
- DOC: Update xllamacpp doc by @codingl2k1 in #3862
Full Changelog: v1.8.1...v1.9.0
v1.8.1
What's new in 1.8.1 (2025-08-03)
These are the changes in inference v1.8.1.
New features
- FEAT: kokoro mlx support by @qinxuye in #3823
- FEAT: Qwen3-Instruct by @Jun-Howie in #3840
- FEAT: [UI] integrate user favorites into feature model output. by @yiboyasss in #3859
- FEAT: support enabling virtualenv and specifying packages when launching a model by @qinxuye in #3854
- FEAT: [UI] support enabling virtualenv and specifying packages when launching a model. by @yiboyasss in #3867
- FEAT: setting max_tokens to maximum if not specified by @qinxuye in #3872
- FEAT: [model] support GLM-4.5 series by @qinxuye in #3882
- FEAT: Qwen3-30B-A3B-it by @Jun-Howie in #3886
- FEAT: Support Qwen3-Thinking by @Jun-Howie in #3888
- FEAT: Support Qwen3-Coder by @Jun-Howie in #3889
Enhancements
- ENH: add mlu device check by @nan9126 in #3844
- ENH: Support for the bge-m3 llama.cpp backend by @codingl2k1 in #3861
- ENH: Added mlx support for deepseek-v3-0324 by @uebber in #3864
- ENH: Add context length limits and automatic truncation features to vLLM embedding models. by @amumu96 in #3887
- BLD: remove sglang from pip install xinference[all] due to dependency conflicts with vllm by @qinxuye in #3865
- BLD: upgrade base image for dockerfile by @zwt-1234 in #3318
- BLD: increase docker build time to 240 minutes to pass the CUDA 12.8 build by @qinxuye in #3892
- REF: add ui module that includes web and gradio UIs. by @qinxuye in #3819
- REF: move continuous batching scheduler into model by @qinxuye in #3824
Bug fixes
- BUG: Fixed an error when using structured output in sglang #3825 by @aniya105 in #3826
- BUG: fix compatibility for old vllm by @qinxuye in #3838
- BUG: Fix abnormal GPU memory usage in Qwen3 Reranker by @JDanielWu in #3846
- BUG: fix compatibility with vllm 0.10.0 by @qinxuye in #3875
- BUG: fix version checks for vllm by @qinxuye in #3891
Documentation
- DOC: add experimental feature for virtualenv by @qinxuye in #3818
- DOC: add doc about model virtual env settings when launching a model by @qinxuye in #3885
Others
- FIX: GLM4.1V Repository URL by @Jun-Howie in #3839
- BLD: fix docker build for cu128 by @zwt-1234 in #3893
- BLD: fix cu128 build by @zwt-1234 in #3895
- CHORE: THUDM has been renamed to zai-org by @Jun-Howie in #3870
New Contributors
- @JDanielWu made their first contribution in #3846
- @uebber made their first contribution in #3864
- @zwt-1234 made their first contribution in #3318
Full Changelog: v1.8.0...v1.8.1
v1.8.1.rc1
What's new in 1.8.1.rc1 (2025-08-03)
These are the changes in inference v1.8.1.rc1.
New features
- FEAT: kokoro mlx support by @qinxuye in #3823
- FEAT: Qwen3-Instruct by @Jun-Howie in #3840
- FEAT: [UI] integrate user favorites into feature model output. by @yiboyasss in #3859
- FEAT: support enabling virtualenv and specifying packages when launching a model by @qinxuye in #3854
- FEAT: [UI] support enabling virtualenv and specifying packages when launching a model. by @yiboyasss in #3867
- FEAT: setting max_tokens to maximum if not specified by @qinxuye in #3872
- FEAT: [model] support GLM-4.5 series by @qinxuye in #3882
- FEAT: Qwen3-30B-A3B-it by @Jun-Howie in #3886
- FEAT: Support Qwen3-Thinking by @Jun-Howie in #3888
- FEAT: Support Qwen3-Coder by @Jun-Howie in #3889
Enhancements
- ENH: add mlu device check by @nan9126 in #3844
- ENH: Support for the bge-m3 llama.cpp backend by @codingl2k1 in #3861
- ENH: Added mlx support for deepseek-v3-0324 by @uebber in #3864
- ENH: Add context length limits and automatic truncation features to vLLM embedding models. by @amumu96 in #3887
- BLD: remove sglang from pip install xinference[all] due to dependency conflicts with vllm by @qinxuye in #3865
- BLD: upgrade base image for dockerfile by @zwt-1234 in #3318
- REF: add ui module that includes web and gradio UIs. by @qinxuye in #3819
- REF: move continuous batching scheduler into model by @qinxuye in #3824
Bug fixes
- BUG: Fixed an error when using structured output in sglang #3825 by @aniya105 in #3826
- BUG: fix compatibility for old vllm by @qinxuye in #3838
- BUG: Fix abnormal GPU memory usage in Qwen3 Reranker by @JDanielWu in #3846
- BUG: fix compatibility with vllm 0.10.0 by @qinxuye in #3875
- BUG: fix version checks for vllm by @qinxuye in #3891
Documentation
- DOC: add experimental feature for virtualenv by @qinxuye in #3818
- DOC: add doc about model virtual env settings when launching a model by @qinxuye in #3885
Others
- FIX: GLM4.1V Repository URL by @Jun-Howie in #3839
- CHORE: THUDM has been renamed to zai-org by @Jun-Howie in #3870
New Contributors
- @JDanielWu made their first contribution in #3846
- @uebber made their first contribution in #3864
- @zwt-1234 made their first contribution in #3318
Full Changelog: v1.8.0...v1.8.1.rc1
v1.8.0
What's new in 1.8.0 (2025-07-20)
These are the changes in inference v1.8.0.
New features
- FEAT: Embedding support llama.cpp backend by @codingl2k1 in #3730
- FEAT: non-stream tool calling for sglang by @aniya105 in #3760
- FEAT: support migrate from v1 to v2 for custom models by @qinxuye in #3810
- FEAT: FLUX.1-Kontext-dev by @qinxuye in #3728
- FEAT: support ERNIE 4.5 by @qinxuye in #3812
- FEAT: [embedding] add support for jina-embeddings-v4 model by @Minamiyama in #3814
- FEAT: [model] support glm-4.1v-thinking by @llyycchhee in #3756
Enhancements
- ENH: Pin xllamacpp>=0.1.23 by @codingl2k1 in #3780
- ENH: add modelscope for fish speech 1.5 by @qinxuye in #3750
- REF: [V2 BREAK] Merge multiple JSON files into one for different model download sources by @ChengjieLi28 in #3765
Bug fixes
- BUG: disable flash_attn for qwen3 embedding & rerank when no gpu available by @qinxuye in #3739
- BUG: Fix bugs in del async_client by @zhcn000000 in #3753
- BUG: add message preprocessing to ensure that content is not null by @amumu96 in #3791
- BUG: pre-check to prevent list index out of range for FunASR family models by @leslie2046 in #3809
- BUG: resolve issue where AI output was lost when no tool was selected for function call #3767 by @aniya105 in #3768
- BUG: fix error in `content` output at `reasoning_content`, when using `enable_thinking` in `chat_template_kwargs` by @amumu96 in #3794
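Several entries above touch `chat_template_kwargs`, which is passed in the chat request body to control the chat template (e.g. toggling a model's thinking mode). A minimal sketch of such a request; the model name is a placeholder, and engine support varies (these notes add llama.cpp support in #3988):

```python
import json

# Hypothetical chat request disabling thinking mode via chat_template_kwargs.
payload = {
    "model": "qwen3",  # placeholder model name
    "messages": [{"role": "user", "content": "Reply with a single word: ping"}],
    "chat_template_kwargs": {"enable_thinking": False},
}

# This body would be POSTed to the server's /v1/chat/completions route.
body = json.dumps(payload)
```

With `enable_thinking` disabled, the response should contain only `content`, with no `reasoning_content` or stray `<think>` tags.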
Documentation
- DOC: fix links by @qinxuye in #3774
- DOC: update info in docs by @qinxuye in #3779
- DOC: update models by @qinxuye in #3815
Full Changelog: v1.7.1...v1.8.0
v1.7.1.post1
What's new in 1.7.1.post1 (2025-06-30)
These are the changes in inference v1.7.1.post1.
Enhancements
- BLD: pin transformers version at 4.52.4 to fix "Failed to import module 'SentenceTransformer'" error by @amumu96 in #3743
Full Changelog: v1.7.1...v1.7.1.post1
v1.7.1
What's new in 1.7.1 (2025-06-27)
These are the changes in inference v1.7.1.
New features
- FEAT: [UI] enhance audio & rerank model registration params. by @yiboyasss in #3656
- FEAT: support async client by @zhcn000000 in #3645
- FEAT: [UI] add max_tokens display in rerank model. by @yiboyasss in #3671
- FEAT: [UI] add model_ability options for LLM registration. by @yiboyasss in #3663
- FEAT: support QwenLong-L1 by @Jun-Howie in #3691
- FEAT: [UI] model registration supports packages. by @yiboyasss in #3702
- FEAT: support MLU device by @nan9126 in #3693
- FEAT: vllm v1 auto enabling by @qinxuye in #3637
- FEAT: distributed inference for MLX by @qinxuye in #3700
Enhancements
- ENH: add `enable_flash_attn` param for loading qwen3 embedding & rerank by @qinxuye in #3640
- ENH: add more abilities for builtin model families API by @qinxuye in #3658
- ENH: improve local cluster startup reliability via child-process readiness signaling by @Checkmate544 in #3642
- ENH: FishSpeech support pcm by @codingl2k1 in #3680
- ENH: Add 4-sample micro-batching to Qwen-3 reranker to reduce GPU memory by @yasu-oh in #3666
- ENH: Limit default n_parallel for llama.cpp backend by @codingl2k1 in #3712
- BLD: pin flash-attn & flashinfer-python version and limit sgl-kernel version by @amumu96 in #3669
- BLD: Update Dockerfile by @XiaoXiaoJiangYun in #3695
- REF: remove unused code by @qinxuye in #3664
Bug fixes
- BUG: fix TTS error: No such file or directory by @robin12jbj in #3625
- BUG: Fix max_tokens value in Qwen3 Reranker by @yasu-oh in #3665
- BUG: fix custom embedding by @qinxuye in #3677
- BUG: [UI] rename the command-line argument from download-hub to download_hub. by @yiboyasss in #3685
- BUG: fix jina-clip-v2 for text only or image only by @qinxuye in #3690
- BUG: internvl chat error using vllm engine by @amumu96 in #3722
- BUG: fix the parsing logic of streaming tool calls by @amumu96 in #3721
- BUG: fix `<think>` wrongly added when setting `chat_template_kwargs {"enable_thinking": False}` by @qinxuye in #3718
Documentation
- DOC: add doc for paraformer by @leslie2046 in #3631
- DOC: Flexible model (traditional ML models) by @qinxuye in #3714
New Contributors
- @robin12jbj made their first contribution in #3625
- @zhcn000000 made their first contribution in #3645
- @yasu-oh made their first contribution in #3665
- @Checkmate544 made their first contribution in #3642
- @nan9126 made their first contribution in #3693
- @XiaoXiaoJiangYun made their first contribution in #3695
Full Changelog: v1.7.0...v1.7.1
v1.7.0.post1
What's new in 1.7.0.post1 (2025-06-13)
These are the changes in inference v1.7.0.post1.
Bug fixes
- BUG: fix qwen3-rerank to create model on GPU by @qinxuye in #3630
- BUG: fix MiniCPM4 modeling by @Jun-Howie in #3632
Full Changelog: v1.7.0...v1.7.0.post1
v1.7.0
What's new in 1.7.0 (2025-06-13)
These are the changes in inference v1.7.0.
New features
- FEAT: support CogView4 image model by @qinxuye in #3557
- FEAT: [UI] support model_ability filter for image and video models. by @yiboyasss in #3563
- FEAT: [UI] auto-switch to active tab when Running Models page loads. by @yiboyasss in #3568
- FEAT: support first-last-frame to video by @qinxuye in #3555
- FEAT: [UI] add Japanese and Korean language support. by @yiboyasss in #3574
- FEAT: SeACoParaformer model by @leslie2046 in #3587
- FEAT: support verbose_json for funasr family audio2text models by @leslie2046 in #3591
- FEAT: support deepseek-r1-0528 Mixed quantization by @Jun-Howie in #3601
- FEAT: support engines for embedding models by @pengjunfeng11 in #2791
- FEAT: support MiniCPM4 series by @Jun-Howie in #3609
- FEAT: [UI] add model_engine parameter to embedding model. by @yiboyasss in #3617
- FEAT: add kwargs for transcripts client API by @leslie2046 in #3622
- FEAT: support qwen3 embedding by @qinxuye in #3615
- FEAT: support qwen3-reranker by @qinxuye in #3627
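The qwen3-reranker support (#3627) is exposed through the rerank route. A sketch of the request body, assuming the common Cohere-style shape (`model`, `query`, `documents`); the model name and documents are placeholders:

```python
import json

# Hypothetical rerank request for the newly supported qwen3-reranker.
payload = {
    "model": "qwen3-reranker",  # placeholder model name
    "query": "What is vector search?",
    "documents": [
        "Vector search finds nearest neighbours in embedding space.",
        "Paris is the capital of France.",
    ],
}

# This body would be POSTed to the server's /v1/rerank route; the response
# scores each document's relevance to the query.
body = json.dumps(payload)
```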
Enhancements
- ENH: Support pcm response_format by @codingl2k1 in #3606
Bug fixes
- BUG: Fix dependency by @codingl2k1 in #3566
- BUG: Fix cmdline by @codingl2k1 in #3589
- BUG: fix potential hang for sglang by @qinxuye in #3597
- BUG: [UI] fixed the mobile language switching bug. by @yiboyasss in #3608
- BUG: Fix the error when using Qwen function call with Spring AI. by @aniya105 in #3614
Documentation
- DOC: update links by @qinxuye in #3565
- DOC: Update CosyVoice doc by @codingl2k1 in #3605
- DOC: update models by @qinxuye in #3628
Others
- FIX: [UI] fix model_engine parameter bug. by @yiboyasss in #3620
Full Changelog: v1.6.1...v1.7.0