OpenVINO Model Server 2026.0

@atobiszei atobiszei released this 24 Feb 16:35
· 12 commits to releases/2026/0 since this release
aee8c90

Performance improvements

  • Improved performance and accuracy for GPT-OSS and Qwen3-MOE models.
  • Improved execution performance, especially on Intel® Core™ Ultra Series 3 built-in GPUs.
  • Better accuracy with INT4 precision, especially with long prompts.
  • Corrected compilation-cache handling to speed up model loading.

Agentic use case improvements

  • Improved chat template examples to fix handling of agentic use cases.
  • Made tool parsers less restrictive about generated content, improving response reliability.
  • Added support for a tool parser compatible with the Devstral model; take advantage of unsloth/Devstral-Small-2507 or a similar model for coding tasks. See the code local assistant demo and the LLM reference for details.
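Tool calls flow through the server's OpenAI-compatible chat completions API. As a hedged sketch (the model name, tool name, and function schema below are placeholders, not values from these release notes), this is the shape of a tool-calling request the improved parsers process:

```python
import json

# Hedged sketch: builds an OpenAI-style chat completion request carrying a
# tool definition. The tool ("run_shell") and its schema are hypothetical
# examples for a coding-assistant use case, not part of OVMS itself.
def build_tool_request(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "run_shell",  # hypothetical coding-task tool
                    "description": "Run a shell command and return its output.",
                    "parameters": {
                        "type": "object",
                        "properties": {"command": {"type": "string"}},
                        "required": ["command"],
                    },
                },
            }
        ],
        "tool_choice": "auto",  # let the model decide when to call the tool
    }

payload = build_tool_request("unsloth/Devstral-Small-2507", "List files in /tmp")
print(json.dumps(payload)[:40])
```

The tool parser's job is to extract such function calls from the model's generated text and return them in the structured `tool_calls` field of the response.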

Audio endpoints improvements

  • Improvements in the text2speech endpoint:
    • Added a voice parameter to choose the speaker based on a provided embeddings vector.
  • Improvements in the speech2text endpoint:
    • Added handling for the temperature sampling parameter.
    • Added support for timestamps in the output.

Check the Audio endpoints demo
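As a hedged illustration of the new parameters, the sketch below builds request bodies in the OpenAI audio API shape (the field names follow that API; the exact set OVMS accepts may differ, and the model and voice names are placeholders):

```python
# Hedged sketch of audio endpoint request parameters. Model and voice names
# are placeholders; consult the Audio endpoints demo for the exact API.
def speech_request(text: str, voice: str) -> dict:
    # text2speech: the new `voice` parameter selects a speaker
    return {"model": "tts-model", "input": text, "voice": voice}

def transcription_params(temperature: float) -> dict:
    # speech2text: temperature sampling plus timestamps in the output
    return {
        "model": "stt-model",
        "temperature": temperature,
        "response_format": "verbose_json",       # response includes timing info
        "timestamp_granularities": ["segment"],  # per-segment timestamps
    }

print(speech_request("Hello", "speaker_0")["voice"])  # speaker_0
print(transcription_params(0.2)["temperature"])       # 0.2
```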

VLM pipeline improvements

  • New parameters have been added to VLM pipelines to control domain name restrictions for image URLs in requests, with optional URL redirection support. By default, all URLs are blocked. Use --allowed_media_domains and --allowed_local_media_path to configure the allowed sources. See server parameters for details.
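A hedged sketch of a launch command using the two new flags (the model name, paths, and port are placeholders for your deployment; only --allowed_media_domains and --allowed_local_media_path come from this release, and the exact value syntax is documented in server parameters):

```shell
# Hedged sketch: restrict image sources for a VLM pipeline.
# Model name, path, and port are placeholders for your deployment.
ovms --rest_port 8000 \
     --model_name my_vlm --model_path /models/my_vlm \
     --allowed_media_domains example.com,cdn.example.com \
     --allowed_local_media_path /workspace/images
```

Remember that with neither flag set, all image URLs in requests are blocked by default.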

Embeddings and reranker improvements

  • NPU execution for text embeddings endpoint (preview). Check the embeddings demo for details.
  • Exposed a tokenizer endpoint for reranker and LLM pipelines.
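As a hedged sketch, an embeddings request to the server's OpenAI-compatible API looks as follows (the model name is a placeholder; for NPU execution the model is compiled for the NPU device at deployment time, as shown in the embeddings demo):

```python
# Hedged sketch: an OpenAI-style embeddings request body. The model name
# below is a placeholder, not a value from the release notes.
def embeddings_request(model: str, texts: list) -> dict:
    # `input` may be a single string or a list of strings to embed in one call
    return {"model": model, "input": texts}

req = embeddings_request("BAAI/bge-small-en-v1.5", ["first text", "second text"])
print(len(req["input"]))  # 2
```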

Deployment improvements

  • Added configurable preprocessing for classic models. Deployed models can include extra preprocessing layers added at runtime. This simplifies client implementations and enables sending encoded images to models that otherwise accept only array input. Possible options include:

    • Color format change
    • Layout change
    • Scale changes
    • Mean changes
    • Precision change

    See server parameters and the ONNX model preprocessing demo for details.
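To illustrate what these options do, here is a minimal pure-Python sketch of the underlying arithmetic (this is not the server's implementation, just the transformations it applies on the input tensor):

```python
# Minimal sketch of mean/scale/precision preprocessing on pixel values.
# NOT the server's implementation; it only illustrates the arithmetic.
def preprocess(pixels, mean=127.5, scale=127.5, out_dtype=float):
    # mean change, scale change, then precision change via out_dtype
    return [out_dtype((p - mean) / scale) for p in pixels]

def bgr_to_rgb(pixel):
    # color format change: reverse the channel order of a single pixel
    return pixel[::-1]

print(preprocess([0, 127.5, 255]))  # [-1.0, 0.0, 1.0]
print(bgr_to_rgb([255, 0, 0]))      # [0, 0, 255]
```

A layout change (e.g. NHWC to NCHW) is analogous but permutes whole tensor axes rather than per-pixel values.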

  • Optimized file handle usage to reduce the number of open files during high-load operations on Linux deployments.

New or updated demos

Bug fixes

  • Security improvements.

Known issues

  • Qwen3-MOE models such as Qwen3-Coder-30B-Instruct or Qwen3-30B-A3B in INT4 quantization, when deployed on GPU, might have reduced accuracy with long prompts. A temporary workaround is to set the environment variable MOE_USE_MICRO_GEMM_PREFILL=0 before starting the ovms process. This deactivates the problematic transformation and slightly increases the TTFT metric. The CPU target device, and precisions other than INT4 on GPU, are not impacted.
  • The gpt-oss model, when deployed on GPU, should be used only with single concurrency. With high concurrency, there is a risk of reduced accuracy. The CPU target device is not impacted by this issue.
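The workaround for the first issue can be sketched as follows (the ovms arguments are placeholders for your deployment; only the environment variable comes from these notes):

```shell
# Hedged sketch: disable the problematic MoE micro-GEMM prefill
# transformation before launching ovms on GPU. Slightly increases TTFT.
export MOE_USE_MICRO_GEMM_PREFILL=0
ovms --rest_port 8000 --model_name qwen3-moe --model_path /models/qwen3-moe
```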

Both issues are expected to be fixed soon.


You can pull the public OpenVINO Model Server Docker images, based on Ubuntu, with the following commands:

  • docker pull openvino/model_server:2026.0 - CPU device support with an image based on Ubuntu 24.04
  • docker pull openvino/model_server:2026.0-gpu - GPU, NPU and CPU device support with an image based on Ubuntu 24.04

or use the provided binary packages. Only packages with the _python_on suffix include Python support.

Packages are also distributed via https://storage.openvinotoolkit.org/repositories/openvino_model_server/packages/2026.0.0/

Check the instructions on how to install the binary package. The prebuilt image is also available in the Red Hat Ecosystem Catalog.