Gateway with Inference Extension is an ACK component built on Envoy Gateway and the Kubernetes Gateway API Inference Extension specification. It provides Layer 4 and Layer 7 traffic routing for Kubernetes clusters, along with intelligent load balancing for large language model (LLM) inference workloads.
Prerequisites
Before you install Gateway with Inference Extension, install the Gateway API component in your cluster. Gateway with Inference Extension requires the CustomResourceDefinitions (CRDs) that the Gateway API component provides. For installation steps, see Install components.
Change log
December 2025
| Version | Date | Changes | Impact |
|---|---|---|---|
| v1.4.0-apsara.4 | December 16, 2025 | - Supports the InferencePool v1 CRD.<br>- Supports the vLLM v1 inference engine.<br>- Improves intelligent routing scheduling under high concurrency. | Upgrading from an earlier version restarts the gateway Pod. Perform the upgrade during off-peak hours. |
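The InferencePool CRD mentioned above defines a pool of model-serving Pods that the gateway balances across. A minimal sketch of such a resource, assuming a vLLM Deployment labeled `app: vllm` that serves on port 8000 (the resource name, labels, and field layout here are illustrative; verify the exact schema against the CRD installed in your cluster):

```yaml
# Illustrative InferencePool manifest; names are hypothetical and the
# field layout should be checked against the CRD version in your cluster.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-pool
spec:
  # Select the model-serving Pods that belong to this pool.
  selector:
    matchLabels:
      app: vllm
  # Port on which the vLLM Pods accept inference requests.
  targetPorts:
  - number: 8000
  # Endpoint picker that makes the intelligent routing decision per request.
  endpointPickerRef:
    name: vllm-pool-epp
```

The endpoint picker is what distinguishes an InferencePool from a plain Service: instead of round-robin balancing, each request is dispatched based on inference-aware signals such as queue depth and prefix-cache state.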
September 2025
| Version | Date | Changes | Impact |
|---|---|---|---|
| v1.4.0-apsara.3 | September 4, 2025 | - Supports inference routes for SGLang PD-separated services.<br>- Supports prefix cache-aware routing in precise mode.<br>- Supports routing to external Model as a Service (MaaS) services.<br>- Supports integration with Alibaba Cloud Content Moderation for AI content review.<br>- Supports configuring inference routing policies using the InferenceTrafficPolicy API. | Upgrading from an earlier version restarts the gateway Pod. Perform the upgrade during off-peak hours. |
May 2025
| Version | Date | Changes | Impact |
|---|---|---|---|
| v1.4.0-aliyun.1 | May 27, 2025 | - Supports Gateway API 1.3.0.<br>- Inference extension:<br>&nbsp;&nbsp;- Supports the vLLM, SGLang, and TensorRT-LLM inference frameworks.<br>&nbsp;&nbsp;- Supports prefix-aware load balancing.<br>&nbsp;&nbsp;- Supports routing based on model names.<br>&nbsp;&nbsp;- Supports request queuing and priority scheduling.<br>&nbsp;&nbsp;- Provides observability for generative AI requests.<br>- Supports global rate limiting.<br>- Supports global rate limiting based on tokens in generative AI requests.<br>- Supports injecting Secret content into specified request headers. | Upgrading from an earlier version restarts the gateway Pod. Perform the upgrade during off-peak hours. |
April 2025
| Version | Date | Changes | Impact |
|---|---|---|---|
| v1.3.0-aliyun.2 | May 7, 2025 | - Supports ACS clusters.<br>- Inference extension enhancements:<br>&nbsp;&nbsp;- Supports referencing InferencePool resources in HTTPRoute.<br>&nbsp;&nbsp;- Supports weighted routing, traffic mirroring, and circuit breaking at the InferencePool level.<br>&nbsp;&nbsp;- Supports prefix-aware load balancing. | Upgrading from an earlier version restarts the gateway Pod. Perform the upgrade during off-peak hours. |
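Referencing an InferencePool from an HTTPRoute works like any other backend reference, using a custom `group` and `kind` in `backendRefs`. A minimal sketch, assuming a Gateway named `inference-gateway` and an InferencePool named `vllm-pool` (both names, the path prefix, and the API group are illustrative assumptions; use the group of the InferencePool CRD version installed in your cluster):

```yaml
# Illustrative HTTPRoute that sends traffic to an InferencePool
# instead of a regular Service. All names are hypothetical.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway   # hypothetical Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    # The group/kind pair marks this backend as an InferencePool.
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-pool
```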
March 2025
| Version | Date | Changes | Impact |
|---|---|---|---|
| v1.3.0-aliyun.1 | March 12, 2025 | - Supports Gateway API v1.2.<br>- Supports Inference Extension for intelligent load balancing in LLM inference scenarios. | This upgrade does not affect your services. |