
Container Service for Kubernetes:Gateway with Inference Extension

Last Updated: Mar 26, 2026

Gateway with Inference Extension is an ACK component built on Envoy Gateway and the Kubernetes Gateway API Inference Extension specification. It provides Layer 4 and Layer 7 routing for Kubernetes clusters, along with intelligent load balancing for large language model (LLM) inference workloads.
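The core idea of the Gateway API Inference Extension is that an HTTPRoute can send traffic to an InferencePool, a group of model server Pods, instead of a regular Service; an endpoint picker attached to the pool then chooses the best Pod per request. The manifest below is a hypothetical sketch following the upstream v1 API; all names (`inference-gateway`, `vllm-pool`, the `app: vllm` label, port 8000) are placeholders, and exact field names may differ between component versions:

```yaml
# Hypothetical example: route traffic on a Gateway to a pool of vLLM Pods.
# The endpoint picker referenced by the pool performs the inference-aware
# load balancing across the model server Pods.
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-pool
spec:
  selector:
    matchLabels:
      app: vllm                # placeholder label on the model server Pods
  targetPorts:
  - number: 8000               # port the model servers listen on
  endpointPickerRef:
    name: vllm-pool-epp        # placeholder name of the endpoint picker Service
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway    # placeholder Gateway name
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-pool
```

Referencing the InferencePool by `group` and `kind` in `backendRefs` is what distinguishes an inference route from a route to an ordinary Service backend.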

Prerequisites

Before you install Gateway with Inference Extension, install the Gateway API component in your cluster. Gateway with Inference Extension requires the CustomResourceDefinitions (CRDs) that the Gateway API component provides. For installation steps, see Install components.
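You can confirm that the Gateway API CRDs are present before installing. A quick check, assuming `kubectl` access to the cluster (the two CRD names below are the standard upstream Gateway API ones; your cluster may ship additional Gateway API CRDs):

```shell
# List the core Gateway API CRDs; an error means the Gateway API
# component is not yet installed in this cluster.
kubectl get crd gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io
```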

What's next

For usage instructions and configuration examples, see Overview of Gateway with Inference Extension.

Change log

December 2025

Version: v1.4.0-apsara.4
Release date: December 16, 2025
Changes:
  • Supports the InferencePool v1 CRD.
  • Supports the vLLM v1 inference engine.
  • Improves intelligent routing scheduling under high concurrency.
Impact: Upgrading from an earlier version restarts the gateway Pod. Perform the upgrade during off-peak hours.

September 2025

Version: v1.4.0-apsara.3
Release date: September 4, 2025
Changes:
  • Supports inference routes for SGLang PD-separated services.
  • Supports prefix cache-aware routing in precise mode.
  • Supports routing to external Model as a Service (MaaS) services.
  • Supports integration with Alibaba Cloud Content Moderation for AI content review.
  • Supports configuring inference routing policies using the InferenceTrafficPolicy API.
Impact: Upgrading from an earlier version restarts the gateway Pod. Perform the upgrade during off-peak hours.

May 2025

Version: v1.4.0-aliyun.1
Release date: May 27, 2025
Changes:
  • Supports Gateway API 1.3.0.
  • Inference extension:
    • Supports the vLLM, SGLang, and TensorRT-LLM inference frameworks.
    • Supports prefix-aware load balancing.
    • Supports routing based on model names.
    • Supports request queuing and priority scheduling.
  • Provides observability for generative AI requests.
  • Supports global rate limiting.
  • Supports global rate limiting based on tokens in generative AI requests.
  • Supports injecting Secret content into specified request headers.
Impact: Upgrading from an earlier version restarts the gateway Pod. Perform the upgrade during off-peak hours.

April 2025

Version: v1.3.0-aliyun.2
Release date: May 7, 2025
Changes:
  • Supports ACS clusters.
  • Inference extension enhancements:
    • Supports referencing InferencePool resources in HTTPRoute.
    • Supports weighted routing, traffic mirroring, and circuit breaking at the InferencePool level.
  • Supports prefix-aware load balancing.
Impact: Upgrading from an earlier version restarts the gateway Pod. Perform the upgrade during off-peak hours.

March 2025

Version: v1.3.0-aliyun.1
Release date: March 12, 2025
Changes:
  • Supports Gateway API v1.2.
  • Supports Inference Extension for intelligent load balancing in LLM inference scenarios.
Impact: This upgrade does not affect your services.