Knative Pod Autoscaler (KPA) is the request-driven autoscaling feature provided by Knative. It allows you to set scaling conditions based on pod concurrency or requests per second (RPS). By default, Knative scales the number of pods to zero when there are no active service requests, which optimizes resource usage. Through KPA, you can configure scale-down parameters, such as the wait time before scaling to zero, and specify whether scale-to-zero is enabled.
Prerequisites
Knative is deployed in your cluster. For more information, see Deploy Knative.
How it works
Knative Serving injects a Queue Proxy container named queue-proxy into each pod. This container automatically reports the request concurrency metrics of application pods to the KPA. Based on these metrics, the KPA adjusts the number of pods provisioned for a Deployment according to the number of concurrent requests and its scaling algorithm.
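The core of the scaling decision can be summarized as follows. This is a simplified Python sketch of the stable-mode calculation only; the `desired_pods` helper and its parameters are illustrative assumptions, not Knative source code, and the metric averaging over the stable window is omitted:

```python
import math

def desired_pods(observed_concurrency: float,
                 target: float,
                 utilization: float = 0.7,
                 ready_pods: int = 1,
                 max_scale_up_rate: float = 1000.0,
                 max_scale_down_rate: float = 2.0) -> int:
    """Simplified sketch of the KPA stable-mode calculation."""
    # Effective per-pod target = configured target x utilization
    # (for example, 100 x 0.7 = 70 concurrent requests per pod).
    effective_target = target * utilization
    raw = math.ceil(observed_concurrency / effective_target)
    # The scale rates bound how fast the pod count may change per decision.
    upper = math.ceil(max_scale_up_rate * ready_pods)
    lower = math.floor(ready_pods / max_scale_down_rate)
    return max(lower, min(raw, upper))

# 350 concurrent requests with a per-pod target of 100 at 70% utilization:
print(desired_pods(350, target=100, ready_pods=5))  # 5
```

In panic mode, KPA performs the same calculation over the shorter panic window and only scales up, never down.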
KPA configurations
Some configurations can be enabled at the revision level using annotations or applied globally through ConfigMaps. When both methods are used at the same time, the revision-level configurations take precedence.
config-autoscaler
KPA is configured through the config-autoscaler ConfigMap in the knative-serving namespace, which is created with default values. The following content describes its key parameters.
Run the following command to query config-autoscaler:
kubectl -n knative-serving describe cm config-autoscaler
Expected output (the default config-autoscaler ConfigMap):
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # The default per-pod concurrency target. The default value is 100.
  container-concurrency-target-default: "100"
  # The target utilization for concurrency. The default value is 70, which represents 0.7 (70%).
  container-concurrency-target-percentage: "70"
  # The default requests per second (RPS) target. The default value is 200.
  requests-per-second-target-default: "200"
  # The target burst capacity handles traffic bursts and prevents pod overloading. The default value is 211: if the target threshold of the service multiplied by the number of ready pods, minus the current concurrency, is less than 211, requests are routed through the Activator service.
  # The Activator service receives and buffers requests when the target burst capacity is exceeded.
  # If this parameter is set to 0, the Activator is placed in the request path only when the number of pods is scaled to zero.
  # If this parameter is set to a value greater than 0 and container-concurrency-target-percentage is set to 100, the Activator is always in the request path.
  # If this parameter is set to -1, the burst capacity is unlimited and all requests are buffered by the Activator. Other negative values are invalid.
  # If current ready pods x maximum concurrency - target burst capacity - concurrency calculated in panic mode is less than 0, the traffic burst exceeds the target burst capacity, and the Activator is placed in the request path to buffer requests.
  target-burst-capacity: "211"
  # The stable window over which metrics are averaged. The default is 60 seconds.
  stable-window: "60s"
  # The panic window percentage. The default is 10, which means the default panic window is 6 seconds (60s x 0.1).
  panic-window-percentage: "10.0"
  # The panic threshold percentage. The default is 200.
  panic-threshold-percentage: "200.0"
  # The maximum scale-up rate: the maximum ratio of desired to ready pods per scale-out decision, calculated as math.Ceil(MaxScaleUpRate * readyPodsCount).
  max-scale-up-rate: "1000.0"
  # The maximum scale-down rate. The default is 2, which means the pod count can at most halve in each scale-in decision.
  max-scale-down-rate: "2.0"
  # Specifies whether pods can be scaled to zero. This feature is enabled by default.
  enable-scale-to-zero: "true"
  # The grace period before pods are scaled to zero. The default is 30 seconds.
  scale-to-zero-grace-period: "30s"
  # The retention period of the last pod before scaling to zero. Set this parameter if launching pods is expensive.
  scale-to-zero-pod-retention-period: "0s"
  # The autoscaler class. The following autoscalers are supported: KPA, Horizontal Pod Autoscaler (HPA), and Advanced Horizontal Pod Autoscaler (AHPA).
  pod-autoscaler-class: "kpa.autoscaling.knative.dev"
  # The request capacity of each Activator pod.
  activator-capacity: "100.0"
  # The number of pods created when a revision is created. The default is 1.
  initial-scale: "1"
  # Specifies whether initial-scale can be set to 0. The default is false, which means at least one pod is created when a revision is created.
  allow-zero-initial-scale: "false"
  # The minimum number of pods kept for a revision. The default is 0, which means no pod is kept.
  min-scale: "0"
  # The maximum number of pods to which a revision can scale. The default is 0, which means the maximum is unlimited.
  max-scale: "0"
  # The scale-down delay. The default is 0s, which means scale-in decisions are applied immediately.
  scale-down-delay: "0s"
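The target-burst-capacity routing rule described in the comments above can be sketched as a small decision function. This is an illustrative model, not Knative source code; the function name and parameters are assumptions:

```python
def activator_in_path(ready_pods: int,
                      max_concurrency_per_pod: float,
                      target_burst_capacity: float,
                      panic_concurrency: float) -> bool:
    """Sketch of when the Activator stays in the request path."""
    if target_burst_capacity < 0:    # -1: unlimited, always buffer via the Activator
        return True
    if target_burst_capacity == 0:   # 0: only when scaled to zero
        return ready_pods == 0
    # The Activator stays in the path when spare capacity cannot
    # absorb the configured burst headroom plus in-flight requests.
    spare = (ready_pods * max_concurrency_per_pod
             - target_burst_capacity - panic_concurrency)
    return spare < 0

# 3 ready pods x 100 concurrency = 300 capacity; 211 burst headroom plus 50
# in-flight requests leaves 39 spare, so the Activator can leave the path.
print(activator_in_path(3, 100, 211, 50))  # False
```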
Metrics
You can use the autoscaling.knative.dev/metric annotation to configure the metric for a revision. Different autoscalers support different metrics.
Supported metrics: "concurrency", "rps", "cpu", "memory", and custom metrics. Default metric: "concurrency".
Configure the concurrency metric
Configure the RPS metric
Configure the CPU metric
Configure the memory metric
Configure the target threshold
You can use the autoscaling.knative.dev/target annotation to configure a target for a revision. You can also use the container-concurrency-target-default parameter in the config-autoscaler ConfigMap to configure a global target.
Configure for a revision
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "50"
Configure globally
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-default: "200"
Configure scale-to-zero
Configure global scale-to-zero
The enable-scale-to-zero parameter specifies whether to scale the pods of a Knative Service to zero when the Service is idle. Valid values are "true" and "false".
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "false" # When set to "false", the scale-to-zero feature is disabled: pods are not scaled to zero even when the Knative Service is idle.
Configure the grace period for scale-to-zero
The scale-to-zero-grace-period parameter specifies the grace period before the pods of a Knative Service are scaled to zero.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  scale-to-zero-grace-period: "40s"
Configure the retention period for scale-to-zero
Configure for a revision
The autoscaling.knative.dev/scale-to-zero-pod-retention-period annotation specifies the retention period of the last pod before the pods of a Knative Service are scaled to zero.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "1m5s"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
Configure globally
The scale-to-zero-pod-retention-period parameter specifies the global retention period of the last pod before the pods of a Knative Service are scaled to zero.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  scale-to-zero-pod-retention-period: "42s"
Configure the concurrency
The concurrency indicates the maximum number of requests that a pod can process concurrently. You can configure the concurrency by setting the soft concurrency limit, hard concurrency limit, target utilization, and RPS.
Configure the soft concurrency limit
The soft concurrency limit is a targeted limit rather than a strictly enforced bound. In some scenarios, particularly if a burst of requests occurs, the value may be exceeded.
Configure for a revision
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "200"
Configure globally
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-default: "200" # Specify a global concurrency target for Knative Services.
Configure the hard concurrency limit for a revision
We recommend that you specify the hard concurrency limit only if your application has a specific concurrency upper limit. Setting a low hard concurrency limit adversely affects the throughput and response latency of your application.
The hard concurrency limit is a strictly enforced limit. When the hard concurrency limit is reached, excess requests are buffered by queue-proxy or the Activator service until sufficient resources can be used to handle the requests.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containerConcurrency: 50
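To build intuition for what a strictly enforced limit means, the following is a toy Python model. It is purely illustrative: Knative enforces the limit in queue-proxy and the Activator, not in your application code, and the limit of 3 here is chosen only to keep the demo small:

```python
import threading
import time

# Toy model of a hard concurrency limit: at most `limit` requests execute
# at once, and the rest block in line, mimicking how excess requests are
# buffered until capacity frees up.
limit = 3
slots = threading.Semaphore(limit)
lock = threading.Lock()
in_flight = 0
max_seen = 0

def handle(req_id: int) -> None:
    global in_flight, max_seen
    with slots:                      # blocks once `limit` requests are active
        with lock:
            in_flight += 1
            max_seen = max(max_seen, in_flight)
        time.sleep(0.05)             # simulated request processing
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=handle, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"10 requests served, at most {max_seen} ran concurrently")
```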
Target utilization
The target utilization specifies the percentage of the configured target that the autoscaler actually aims for. It can be thought of as the hotness at which pods run, and it effectively lowers the concurrency value that triggers scaling, causing the autoscaler to scale out before the hard concurrency limit is reached.
For example, if containerConcurrency is set to 10 and the target utilization is set to 70 (percent), the autoscaler creates a pod when the average number of concurrent requests across all existing pods reaches 7. Because a newly created pod needs time to become Ready, decreasing the target utilization lets new pods start before the hard concurrency limit is reached, which helps reduce the response latency caused by cold starts.
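The example above works out as follows. This is a simplified calculation; the observed concurrency value of 21 is an assumed figure for illustration:

```python
import math

container_concurrency = 10   # hard per-pod limit
target_utilization = 0.70    # target-utilization-percentage / 100

# The autoscaler aims to keep the average concurrency per pod at:
effective_target = container_concurrency * target_utilization  # 7.0

# With 21 concurrent requests across the service:
desired = math.ceil(21 / effective_target)
print(desired)  # 3 pods, scaled out before any pod nears the hard limit of 10
```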
Configure for a revision
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target-utilization-percentage: "70" # Configure the target utilization percentage.
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
Configure globally
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-percentage: "70" # KPA attempts to keep the concurrency of each pod below 70% of its configured target.
Configure the RPS
The RPS specifies the number of requests that can be processed by a pod per second.
Configure for a revision
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "150"
        autoscaling.knative.dev/metric: "rps" # The number of pods is adjusted based on the RPS value.
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
Configure globally
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  requests-per-second-target-default: "150"
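With the metric set to "rps", the desired pod count follows the same shape as the concurrency calculation. This is a hedged sketch; the observed value of 900 RPS is an assumed figure for illustration:

```python
import math

target_rps = 150.0      # per-pod RPS target from the annotation or ConfigMap
observed_rps = 900.0    # total RPS measured across the service (assumed)

desired = math.ceil(observed_rps / target_rps)
print(desired)  # 6 pods
```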
Scenario 1: Configure auto scaling by setting a concurrency target
This example shows how to enable KPA to perform auto scaling by setting a concurrency target.
Use the following YAML template to create a file named autoscale-go.yaml and deploy it in the cluster:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: autoscale-go
      annotations:
        autoscaling.knative.dev/target: "10" # Set the concurrency target to 10.
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/autoscale-go:0.1
kubectl apply -f autoscale-go.yaml
Obtain the Ingress gateway.
ALB
Run the following command to obtain the Ingress gateway:
kubectl get albconfig knative-internet
Expected output:
NAME               ALBID                    DNSNAME                                         PORT&PROTOCOL   CERTID   AGE
knative-internet   alb-hvd8nngl0lsdra15g0   alb-hvd8nng******.cn-beijing.alb.aliyuncs.com                            2
MSE
Run the following command to obtain the Ingress gateway:
kubectl -n knative-serving get ing stats-ingress
Expected output:
NAME            CLASS                  HOSTS   ADDRESS                       PORTS   AGE
stats-ingress   knative-ingressclass   *       101.201.XX.XX,192.168.XX.XX   80      15d
ASM
Run the following command to obtain the Ingress gateway:
kubectl get svc istio-ingressgateway --namespace istio-system --output jsonpath="{.status.loadBalancer.ingress[*]['ip']}"
Expected output:
121.XX.XX.XX
Kourier
Run the following command to obtain the Ingress gateway:
kubectl -n knative-serving get svc kourier
Expected output:
NAME      TYPE           CLUSTER-IP   EXTERNAL-IP    PORT(S)                      AGE
kourier   LoadBalancer   10.0.XX.XX   39.104.XX.XX   80:31133/TCP,443:32515/TCP   49m
Use the load testing tool hey to send 50 concurrent requests to the application within 30 seconds.
Note: For more information about hey, see hey.
hey -z 30s -c 50 -host "autoscale-go.default.example.com" "http://121.199.XXX.XXX" # 121.199.XXX.XXX is the IP address of the Ingress gateway.
The output indicates that five pods are added, as expected.
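Why five pods: a quick sanity check, using a simplified calculation that ignores the target-utilization factor and panic-mode details:

```python
import math

# 50 concurrent requests from hey, divided by the revision's
# concurrency target of 10, converges on five pods.
print(math.ceil(50 / 10))  # 5
```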
Scenario 2: Enable auto scaling by setting scale bounds
Scale bounds control the minimum and maximum numbers of pods that can be provisioned for an application. This example shows how to enable auto scaling by setting scale bounds.
Use the following YAML template to create a file named autoscale-go.yaml and deploy the file to the cluster:
This YAML template sets the concurrency target to 10, min-scale to 1, and max-scale to 3.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: autoscale-go
      annotations:
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/max-scale: "3"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/autoscale-go:0.1
kubectl apply -f autoscale-go.yaml
Obtain the Ingress gateway.
ALB
Run the following command to obtain the Ingress gateway:
kubectl get albconfig knative-internet
Expected output:
NAME               ALBID                    DNSNAME                                         PORT&PROTOCOL   CERTID   AGE
knative-internet   alb-hvd8nngl0lsdra15g0   alb-hvd8nng******.cn-beijing.alb.aliyuncs.com                            2
MSE
Run the following command to obtain the Ingress gateway:
kubectl -n knative-serving get ing stats-ingress
Expected output:
NAME            CLASS                  HOSTS   ADDRESS                       PORTS   AGE
stats-ingress   knative-ingressclass   *       101.201.XX.XX,192.168.XX.XX   80      15d
ASM
Run the following command to obtain the Ingress gateway:
kubectl get svc istio-ingressgateway --namespace istio-system --output jsonpath="{.status.loadBalancer.ingress[*]['ip']}"
Expected output:
121.XX.XX.XX
Kourier
Run the following command to obtain the Ingress gateway:
kubectl -n knative-serving get svc kourier
Expected output:
NAME      TYPE           CLUSTER-IP   EXTERNAL-IP    PORT(S)                      AGE
kourier   LoadBalancer   10.0.XX.XX   39.104.XX.XX   80:31133/TCP,443:32515/TCP   49m
Use the load testing tool hey to send 50 concurrent requests to the application within 30 seconds.
Note: For more information about hey, see hey.
hey -z 30s -c 50 -host "autoscale-go.default.example.com" "http://121.199.XXX.XXX" # 121.199.XXX.XXX is the IP address of the Ingress gateway.
The output shows that at most three pods are added, and one pod is retained when no traffic flows to the application.
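The scale-bounds behavior can be checked with a simplified calculation: the raw desired count is clamped into the [min-scale, max-scale] range, and min-scale keeps one pod alive when there is no traffic:

```python
import math

min_scale, max_scale = 1, 3
raw = math.ceil(50 / 10)                       # 5 pods wanted for the load
desired = max(min_scale, min(raw, max_scale))  # capped at max-scale
idle = max(min_scale, 0)                       # pods kept with no traffic
print(desired, idle)  # 3 1
```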
References
You can use AHPA in Knative to implement scheduled scaling by predicting resource demand based on historical metrics. For more information, see Use AHPA to implement scheduled auto scaling.