Realtime Compute for Apache Flink deployments support two automatic tuning modes: Autopilot and scheduled tuning. This topic describes how to configure automatic tuning and the precautions to take when you configure it.
Background information
In most cases, a large amount of time is required for deployment tuning. For example, when you publish a draft, you must configure resources, parallelism, and the number and size of TaskManagers for the draft. When a deployment runs, you must adjust the resources of the deployment to maximize resource utilization. If backpressure occurs on the deployment or the latency increases, you must adjust the configurations of the deployment. Realtime Compute for Apache Flink supports automatic tuning. You can select an appropriate tuning mode based on the information that is described in the following table.
Tuning mode | Scenario | Benefit | References |
Autopilot | A deployment uses 30 compute units (CUs). After the deployment runs for a period of time, the CPU utilization and memory usage of the deployment may be excessively low when no latency or backpressure occurs in the source. If you do not want to manually adjust resources, you can use the Autopilot mode to allow the system to automatically adjust resources. If the resource usage is low, the system automatically downgrades the resource configuration. If the resource usage reaches the specified threshold, the system automatically upgrades the resource configuration. | Helps you adjust the deployment parallelism and resource configuration based on your business requirements. Globally optimizes your deployment. This helps handle performance issues, such as low deployment throughput, upstream and downstream backpressure, and a waste of resources. | |
Scheduled tuning | A scheduled tuning plan maps time points to resource configurations and can contain multiple such mappings. To use a scheduled tuning plan, you must know the resource usage during each period of time and configure resources based on the business characteristics of that period. For example, the business peak hours in a day are from 09:00:00 to 19:00:00, and the off-peak hours are from 19:00:00 to 09:00:00 of the next day. In this case, you can enable scheduled tuning to use 30 CUs for your deployment during the peak hours and 10 CUs during the off-peak hours. | | For more information about how to configure a scheduled tuning plan, see Configure and apply a scheduled plan. |
Limits
A maximum of 20 resource plans can be created.
You cannot modify the deployment parallelism if you enable the Unaligned Checkpoints feature.
Autopilot is not supported for deployments that are deployed in session clusters.
Automatic tuning is not supported for YAML deployments.
The tuning modes are mutually exclusive. You cannot use Autopilot and scheduled tuning at the same time. If you want to change the tuning mode, you must first disable the tuning mode that is in use.
Scheduled tuning plans are mutually exclusive. You can apply only one scheduled tuning plan at a time. If you want to change the scheduled tuning plan, you must first stop the scheduled tuning plan that is in use.
Precautions
After automatic tuning is triggered for a deployment, the deployment is restarted. During the restart process, the deployment stops data consumption for a short period of time.
Note
For Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 8.0.1 or later, after automatic tuning is triggered for a deployment, Realtime Compute for Apache Flink attempts to dynamically update parameter configurations of the deployment. If the dynamic update fails, Realtime Compute for Apache Flink restarts the entire deployment. The service interruption time during dynamic parameter updates is 30% to 98% shorter than the service interruption time during the restart of the entire deployment. The service interruption time depends on the deployment status and logic. Only the configuration of the Parallelism parameter can be dynamically updated. For more information, see Dynamically update the parameter configuration for dynamic scaling.
If you use a DataStream deployment or a custom SQL connector, make sure that the Parallelism parameter is not configured in the code of the deployment. Otherwise, Autopilot or scheduled tuning cannot be triggered to adjust the resources of the deployment, and the automatic tuning configuration does not take effect.
Autopilot cannot resolve all performance bottlenecks of streaming deployments.
The performance of a streaming deployment depends on all of its upstream and downstream stores. If the performance bottleneck of a streaming deployment occurs on Realtime Compute for Apache Flink, you can use Autopilot to optimize the resource configuration. However, Autopilot may fail to work when some conditions are not met. For example, Autopilot assumes that the traffic changes smoothly, no data skew exists, and the throughput of each operator scales linearly as the parallelism of the deployment increases. If the business logic of the deployment deviates significantly from these conditions, the following issues may occur:
The operation to modify the deployment parallelism cannot be triggered, or your deployment cannot reach a normal state and is repeatedly restarted.
The performance of user-defined scalar functions (UDFs), user-defined aggregate functions (UDAFs), or user-defined table-valued functions (UDTFs) deteriorates.
Autopilot cannot identify issues that occur on external systems. If these issues occur, you need to troubleshoot them.
If an external system fails or access to the external system requires a long period of time, Autopilot may increase the deployment parallelism. This increases the load on the external system and may cause the external system to break down. The following common issues may occur on external systems:
DataHub partitions are insufficient or the throughput of ApsaraMQ for RocketMQ is low.
The performance of the sink operator is low.
A deadlock occurs on an ApsaraDB RDS database.
During resource adjustment, the system compares resources and determines the resource adjustment method.
If the resource plan to be applied changes the CPU or memory configuration of the running deployment, the deployment is stopped and then started for the modification to take effect. This may cause service interruptions, data recovery delays, and startup failures due to insufficient resources. If only the parallelism is changed, the deployment is reconfigured in place by dynamically updating the parameters, which reduces the service interruption time. For more information, see Dynamically update the parameter configuration for dynamic scaling.
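The comparison described above can be sketched as follows. This is a minimal illustration, not the service's implementation: the `Resources` type, its field names, and the returned labels are hypothetical. The sketch also ignores the VVR 8.0.1 version requirement and the fallback to a full restart when a dynamic update fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical summary of a deployment's resource configuration."""
    cpu_cores: float
    memory_gib: float
    parallelism: int


def adjustment_method(current: Resources, target: Resources) -> str:
    """Return how a change from `current` to `target` would be applied."""
    if current == target:
        return "none"
    # Any CPU or memory change requires stopping and starting the deployment.
    if (current.cpu_cores, current.memory_gib) != (target.cpu_cores, target.memory_gib):
        return "restart"
    # Only the parallelism differs, so a dynamic update can be attempted.
    return "dynamic-update"
```

For example, changing only the parallelism from 2 to 4 yields "dynamic-update", whereas any change to the memory size yields "restart".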
Enable and configure Autopilot
Procedure
Go to the Autopilot Mode tab.
Log on to the Realtime Compute for Apache Flink console.
Find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane of the development console of Realtime Compute for Apache Flink, choose . On the Deployments page, find the desired deployment and click the name of the deployment.
On the Resources tab, click the Autopilot Mode tab.
Turn on Autopilot.
After you turn on Autopilot, Autopilot Mode Applying is displayed in the upper part of the Resources tab. If you want to disable Autopilot, you can turn off Autopilot or click Turn Off Autopilot in the upper-right corner of the Resources tab.
Click Edit in the upper-right corner of the Configurations section and modify Autopilot-related parameters. The following table describes the parameters.
Parameter | Description |
Autopilot Strategy | Stable Strategy: The system searches for fixed resources or a scheduled tuning plan that is suitable for the entire running cycle, and adjusts the resources of the deployment based on the running status of the deployment in the entire cycle. This strategy reduces the impact of start and stop operations, avoids unnecessary changes and fluctuations, and helps the deployment run stably and reach the convergence state. Note: The system dynamically adjusts the resources of a deployment only when it finds a resource configuration that is more suitable for the entire running cycle of the deployment. Otherwise, the system does not modify the existing resource configuration. After the deployment runs stably, the system generates a resource adjustment plan. You can save and apply the plan. For more information, see Save a resource adjustment plan. Adaptive Strategy: The system dynamically modifies the resource configuration based on the real-time resources and metrics of the deployment. The system focuses on the current latency and resource usage of the deployment and quickly optimizes the resource configuration when the related metrics change. This strategy allows the system to respond quickly to deployment requirements and improves the efficiency and adaptability of resource configurations. |
Cooldown Minutes | The minimum interval before Autopilot can be triggered again after a deployment is restarted due to Autopilot. |
Max CPU | The maximum number of CPU cores that can be allocated for the automatic resource configuration of a deployment. The default value of this parameter varies based on the tuning strategy. |
Max Memory | The maximum amount of memory that can be allocated for the automatic resource configuration of a deployment. The default value of this parameter varies based on the tuning strategy. |
Max Delay | The maximum delay that is allowed. Unit: minutes or seconds. |
More Configurations | You can configure the following parameters for Stable Strategy and Adaptive Strategy:
mem.scale-down.interval: the minimum interval at which Autopilot is triggered to decrease the memory size. Default value: 24. Unit: hours. The system checks the memory usage of the deployment at an interval of 24 hours. If the memory usage is less than the specified threshold, the system decreases the memory size or recommends decreasing it.
parallelism.scale.max: the upper limit of the parallelism when the value of the Parallelism parameter is increased. Default value: -1. This value indicates that the maximum parallelism is not limited. Note: For message queue services, such as ApsaraMQ for Kafka, ApsaraMQ, and Simple Log Service, the parallelism for automatic tuning is affected by the number of partitions and cannot exceed the number of partitions. If the maximum parallelism exceeds the number of partitions, the system automatically changes the value of parallelism to the number of partitions.
parallelism.scale.min: the lower limit of the parallelism when the value of the Parallelism parameter is decreased. Default value: 1.
delay-detector.scale-up.threshold: the maximum delay that is allowed. The throughput of the deployment is measured based on the delay of source data consumption. Default value: 1. Unit: minutes. If the data processing capability is insufficient and the data processing delay is longer than 1 minute, the system performs a scale-up operation to increase the throughput of the deployment or recommends a scale-up operation. To scale up, the system can increase the parallelism or split operator chains.
slot-usage-detector.scale-up.threshold: the slot usage threshold that triggers an increase of parallelism. Source operators are excluded. If the percentage of time that a vertex operator spends processing data is continuously greater than the value of this parameter, the parallelism is increased. Default value: 0.8.
slot-usage-detector.scale-down.threshold: the slot usage threshold that triggers a decrease of parallelism. Source operators are excluded. If the percentage of time that a vertex operator spends processing data is continuously less than the value of this parameter, the parallelism is decreased to reduce resource usage. Default value: 0.2.
slot-usage-detector.scale-up.sample-interval: the interval over which the slot usage metric is sampled and averaged. Default value: 3. Unit: minutes. This parameter takes effect together with the slot-usage-detector.scale-up.threshold and slot-usage-detector.scale-down.threshold parameters. If the average slot usage in a 3-minute period is greater than 0.8, the scale-up operation is performed. If the average slot usage in a 3-minute period is less than 0.2, the scale-down operation is performed.
resources.memory-scale-up.max: the maximum memory size of a TaskManager and the JobManager. Default value: 16. Unit: GiB. When Autopilot adjusts the memory of a TaskManager or the JobManager, or increases the parallelism, the upper limit of memory is 16 GiB.
|
In the upper-right corner of the Configurations section, click Save.
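To make the interaction between the parallelism limits concrete, the following sketch collects the documented default values in a dictionary and clamps a requested parallelism. The parameter names and defaults come from the preceding table. The dictionary representation, the `effective_parallelism` helper, and the plain numeric value format are assumptions for illustration and may differ from what the console accepts.

```python
# Documented defaults; units are noted in comments (value format is assumed).
DEFAULTS = {
    "mem.scale-down.interval": 24,                      # hours
    "parallelism.scale.max": -1,                        # -1 means not limited
    "parallelism.scale.min": 1,
    "delay-detector.scale-up.threshold": 1,             # minutes
    "slot-usage-detector.scale-up.threshold": 0.8,
    "slot-usage-detector.scale-down.threshold": 0.2,
    "slot-usage-detector.scale-up.sample-interval": 3,  # minutes
    "resources.memory-scale-up.max": 16,                # GiB
}


def effective_parallelism(requested, source_partitions=None, overrides=None):
    """Clamp a requested parallelism to the configured limits (hypothetical helper)."""
    cfg = {**DEFAULTS, **(overrides or {})}
    result = max(requested, cfg["parallelism.scale.min"])
    if cfg["parallelism.scale.max"] != -1:
        result = min(result, cfg["parallelism.scale.max"])
    # For message queue sources, parallelism cannot exceed the partition count.
    if source_partitions is not None:
        result = min(result, source_partitions)
    return result
```

With the defaults, a requested parallelism of 32 against a source that has 8 partitions is reduced to 8.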
Save a resource adjustment plan
After the deployment for which Stable Strategy is applied runs stably, the system automatically generates a fixed resource plan or a scheduled plan. You can manually view, analyze, save, or apply the resource adjustment plan. The following table describes the details of the plans.
Plan | Description | Remarks |
Fixed resource plan | Generates a single resource configuration that does not contain the time dimension. In the right corner of the Resources tab, click Details. On the page that appears, set Recommended Plan to Specified resource and click Save. In the dialog box that appears, click Confirm. | After you click Confirm, the resource configurations on the Configurations tab are replaced by the estimated resource configurations. The new configurations are applied the next time the deployment is started. |
Scheduled plan (in public preview) | Generates time periods and resource configurations in each time period. You can save a scheduled plan and apply it. For more information, see Save and apply a scheduled plan. | After the scheduled plan is applied, the tuning mode is automatically changed from Autopilot to scheduled tuning. Resources are adjusted until the deployment runs stably. |
Configure and apply a scheduled plan
Procedure
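This section covers two procedures: creating and applying a scheduled plan, and saving and applying a scheduled plan that Autopilot generates.
Create and apply a scheduled plan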
Go to the Scheduled Mode tab.
Log on to the Realtime Compute for Apache Flink console.
Find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, choose . On the Deployments page, click the name of the deployment that you want to manage.
On the Resources tab, click the Scheduled Mode tab.
Click New Plan.
In the Resource Setting section of the New Plan panel, configure the resource configuration parameters.
Trigger Period: You can select No Repeat, Every Day, Every Week, or Every Month from the drop-down list. If you set this parameter to Every Week or Every Month, you must specify the time range during which you want the plan to take effect.
Trigger Time: Specify the time at which you want the plan to take effect.
Mode: You can select Basic or Expert based on your business requirements. For more information, see Configure resources for a deployment.
Other parameters: For more information, see Parameters.
Optional. Click New Resource Setting Period and configure the Trigger Time parameter and resource configuration parameters.
You can configure resource tuning plans for multiple time periods in the same scheduled tuning plan.
Important
In the same scheduled tuning plan, the interval between the value of the Trigger Time parameter that is added after you click New Resource Setting Period and the existing value of the Trigger Time parameter must be greater than 30 minutes. Otherwise, the new resource configuration cannot be saved.
Find the desired scheduled tuning plan in the Resource Plans section of the Scheduled Mode tab, and click Apply in the Actions column.
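The spacing rule for Trigger Time values in the preceding steps can be checked before you enter a plan in the console. The following sketch is only an illustration: the `trigger_times_valid` helper is hypothetical, it compares times within a single day, and it does not consider the gap across midnight, which the rule above does not specify.

```python
from datetime import time


def trigger_times_valid(times, min_gap_minutes=30):
    """Return True if every pair of adjacent trigger times is more than
    `min_gap_minutes` apart (same-day comparison only)."""
    minutes = sorted(t.hour * 60 + t.minute for t in times)
    gaps = (later - earlier for earlier, later in zip(minutes, minutes[1:]))
    # The interval must be strictly greater than 30 minutes.
    return all(gap > min_gap_minutes for gap in gaps)
```

For example, trigger times of 09:00 and 19:00 pass the check, whereas 09:00 and 09:30 do not because the interval is not greater than 30 minutes.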
Save and apply a scheduled plan
After the deployment for which Stable Strategy is applied runs stably, the system automatically generates a scheduled plan. You can manually view, analyze, save, or apply the plan.
Go to the Autopilot Mode tab.
Log on to the Realtime Compute for Apache Flink console.
Find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, choose . On the Deployments page, click the name of the deployment that you want to manage.
On the Resources tab, click the Autopilot Mode tab.
In the right corner of the Resources tab, click Details. On the page that appears, set Recommended Plan to Scheduled plan.
Configure the scheduled plan.
Action | Description | Remarks |
1. Specify the Max Change Count parameter. | You can specify the maximum number of changes that can be applied to the scheduled plan. | You can perform this action 2 to 5 times. |
2. Click Merge time periods. | You can merge time periods based on the maximum number of changes you specified. | You must scale resources up or down before merging to meet your business requirements. |
View and modify the merged resource configurations. For more information, see Configure resources for a deployment.
Click Save in the lower-left corner.
In the dialog box that appears, specify the Scheduled plan name parameter or select Apply this plan immediately and click Confirm.
After the scheduled plan is applied, the tuning mode is automatically changed from Autopilot to scheduled tuning. Resources are adjusted until the deployment runs stably.
Example
In this example, the peak hours of the deployment are from 09:00:00 to 19:00:00 every day. You can use 30 CUs for your deployment during peak hours. The off-peak hours of the deployment are from 19:00:00 to 09:00:00 of the next day. You can use 10 CUs for your deployment during off-peak hours. The following figure shows the configuration result of the tuning strategy in this example.
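The plan in this example can be expressed as a small lookup from time of day to CUs. The sketch below uses the values from the example. The `PLAN` structure and the `cus_at` function are hypothetical and only illustrate how a daily plan with two trigger times resolves to a resource size.

```python
from datetime import time

# (trigger time, CUs) pairs from the example: 30 CUs from 09:00, 10 CUs from 19:00.
PLAN = [(time(9, 0), 30), (time(19, 0), 10)]


def cus_at(now, plan=PLAN):
    """Return the CUs in effect at `now` for a plan that repeats every day."""
    entries = sorted(plan)
    # Before the first trigger of the day, the last entry of the previous day applies.
    current = entries[-1][1]
    for trigger, cus in entries:
        if now >= trigger:
            current = cus
    return current
```

At 08:59:59 the deployment still uses the off-peak size of 10 CUs, and from 09:00:00 it uses 30 CUs.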

Default tuning actions of Autopilot
If you enable Autopilot, the system automatically optimizes resource configurations from the perspectives of parallelism and memory.
Autopilot enables the system to adjust the deployment parallelism to meet the throughput requirements, which change with the deployment traffic.
The system monitors the delay of consumption of the source data, the actual CPU utilization of TaskManagers, and the data processing capability of each operator to adjust the deployment parallelism. The system adjusts the deployment parallelism based on the following rules:
If the deployment delay does not exceed the delay threshold, the system does not modify the parallelism of the deployment. The default threshold is 60s.
If the deployment delay exceeds the threshold, the system determines whether to increase the parallelism of the deployment based on the following conditions:
If the deployment delay is decreasing, the system does not adjust the parallelism of the deployment.
If the deployment delay continuously increases for 3 minutes (default value), the system increases the parallelism of the deployment so that the processing capacity is twice the current actual transactions per second (TPS). The resulting resources cannot exceed the maximum number of CUs. By default, the maximum number of CUs is 64.
If the delay metric does not exist for the deployment, the system adjusts the parallelism of the deployment based on the following conditions:
If the percentage of the data processing time of a vertex node exceeds 80% for 6 consecutive minutes, the system increases the parallelism of the deployment to reduce the slot utilization to 50%. The number of CUs cannot exceed the specified maximum number of CUs. By default, the maximum number of CUs is 64.
If the average CPU utilization of all TaskManagers exceeds 80% in 6 minutes, the system increases the parallelism of the deployment to reduce the average CPU utilization to 50%.
If the maximum CPU utilization of all TaskManagers is less than 20% in 24 hours and the percentage of the data processing time of a vertex node is less than 20%, the system decreases the parallelism of the deployment to increase the CPU utilization and the percentage of the actual data processing time of the vertex node to 50%.
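The parallelism rules above can be summarized as two checks plus a proportional target. The sketch below is an interpretation, not the actual Autopilot algorithm. All names are hypothetical, the inputs are assumed to be already aggregated over the stated windows (3 minutes, 6 minutes, 24 hours), the scale-down check is treated as independent of the delay rules, and the proportional formula in `target_parallelism` assumes that throughput scales linearly with parallelism.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaleUpInput:
    delay_s: Optional[float]      # None if the deployment has no delay metric
    delay_rising_minutes: float   # minutes the delay has continuously increased
    vertex_busy_ratio: float      # share of time a vertex spends processing (6 min)
    avg_tm_cpu: float             # average TaskManager CPU utilization (6 min)


def should_scale_up(m, delay_threshold_s=60.0, rising_minutes=3.0, busy_threshold=0.8):
    if m.delay_s is not None:
        if m.delay_s <= delay_threshold_s:
            return False  # delay within the threshold: parallelism unchanged
        # A decreasing delay has delay_rising_minutes == 0 and is left alone.
        return m.delay_rising_minutes >= rising_minutes
    # No delay metric: fall back to slot usage and CPU utilization.
    return m.vertex_busy_ratio > busy_threshold or m.avg_tm_cpu > busy_threshold


def should_scale_down(max_tm_cpu_24h, vertex_busy_ratio_24h, threshold=0.2):
    return max_tm_cpu_24h < threshold and vertex_busy_ratio_24h < threshold


def target_parallelism(current, utilization, goal=0.5):
    """Assumed linear model: pick the parallelism that brings utilization to `goal`."""
    return max(1, math.ceil(current * utilization / goal))
```

Under the linear assumption, a deployment that runs at parallelism 8 with 90% utilization would be scaled to parallelism 15 to approach 50% utilization.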
Autopilot also enables the system to monitor the memory usage and failovers of a deployment to adjust the memory configuration of the deployment. The system adjusts the memory size of the deployment based on the following rules:
If the JobManager encounters frequent garbage collections (GCs) or an out of memory (OOM) error, the system increases the memory size of the JobManager. By default, the memory size of the JobManager can be adjusted up to 16 GiB.
If frequent GCs, an OOM error, or a HeartBeatTimeout error occurs on a TaskManager, the system increases the memory size of the TaskManager. By default, the maximum memory size of a TaskManager is 16 GiB.
If the memory usage of a TaskManager exceeds 95%, the system increases the memory size of the TaskManager.
If the actual memory usage of a TaskManager falls below 30% for 24 consecutive hours, the system decreases the memory size of the TaskManager. By default, the minimum memory size of a TaskManager is 1.6 GiB.
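The TaskManager memory rules above can be read as a single decision function. The following sketch is a simplified illustration under stated assumptions: the function name, its parameters, and the returned labels are hypothetical, the inputs are assumed to be already aggregated, and applying the 16 GiB upper limit to the 95% usage rule is an assumption.

```python
def tm_memory_action(frequent_gc, oom, heartbeat_timeout, usage,
                     hours_below_30pct, memory_gib,
                     max_gib=16.0, min_gib=1.6):
    """Return "increase", "decrease", or "keep" for a TaskManager."""
    pressure = frequent_gc or oom or heartbeat_timeout or usage > 0.95
    # Memory pressure triggers an increase, up to the default 16 GiB limit.
    if pressure and memory_gib < max_gib:
        return "increase"
    # Sustained low usage for 24 hours triggers a decrease, down to 1.6 GiB.
    if not pressure and usage < 0.30 and hours_below_30pct >= 24 and memory_gib > min_gib:
        return "decrease"
    return "keep"
```

For example, a TaskManager with 4 GiB of memory whose usage stays at 20% for 24 hours is a candidate for a decrease, whereas one that is already at 1.6 GiB is left unchanged.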
References
The intelligent deployment diagnostics feature can help you monitor the health status of your deployments and ensure the stability and reliability of your business. For more information, see Perform intelligent deployment diagnostics.
You can use deployment configurations and Flink SQL optimization to improve the performance of Flink SQL deployments. For more information, see Optimize Flink SQL.