Security white paper

Realtime Compute for Apache Flink is fully compatible with Apache Flink APIs and provides security hardening features that protect your data in the following areas: tenant isolation, access control, network isolation, encryption, backup and restoration, and operation auditing with ActionTrail.

Tenant isolation

Realtime Compute for Apache Flink supports multi-tenant scenarios. The Alibaba Cloud account authentication system uses AccessKey pairs to authenticate each HTTP request from users: the request is signed with the AccessKey secret, which serves as a symmetric key, and the signature is verified when the request is received. Data of different users is isolated and stored separately in a distributed file system. This design supports multi-user collaboration and data sharing while preserving data confidentiality and data security, and achieves real multi-tenant resource isolation.
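The following Python sketch illustrates the general idea of symmetric-key signature authentication for HTTP requests. It is a minimal illustration and not the actual Alibaba Cloud signature algorithm: the canonical string layout, the choice of HMAC-SHA256, and the names sign_request and verify_request are assumptions made for this example.

```python
import base64
import hashlib
import hmac


def sign_request(secret: str, method: str, path: str, params: dict) -> str:
    """Return a Base64-encoded HMAC-SHA256 signature of a canonical request string.

    The canonical layout used here is an assumption for illustration only.
    """
    # Sort the parameters so that both sides build the same string to sign.
    canonical_query = "&".join(f"{key}={params[key]}" for key in sorted(params))
    string_to_sign = f"{method}\n{path}\n{canonical_query}"
    digest = hmac.new(secret.encode(), string_to_sign.encode(), hashlib.sha256).digest()
    return base64.b64encode(digest).decode()


def verify_request(secret: str, method: str, path: str, params: dict, signature: str) -> bool:
    """Recompute the signature with the shared secret and compare in constant time."""
    expected = sign_request(secret, method, path, params)
    return hmac.compare_digest(expected, signature)
```

Because both parties hold the same secret, the secret itself never travels with the request. Any change to the method, path, or parameters produces a different signature, so a tampered request fails verification.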

Access control

Multi-dimensional access control methods are provided to ensure data security.

RAM

Alibaba Cloud provides Resource Access Management (RAM) to help you manage the operation permissions of different RAM users on Realtime Compute for Apache Flink resources. RAM also allows you to log on to the Realtime Compute for Apache Flink console as a member in a resource directory or a CloudSSO user. For more information, see What is RAM? and Supported logon methods.

Namespace permission management

Realtime Compute for Apache Flink allows you to manage permissions on namespaces in a flexible and secure manner. When multiple users develop deployments and perform O&M in the same namespace, you can define roles and configure fine-grained permissions based on your business requirements. For more information, see Grant permissions on namespaces.

Whitelist

By default, the upstream and downstream storage systems of Realtime Compute for Apache Flink deny access from external devices. Therefore, you must add the CIDR block of the vSwitch of Realtime Compute for Apache Flink to the whitelist of each storage system that Realtime Compute for Apache Flink needs to access. Even if your vSwitch is not in the same zone as the upstream and downstream storage systems, the network connection can be established after you add the CIDR block of the vSwitch to the whitelist. For more information, see FAQ about network connectivity.
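The effect of a CIDR-based whitelist can be sketched in Python with the standard ipaddress module. The CIDR block and IP addresses below are hypothetical examples, and is_allowed is an illustrative helper rather than part of any Alibaba Cloud API.

```python
import ipaddress


def is_allowed(source_ip: str, whitelist: list) -> bool:
    """Return True if source_ip falls inside any CIDR block in the whitelist."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in whitelist)


# Hypothetical CIDR block of the vSwitch, added to the storage system's whitelist.
VSWITCH_CIDR = "192.168.10.0/24"
whitelist = [VSWITCH_CIDR]

print(is_allowed("192.168.10.37", whitelist))  # address inside the vSwitch CIDR block
print(is_allowed("192.168.11.5", whitelist))   # address outside the vSwitch CIDR block
```

Adding the whole CIDR block instead of individual IP addresses matters because the addresses assigned within the vSwitch can vary between deployment runs.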

Access to a Hive cluster that supports Kerberos authentication

Kerberos is a network authentication protocol that verifies identities to secure communication. If the Hive cluster that your Realtime Compute for Apache Flink deployment needs to access supports Kerberos authentication, you must register the Hive cluster in the Realtime Compute for Apache Flink console and configure the Hive cluster in the deployment. For more information, see Register a Hive cluster that supports Kerberos authentication.

Network isolation

Realtime Compute for Apache Flink can access upstream and downstream storage services over virtual private clouds (VPCs) or the Internet. To ensure security, we recommend that you configure access over VPCs. You can also manage the domain names of upstream and downstream storage services in the Realtime Compute for Apache Flink console.

VPC

A VPC is a private network that is isolated from other networks at the network layer on top of physical-layer protocols. VPCs provide high security, reliability, flexibility, scalability, and ease of use. For more information, see What is a VPC?

Internet

You can use Network Address Translation (NAT) gateways of Alibaba Cloud to set up connections between VPCs and the Internet. This way, Realtime Compute for Apache Flink can access upstream and downstream services over the Internet. For security reasons, this access method is not recommended. For more information about network connectivity, see FAQ about network connectivity.

Domain name management

You can manage the domain names of upstream and downstream services in the Realtime Compute for Apache Flink console.

Encryption

Key management

You can configure keys and use them in the DDL statements or the log configuration of SQL deployments instead of writing AccessKey pairs in plaintext. This prevents the security risks that are caused by plaintext AccessKey pairs. For more information, see Manage variables.
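The idea of keeping credentials out of DDL text can be sketched as placeholder substitution in Python. This is an illustration only: the ${secret_values.NAME} placeholder syntax, the table definition, and the resolve_keys helper are assumptions for this example, and the actual syntax is described in Manage variables.

```python
import re

# Assumed placeholder syntax for this sketch: ${secret_values.NAME}
PLACEHOLDER = re.compile(r"\$\{secret_values\.([A-Za-z0-9_]+)\}")


def resolve_keys(ddl: str, keys: dict) -> str:
    """Replace each placeholder in the DDL text with the value of the named key."""
    def substitute(match):
        name = match.group(1)
        if name not in keys:
            raise KeyError(f"undefined key: {name}")
        return keys[name]

    return PLACEHOLDER.sub(substitute, ddl)


# Hypothetical DDL statement that references keys instead of plaintext credentials.
ddl = (
    "CREATE TABLE src (id INT) WITH ("
    "'accessId' = '${secret_values.ak_id}', "
    "'accessKey' = '${secret_values.ak_secret}');"
)
```

With this pattern, the stored DDL text contains only key names. The secret values are supplied separately when the deployment is prepared, so they do not appear in code reviews, version history, or shared drafts.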

Backup and restoration

Multiple backup methods are provided to persist and restore data.

Data backup

Realtime Compute for Apache Flink uses an architecture that separates storage from computing and uses Object Storage Service (OSS) to store deployment data such as checkpoints, savepoints, logs, and JAR packages. Realtime Compute for Apache Flink creates different directories in the bucket that you select to store different types of data. The default retention period is seven days. For more information, see Activate Realtime Compute for Apache Flink.

Data restoration

Manually create a savepoint: You can manually create a savepoint for a deployment at a specific point in time, such as while the deployment is running or when the deployment is canceled, and later restore the deployment from that savepoint. This feature can be used in scenarios such as data restoration, quick business deployment, and data verification.
Configure scheduled generation of savepoints: You can specify a savepoint creation cycle so that the system creates savepoints on a schedule. After you save the rules for scheduled generation, the system automatically creates savepoints while the deployment is running, and you do not need to create them manually.
Restore a deployment from a specified savepoint of another deployment: You can specify a savepoint of another deployment and restore the current deployment from it.
Note
If you want to share savepoints across deployments, you must make sure that the state data of the deployments is compatible. For example, you can perform a dual-run test to check the compatibility of the state data.

Deployment status backup

To view the status set of a deployment, go to the Deployments page in the Realtime Compute for Apache Flink console and click the name of the desired deployment. Then, click the Status tab on the deployment details page. For more information, see View the state generation overview.

Quick task restart

If a task in a streaming deployment fails, all tasks in the same pipeline region perform a failover to ensure data consistency. After a deployment performs a failover, the source node resumes consuming data from the previous checkpoint. However, the tasks of some deployments must download large resource files or state data after a failover. If the parallelism of such a deployment is high, the failover may take a long time, and the deployment may be delayed or blocked and unable to consume data until recovery is complete. Quick task restart effectively resolves this issue. For more information, see Configure quick task restart.
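The scope of a failover can be sketched in Python. The sketch assumes that a pipeline region is a set of tasks connected by pipelined data exchanges, and it groups tasks into regions with a union-find structure. It is a simplified model and not the scheduler's implementation: the task names and the functions pipeline_regions and tasks_to_restart are hypothetical, and the real failover logic may restart additional regions.

```python
from collections import defaultdict


def pipeline_regions(tasks, pipelined_edges):
    """Group tasks into regions: connected components over pipelined edges."""
    parent = {task: task for task in tasks}

    def find(task):
        # Walk up to the representative of the set, halving the path on the way.
        while parent[task] != task:
            parent[task] = parent[parent[task]]
            task = parent[task]
        return task

    for upstream, downstream in pipelined_edges:
        parent[find(upstream)] = find(downstream)

    regions = defaultdict(set)
    for task in tasks:
        regions[find(task)].add(task)
    return list(regions.values())


def tasks_to_restart(failed_task, tasks, pipelined_edges):
    """Return every task that shares a pipeline region with the failed task."""
    for region in pipeline_regions(tasks, pipelined_edges):
        if failed_task in region:
            return region
    raise ValueError(f"unknown task: {failed_task}")


# Hypothetical deployment with two independent pipelines.
tasks = ["source-1", "map-1", "sink-1", "source-2", "sink-2"]
edges = [("source-1", "map-1"), ("map-1", "sink-1"), ("source-2", "sink-2")]
print(sorted(tasks_to_restart("map-1", tasks, edges)))
```

In this model, a failure of map-1 restarts source-1, map-1, and sink-1 but leaves the second pipeline untouched. A highly parallel deployment whose tasks are all connected forms one large region, which is why a single failed task can force every task to reload its resources and state.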

Cross-zone high availability

The cross-zone high availability feature provides zone-level disaster recovery. Namespaces that use cross-zone compute units support scheduling and switchover across zones in the same city. If a fault occurs in the zone to which a namespace belongs, the deployments in the namespace are restored in the secondary zone. This effectively prevents service interruptions caused by faults in a single zone and ensures the continuity and high availability of deployments. For more information, see Cross-zone high availability.

ActionTrail

ActionTrail is a service that monitors and records the operations of your Alibaba Cloud account. The operations include your access to and use of cloud services by using the Alibaba Cloud Management Console, APIs, and SDKs. ActionTrail records these operations as events. You can download these events and deliver them to Simple Log Service Logstores or OSS buckets. Then, you can perform behavior analysis, security analysis, resource change tracking, and compliance auditing based on the events. For more information about ActionTrail, see What is ActionTrail?

Realtime Compute for Apache Flink is integrated with ActionTrail. You can view resource operation events and related information in the ActionTrail console free of charge. For more information, see View audit events of Realtime Compute for Apache Flink.