By Chen Wuchao (Zhongzhuo), Alibaba Technical Expert
Deep learning is becoming increasingly essential in contemporary society. It is already widely used in fields such as personalized recommendation, product search, face recognition, machine translation, and autonomous driving, and it continues to spread rapidly into other aspects of daily life.
As deep learning applications become more diverse, many excellent computing frameworks have emerged; among them, TensorFlow, PyTorch, and MXNet are widely used and have attracted a great deal of attention. Applying deep learning to real scenarios usually also requires a data processing engine: training data must be processed into training samples before model training, and data processing metrics must be monitored during model prediction. Because data processing and model training run on different computing engines, users face extra complexity.
This article explains how to use a single engine for implementing the entire machine learning process. The following figure shows a typical machine learning workflow, which consists of feature engineering, model training, and offline or online model prediction.
Logs are generated in each phase of the machine learning process. To start, we need to use a data processing engine, such as Flink, to analyze these logs before proceeding to feature engineering. Then, we use TensorFlow, a computing engine for deep learning, to complete model training and prediction. After model training, we use TensorFlow Serving for online scoring.
This process is feasible, but it leads to some problems:
1) In a single machine learning project, we must use two computing engines, Flink and TensorFlow, to implement feature engineering, model training, and model prediction. Deploying and operating two engines is difficult.
2) TensorFlow is inconvenient to use in distributed environments because it requires users to specify machine IP addresses and port numbers. However, actual production jobs usually run on a scheduling system, such as YARN, which allocates IP addresses and port numbers dynamically.
3) TensorFlow does not support automatic failover for distributed operations.
Running TensorFlow in a Flink cluster solves the preceding problems. The following figure shows the schematic diagram of a system that uses TensorFlow on Flink.
Feature engineering is implemented by Flink. Model training and quasi-real-time model prediction are implemented by TensorFlow, which runs inside the Flink cluster. As a result, the whole pipeline runs on a single Flink deployment, which simplifies deployment and saves resources.
Flink is an open-source distributed computing engine for big data. As shown in the preceding figure, all computations in Flink are abstracted into operators: the nodes that read data are called source operators, the nodes that output data are called sink operators, and the processing logic between them is implemented by various other Flink operators. The preceding computing topology includes three source operators and two sink operators.
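To make the operator model concrete, here is a minimal sketch of such a topology written with Flink's standard DataStream API; the data and the processing logic are made up purely for illustration.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SimpleTopology {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("log-1", "log-2", "log-3")                          // source operator
           .map((MapFunction<String, String>) line -> "parsed:" + line)     // intermediate operator
           .print();                                                        // sink operator

        env.execute("simple-flink-topology");
    }
}
```

The fromElements, map, and print calls correspond to the source, intermediate, and sink operators of the topology described above.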
The following figure shows the distributed topology of machine learning.
Nodes in a machine learning cluster are often divided into different groups, as shown in the preceding figure. A group of nodes may contain workers that run the algorithm, or parameter servers (PSs) that store and update the model parameters.
How do we combine the operator structure of Flink with the nodes and application managers of machine learning? The following section explains the abstraction of Flink-AI-Extended in detail.
The machine learning cluster is abstracted into the ML framework, and the ML operator connects it to Flink. Together, these two modules bridge Flink and the machine learning cluster and provide support for different computing engines, such as TensorFlow. This is shown in the following figure.
In other words, the Flink runtime environment is extended with the ML framework and ML operator abstractions, which connect Flink to other computing engines.
The ML framework has two roles: an application manager and a node.
1) An application manager manages the lifecycles of all its nodes.
2) Nodes are responsible for running the algorithm programs for machine learning.
The application manager and the node are further abstracted. The application manager's state machine can be extended to support different types of jobs, and each deep learning engine can customize its own state machine. The node is abstracted into a runner interface, which allows custom algorithm programs to be written for different deep learning engines.
The ML operator provides the following two interfaces:
1) The addAMRole interface is used to add an application manager to a Flink job. As shown in the preceding figure, the application manager is a management node of the machine learning cluster.
2) The addRole interface is used to add a group of machine learning nodes.
Using these two interfaces of the ML operator, we add the following Flink operators to the job: an application manager and three groups of nodes, called Role A, Role B, and Role C. The three node groups form a machine learning cluster, as shown by the code in the preceding figure. Each Flink operator corresponds to a node of the machine learning job. A rough sketch of this pattern follows.
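The sketch below only illustrates the pattern described above; the class name, method signatures, role names, and node counts are assumptions made for illustration and may differ from Flink-AI-Extended's actual API.

```java
// Illustrative pseudo-code only: the method names follow the article's description
// (addAMRole / addRole); the exact Flink-AI-Extended signatures may differ, and the
// role names and node counts are arbitrary examples.
MLOperator ml = new MLOperator(flinkJob);   // hypothetical handle to the Flink job

ml.addAMRole();            // one application manager operator (parallelism 1)
ml.addRole("Role A", 3);   // a group of three machine learning nodes
ml.addRole("Role B", 2);   // a second group of nodes
ml.addRole("Role C", 2);   // a third group of nodes
```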
A machine learning node runs inside a Flink operator, and the two need to exchange data with each other, as shown in the following figure.
The Flink operator is a Java process, and the machine learning node is a Python process. The two processes exchange data with each other through memory sharing.
TensorFlow distributed training involves two roles: worker and PS. Workers perform the training computation, and PSs store and update the model parameters. The following sections describe how TensorFlow runs in a Flink cluster.
In batch mode, sample data can be stored in the Hadoop Distributed File System (HDFS). The Flink job reads the data through source operators, and the TensorFlow worker role is started inside these source operators. As shown in the preceding figure, the worker role has three nodes, so the parallelism of its source operator is set to 3. Similarly, the PS role has two nodes, so the parallelism of its source operator is set to 2. The application manager does not exchange data with the other roles, so it is an independent node whose source parallelism is always 1. In this way, the Flink job starts three worker nodes and two PS nodes, which communicate with each other through TensorFlow's gRPC rather than Flink's communication mechanism.
As shown in the preceding figure, in stream mode two source operators are connected to a join operator, which merges the two streams into one; a custom processing node then generates the sample data. The worker role is implemented by a flatMap or UDTF operator.
Because there are three TensorFlow worker nodes, the parallelism of the flatMap or UDTF operator is set to 3. The PS role does not read upstream data, so PS nodes are still implemented by Flink source operators. A sketch of the worker side is shown below.
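The following sketch uses the standard Flink DataStream API to show how the worker operator's parallelism is matched to the number of TensorFlow worker nodes. The sample data, scoring logic, and names are placeholders, and the actual hand-off to the embedded TensorFlow worker process is omitted.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamWorkerSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In the real job this stream would be produced by joining two sources;
        // a single placeholder source stands in for the merged sample stream here.
        DataStream<String> samples = env.fromElements("sample-1", "sample-2", "sample-3");

        samples
            .flatMap(new FlatMapFunction<String, Float>() {
                @Override
                public void flatMap(String sample, Collector<Float> out) {
                    // In Flink-AI-Extended this is where the sample would be handed to the
                    // embedded TensorFlow worker process; a dummy score is emitted instead.
                    out.collect((float) sample.length());
                }
            })
            .setParallelism(3)   // one flatMap subtask per TensorFlow worker node
            .print();

        env.execute("stream-mode-worker-sketch");
    }
}
```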
The following sections describe how to implement real-time prediction using a trained model.
The following figure shows the process of model prediction using Python. Some scenarios, such as recommendation and search, may result in large models that are trained by TensorFlow in distributed mode. A model of this type cannot be stored on a single machine. Real-time prediction works in the same way as real-time training but includes the extra process of model loading.
During model prediction, the model is read and all of its parameters are loaded into the PS nodes. The upstream data is processed in the same way as during model training and transmitted to the worker nodes, which compute the prediction scores. The scores are then written back to the Flink operators and sent to the downstream operators.
As shown in the preceding figure, when prediction takes place on a single machine, PS nodes are not started because the model is small enough to fit on a single worker node, especially when it is exported as a TensorFlow SavedModel. A SavedModel contains the complete computation graph together with the input and output signatures for prediction, so prediction can be performed without running any Python code.
There is another way to implement model prediction. The source operators, join operator, and UDTF process the data into the format expected by the model, and the TensorFlow Java API is used to load the trained model directly into the memory of the Java process. In this case, the PS role is not required, and the worker role is assumed by the Java process rather than a Python process. Model prediction is then performed directly in the Java process, and the prediction results are sent to the downstream Flink operators.
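A minimal sketch of this pure-Java prediction path is shown below. It uses Flink's RichMapFunction together with the TensorFlow 1.x Java API; the model path and the tensor names "input" and "score" are placeholder assumptions and should be replaced with the signature names of your own exported SavedModel.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;

import java.util.List;

// Sketch of the pure-Java prediction path: each parallel instance of this map
// operator loads the exported SavedModel once and scores records in process.
// "/path/to/saved_model" and the tensor names "input"/"score" are placeholders.
public class SavedModelScorer extends RichMapFunction<float[], Float> {

    private transient SavedModelBundle model;

    @Override
    public void open(Configuration parameters) {
        // Load the trained model into memory inside the Flink (Java) process.
        model = SavedModelBundle.load("/path/to/saved_model", "serve");
    }

    @Override
    public Float map(float[] features) {
        try (Tensor<?> input = Tensor.create(new float[][] { features })) {
            List<Tensor<?>> outputs = model.session().runner()
                    .feed("input", input)
                    .fetch("score")
                    .run();
            try (Tensor<?> score = outputs.get(0)) {
                float[][] result = new float[1][1];
                score.copyTo(result);
                return result[0][0];
            }
        }
    }

    @Override
    public void close() {
        if (model != null) {
            model.close();
        }
    }
}
```

Each parallel instance of the operator loads the model once in open() and scores records entirely inside the Flink process, so no Python runtime or PS node is needed.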
This article explains how Flink-AI-Extended works and how to implement model training and prediction with TensorFlow on Flink. I hope it helps you use Flink-AI-Extended effectively and implement model training and prediction within a single Flink job.