Introduction to EMR DataScience

By Li Bo (Aohai), an AI product expert at Alibaba Cloud who has worked in the AI industry for five years and is responsible for AI platform products

This article provides an overview of the DataScience node of E-MapReduce, including the DataScience deep learning framework, the PAI-Alink machine learning algorithm platform for unified streaming and batch processing, and DataScience atomic components, such as AutoML, FaissServer, and PAI-EAS.

This article was adapted from a PowerPoint presentation and can be divided into two main parts:

DataScience Overview
DataScience Atomic Components

DataScience Overview

DataScience is an E-MapReduce (EMR) compute node for AI services. The Alibaba Cloud Platform for AI (PAI) team built it on top of open-source big data frameworks and systems. The node will be created before the Spark "Digital Body" AI Challenge. To use it, open the EMR product, select the DataScience node, and select the components that you want to use. DataScience is compatible with Hadoop 3.x and EMR 4.2.x.

The following figure shows the capabilities provided by DataScience. DataScience provides end-to-end services throughout the lifecycle of machine learning data modeling.

At the storage layer, DataScience can read data from HDFS and OSS.

The computing framework layer consists of two parts. The first is the traditional machine learning framework, which provides services through AlinkServer and is built on VVP, the commercial Flink framework. The second is the deep learning framework, which includes TensorFlow and PyTorch. You can use AlinkServer to build a traditional machine learning model, or use TensorFlow and PyTorch to build a deep learning model. Because this AI Challenge focuses on images, TensorFlow and PyTorch are used more frequently.

After you preprocess and model data with the computing framework and an algorithm, you need to tune parameters. The Alibaba Cloud PAI team provides the AutoML tool for parameter tuning. At the algorithm layer, you can use your own algorithm. At the service layer, the model must be brought online before it can be used in an actual industry environment, and PAI-EAS CMD or PAI-FaissServer may be used for this.

DataScience Atomic Components

DataScience Deep Learning Framework

Currently, the platform has the TensorFlow and PyTorch deep learning frameworks built in for contestants. You can use these frameworks and write code in a Python 3 environment. Deep learning data modeling depends on many third-party libraries, which you can install by running the pip3 install command. Contestants can use vim to develop code. Those who are unfamiliar with vim can use Apache Zeppelin for interactive development. Apache Zeppelin also supports shell operations.
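Before starting a long training job, it can help to confirm which libraries are already importable in the Python 3 environment. The following is a minimal sketch that uses only the standard library; the helper names (missing_packages, pip_command) and the list of required packages are illustrative assumptions, not part of DataScience.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pip_command(names):
    """Build the pip3 command that would install the given packages."""
    return ["pip3", "install", *names] if names else []


if __name__ == "__main__":
    # Hypothetical requirement list: import names, not necessarily pip names.
    required = ["tensorflow", "torch", "numpy"]
    missing = missing_packages(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("Suggested command:", " ".join(pip_command(missing)))
    else:
        print("All required modules are importable.")
```

Note that the name used with import and the name used with pip3 install can differ for some libraries, so treat the suggested command as a starting point only.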

PAI-Alink Machine Learning Algorithm Platform for Unified Streaming and Batch Processing

Although most contestants do not use traditional machine learning algorithms, those who need them can use Alink. Alink supports more than 350 traditional machine learning algorithms, such as K-Means and random forest, that cover the entire lifecycle of machine learning, including data preprocessing, feature engineering, model training, and model evaluation. Alink supports both streaming and offline algorithms, and allows you to build experiments by dragging and dropping components. Alink also supports multiple visualization methods for viewing the experiment results.

The following figure shows an Alink experiment demo. The blue part indicates a streaming algorithm, and the yellow part indicates an offline algorithm.

AutoML

AutoML is a component commonly used in competitions. To get good results, building a model is not enough; you also need to tune the parameters to find the right combination. Alibaba seldom tunes parameters manually, and has embedded AutoML into DataScience for this AI Challenge.

To use AutoML, first create a modeling script. In the script, many parameters need to be tuned, such as max_depth, learning_rate, and train_id. You can use a parser in the code to declare the parameters to tune.

Next, create a script for parameter tuning that imports pai.automl.hop, maps the preceding parameters, and enumerates the values to try. If you do not want to use the enumeration method, you can use the random sampling method: after you specify a range, the platform performs random sampling within the range.

In Step 2 in the following figure, in addition to enumerating the parameters to set, you need to set a metric, such as accuracy or recall, as the evaluation standard. You can also customize a metric. The right side of the figure shows the final parameter tuning results for different parameter combinations. AutoML eliminates manual parameter tuning.
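The two search strategies described above, enumeration and random sampling, can be illustrated with the standard library alone. The sketch below does not use pai.automl.hop, whose API is not shown in this article; it only mimics the idea. The function names, the default values, and toy_metric (a stand-in for a real training run that returns accuracy or recall) are all assumptions made for illustration.

```python
import argparse
import itertools
import random


def parse_args(argv=None):
    """Modeling-script side: declare the tunable parameters with a parser."""
    parser = argparse.ArgumentParser(description="Hypothetical modeling script")
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--train_id", type=str, default="trial-0")
    return parser.parse_args(argv)


def grid(space):
    """Enumeration: yield every combination of the listed values."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_samples(ranges, n, seed=0):
    """Random sampling: draw n parameter sets from (low, high) ranges."""
    rng = random.Random(seed)

    def draw(lo, hi):
        # Integer bounds give integer samples; otherwise sample a float.
        if isinstance(lo, int) and isinstance(hi, int):
            return rng.randint(lo, hi)
        return rng.uniform(lo, hi)

    return [{k: draw(lo, hi) for k, (lo, hi) in ranges.items()} for _ in range(n)]


def to_argv(params, train_id):
    """Map a parameter set onto the modeling script's command-line flags."""
    argv = ["--train_id", train_id]
    for key, value in params.items():
        argv += [f"--{key}", str(value)]
    return argv


def toy_metric(args):
    """Stand-in for a real training run; higher is better."""
    return -abs(args.max_depth - 5) - abs(args.learning_rate - 0.1)


def search(candidates, evaluate):
    """Return (best_params, best_score) over all candidates."""
    best = None
    for i, params in enumerate(candidates):
        score = evaluate(parse_args(to_argv(params, f"trial-{i}")))
        if best is None or score > best[1]:
            best = (params, score)
    return best


if __name__ == "__main__":
    space = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1]}
    print("grid best:", search(grid(space), toy_metric))
    ranges = {"max_depth": (3, 8), "learning_rate": (0.01, 0.3)}
    print("random best:", search(random_samples(ranges, 10), toy_metric))
```

Enumeration grows multiplicatively with the number of values per parameter, which is why random sampling within a range is often the more practical choice when many parameters are tuned at once.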

FaissServer

In machine learning application scenarios that need to calculate vector similarity in real time, FaissServer can quickly calculate the distance between a given vector and other vectors. After you load all vectors into FaissServer and send a gRPC query, FaissServer returns the top N vectors. FaissServer is built into DataScience. You can import generated vectors into FaissServer to build an online top N vector query engine. FaissServer is frequently used for image similarity analysis and querying.
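To make the idea of a top N vector query concrete, here is a small brute-force sketch in plain Python. It is not the FaissServer API and involves no gRPC call; it only shows the computation that such a service answers, assuming Euclidean (L2) distance as the similarity measure. The function names and sample vectors are hypothetical.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_n(query, vectors, n):
    """Return the n (id, distance) pairs closest to the query, nearest first."""
    scored = ((vid, l2_distance(query, vec)) for vid, vec in vectors.items())
    return heapq.nsmallest(n, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical image embeddings keyed by image ID.
    vectors = {
        "img_a": [0.0, 0.0],
        "img_b": [1.0, 0.0],
        "img_c": [5.0, 5.0],
    }
    for vid, dist in top_n([0.9, 0.0], vectors, n=2):
        print(vid, round(dist, 3))
```

A brute-force scan like this compares the query against every stored vector, so its cost grows linearly with the collection size. Index structures such as those in Faiss exist to avoid that full scan on large collections.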

PAI-EAS Model Online Service

EAS may be used in the finals. It focuses on how to better apply models on business terminals, such as mobile phones and Internet of Things (IoT) devices. You can deploy deep learning models as online services by using PAI-EAS CMD, which is embedded in DataScience, and then call RESTful APIs to use these models in your businesses. EAS supports phased release, online service monitoring, version control, and other features.
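Calling a deployed model over a RESTful API typically comes down to an authenticated HTTP POST. The sketch below shows that general pattern with the standard library only. The endpoint URL, the token-in-Authorization-header scheme, and the JSON request and response shapes are assumptions for illustration; check the details of your own deployed service before relying on them.

```python
import json
import urllib.request


def build_predict_request(endpoint, token, payload):
    """Build (but do not send) a JSON POST request for a model service."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        # Assumption: the service authenticates with a token header.
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )


def predict(endpoint, token, payload, timeout=10):
    """Send the request and decode the response, assuming it is JSON."""
    request = build_predict_request(endpoint, token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical values; replace with those of a real deployed service.
    req = build_predict_request(
        "http://example.invalid/api/predict/demo_model",
        "YOUR_SERVICE_TOKEN",
        {"inputs": [[0.1, 0.2, 0.3]]},
    )
    print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the call easy to inspect and test without network access.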

Learn more about Alibaba Cloud E-MapReduce at https://www.alibabacloud.com/products/emapreduce
