EasyTransfer helps developers build transfer learning models for natural language processing (NLP) scenarios. This topic uses text classification as an example to describe how to use EasyTransfer in Data Science Workshop (DSW) of Machine Learning Platform for AI (PAI) to train a model, evaluate it, use it to make predictions, export the model files, and deploy the model.
Prerequisites
A DSW instance is created and the software version requirements are met. For more information, see Create a DSW instance and Limits.
We recommend that you use a GPU-accelerated DSW instance.
Background information
Transfer learning is a machine learning method that applies knowledge acquired from solving one problem to a different problem. Demand for transfer learning in industrial NLP applications is growing. Applying conventional machine learning in an emerging industry requires a significant investment of manpower and resources to accumulate large volumes of training data. To resolve this issue, developers can reuse the training data of an existing task to improve learning performance on a new task. PAI provides EasyTransfer, a deep learning framework, to help developers build transfer learning models for NLP applications.
Limits
EasyTransfer supports the following Python and TensorFlow versions:
Python: Python 2.7, or Python 3.4 or later.
Image: the official image tensorflow:1.12PAI-gpu-py36-cu101-ubuntu18.04.
Step 1: Prepare data
Go to the development environment of Data Science Workshop (DSW).
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the upper-left corner of the page, select the region where you want to use PAI.
In the left-side navigation pane, choose .
(Optional) On the Interactive Modeling (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.
Find the DSW instance and click Launch in the Actions column.
In the development environment of DSW, click Terminal in the top navigation bar and follow the on-screen instructions to launch Terminal.
Run the following commands in the terminal to download the sample datasets:
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/tutorial/ez_text_classify/zqkd_sample/train.csv
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/tutorial/ez_text_classify/zqkd_sample/dev.csv
Note: The datasets used in this example are only for demonstration. You may need more data when you train a news classification model.
Step 2: Start a training task in the current directory
Run the following command to start a training task:
easy_transfer_app \
--mode=train \
--modelName=text_classify_bert \
--inputTable="./train.csv,./dev.csv" \
--inputSchema=content:str:1,label:str:1 \
--firstSequence=content \
--labelName=label \
--labelEnumerateValues="教育,三农,娱乐,健康,美文,搞笑,美食,财经,科技,旅游,汽车,时尚,科学,文化,房产,热点,母婴,家居,体育,国际,育儿,宠物,游戏,健身,职场,读书,艺术,动漫" \
--sequenceLength=128 \
--checkpointDir=./classify_models \
--batchSize=64 \
--numEpochs=3 \
--optimizerType=adam \
--learningRate=3e-5 \
--advancedParameters='\
pretrain_model_name_or_path=pai-bert-base-zh \
'
The following table describes the parameters.
Parameter | Required | Description | Default value | Type |
mode | Yes | The mode that is used. Valid values: train, evaluate, predict, and export. | None | STRING |
modelName | No | The name of the model. Valid values:
| text_match_bert | STRING |
inputTable | Yes | The input table for model training. Separate multiple tables with commas (,). Example: ./train.csv,./dev.csv. | None | STRING |
inputSchema | Yes | The schema of the columns in the input table. The value must be in the following format: Column name:Type:Length. The following information is used:
| None | STRING |
firstSequence | Yes | The column that corresponds to the first text sequence in the input table. | None | STRING |
labelName | No | The name of the label column in the input table. | Empty string "" | STRING |
labelEnumerateValues | No | The enumerate values of labels. You can specify the values by using one of the following methods:
| Empty string "" | STRING |
sequenceLength | No | The maximum sequence length. Valid values: 1 to 512. | 128 | INT |
checkpointDir | Yes | The directory of the model. Example: ./classify_models. | None | STRING |
batchSize | No | The size of each training batch. If multiple GPUs are used for model training, this parameter specifies the size of each batch scheduled to each GPU. | 32 | INT |
numEpochs | No | The number of epochs for model training. | 1 | INT |
optimizerType | No | The type of optimizer. Valid values:
| adam | STRING |
learningRate | No | The learning rate. | 2e-5 | FLOAT |
advancedParameters | No | Other advanced parameters. For more information, refer to the following table. | None | STRING |
The following table describes the advanced parameters.
Parameter | Required | Description | Default value | Type |
pretrain_model_name_or_path | No | The pre-trained model. You can specify a pre-trained model provided by EasyTransfer or specify the Object Storage Service (OSS) path of a custom pre-trained model. | pai-bert-base-zh | STRING |
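The inputSchema value follows the Column name:Type:Length format, and firstSequence and labelName refer to columns of the input table. A minimal Python sketch for checking these values before you start a long training run could look like the following. Column, parse_input_schema, and check_columns are hypothetical helpers written for this illustration and are not part of EasyTransfer. The sketch assumes that Length is always an integer.

```python
from typing import List, NamedTuple


class Column(NamedTuple):
    """One entry of an inputSchema string (hypothetical helper type)."""
    name: str
    dtype: str
    length: int


def parse_input_schema(schema: str) -> List[Column]:
    """Split 'name:type:length,...' into Column records."""
    columns = []
    for field in schema.split(","):
        parts = field.strip().split(":")
        if len(parts) != 3:
            raise ValueError(f"expected name:type:length, got {field!r}")
        name, dtype, length = parts
        # Assumption: Length is an integer, as in content:str:1.
        columns.append(Column(name, dtype, int(length)))
    return columns


def check_columns(schema: str, first_sequence: str, label_name: str = "") -> None:
    """Raise ValueError if firstSequence or labelName is not declared in the schema."""
    names = {column.name for column in parse_input_schema(schema)}
    for wanted in (first_sequence, label_name):
        if wanted and wanted not in names:
            raise ValueError(f"column {wanted!r} is not in inputSchema")


# Values taken from the training command in this step.
columns = parse_input_schema("content:str:1,label:str:1")
check_columns("content:str:1,label:str:1", "content", "label")
```

The check only compares names inside the command-line values. It does not open the CSV files, so it cannot detect a mismatch between the schema and the actual file contents.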
Step 3: Evaluate the model
After you train the model, run the following command to test or evaluate the training result:
easy_transfer_app \
--mode=evaluate \
--inputTable=./dev.csv \
--checkpointPath=./classify_models/model.ckpt-64 \
--batchSize=10
The following table describes the parameters.
Parameter | Required | Description | Default value | Type |
mode | Yes | The mode that is used. Valid values: train, evaluate, predict, and export. | None | STRING |
inputTable | Yes | The input table for model evaluation. Separate multiple tables with commas (,). Example: ./dev.csv. Important: The column schemas of the datasets for model training and model evaluation must be the same. | None | STRING |
checkpointPath | Yes | The directory of the CKPT file for the model. Example: ./classify_models/model.ckpt-32. | None | STRING |
batchSize | No | The size of each evaluation batch. If multiple GPUs are used, this parameter specifies the size of each batch scheduled to each GPU. | 32 | INT |
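The evaluation command takes a checkpoint path such as ./classify_models/model.ckpt-64, so you need to know which step numbers exist in the directory specified by checkpointDir. The following Python sketch picks the checkpoint that has the highest step number. It assumes that checkpoints are stored in the usual TensorFlow layout, in which each checkpoint has a model.ckpt-<step>.index file. latest_checkpoint_prefix is a hypothetical helper and is not part of EasyTransfer.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: each checkpoint writes a file named model.ckpt-<step>.index.
_CKPT_INDEX = re.compile(r"^(model\.ckpt-(\d+))\.index$")


def latest_checkpoint_prefix(checkpoint_dir: str) -> Optional[str]:
    """Return the checkpoint prefix with the highest step, or None if none exists."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None
    best_step, best_prefix = -1, None
    for path in root.iterdir():
        match = _CKPT_INDEX.match(path.name)
        if match and int(match.group(2)) > best_step:
            best_step = int(match.group(2))
            # Drop the .index suffix to get the value for --checkpointPath.
            best_prefix = str(path.with_name(match.group(1)))
    return best_prefix
```

Calling latest_checkpoint_prefix("./classify_models") after training returns a prefix that you can pass to the checkpointPath parameter, or None if the directory contains no checkpoint. If TensorFlow is importable in your environment, tf.train.latest_checkpoint serves a similar purpose by reading the checkpoint state file in the directory.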
Step 4: Use the model to make predictions
After you train the model, run the following command to use the model to process a file. The file can be unlabeled.
easy_transfer_app \
--mode=predict \
--inputSchema=content:str:1,label:str:1 \
--inputTable=dev.csv \
--outputTable=dev.pred.csv \
--firstSequence=content \
--appendCols=label \
--outputSchema=predictions,probabilities,logits \
--checkpointPath=./classify_models/ \
--batchSize=100
The following table describes the parameters.
Parameter | Required | Description | Default value | Type |
mode | Yes | The mode that is used. Valid values: train, evaluate, predict, and export. | None | STRING |
inputTable | Yes | The input table to be processed by the model. Example: dev.csv. | None | STRING |
outputTable | Yes | The output table that stores the prediction result. Example: dev.pred.csv. | None | STRING |
inputSchema | Yes | The schema of the columns in the input table. The value must be in the following format: Column name:Type:Length. The following information is used:
| None | STRING |
firstSequence | Yes | The column that corresponds to the first text sequence in the input table. | None | STRING |
appendCols | No | The columns to be appended from the input table to the output table. | Empty string "" | STRING |
outputSchema | No | The types of predicted values that you want the model to output. Separate multiple types with commas (,). The following types of predicted values are supported:
| predictions | STRING |
checkpointPath | Yes | The directory of the model. Example: ./classify_models/. | None | STRING |
batchSize | No | The size of each prediction batch. If multiple GPUs are used, this parameter specifies the size of each batch scheduled to each GPU. | 32 | INT |
Step 5: Export the model files and deploy the model as an online Elastic Algorithm Service (EAS) service
Export the model files.
By default, the system automatically exports the variables and the saved_model.pb file of the last checkpoint after the model is trained. If you want to export the training results of other checkpoints, run the following command:
easy_transfer_app \
--mode=export \
--exportType=app_model \
--checkpointPath=./classify_models/model.ckpt-64 \
--exportDirBase=./export_model \
--batchSize=100
The following table describes the parameters.
Parameter | Required | Description | Default value | Type |
mode | Yes | The mode that is used. Valid values: train, evaluate, predict, and export. | None | STRING |
exportType | Yes | The type of model files that you want to export. Valid values: app_model (export finetune model files) and ez_bert_feat (export model files that are required by text vectorization components). | None | STRING |
checkpointPath | Yes | The directory of the CKPT file for the model. | None | STRING |
exportDirBase | Yes | The directory of the exported model files. | None | STRING |
batchSize | No | The size of each evaluation batch. If multiple GPUs are used, this parameter specifies the size of each batch scheduled to each GPU. | 32 | INT |
Package the model files.
Package the exported variables, saved_model.pb, and vocab.txt files and the label_mapping file that is used to customize input. For example, the label_mapping file of a news classification model is label_mapping.json. The label IDs in the file must be of the INT type. The label IDs must be sorted in the same order as the enumerate values specified in the labelEnumerateValues parameter. The following code block shows an example of the label_mapping.json file:
{"教育": 0, "三农": 1, ..., "动漫": 27}
You can find the label_mapping.json file in the directory specified in the checkpointDir parameter.
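If you want to double-check the generated file, or rebuild it for a custom label set, the ordering rule above can be sketched in Python as follows. build_label_mapping and write_label_mapping are hypothetical helpers written for this illustration and are not part of EasyTransfer. The label string is copied from the labelEnumerateValues parameter of the training command in Step 2.

```python
import json
from typing import Dict

# Copied from --labelEnumerateValues in the training command of Step 2.
LABEL_ENUMERATE_VALUES = (
    "教育,三农,娱乐,健康,美文,搞笑,美食,财经,科技,旅游,汽车,时尚,科学,文化,"
    "房产,热点,母婴,家居,体育,国际,育儿,宠物,游戏,健身,职场,读书,艺术,动漫"
)


def build_label_mapping(enumerate_values: str) -> Dict[str, int]:
    """Assign INT IDs 0, 1, 2, ... in the order in which the labels are listed."""
    labels = [value.strip() for value in enumerate_values.split(",") if value.strip()]
    if len(set(labels)) != len(labels):
        raise ValueError("labelEnumerateValues contains duplicate labels")
    return {label: index for index, label in enumerate(labels)}


def write_label_mapping(mapping: Dict[str, int], path: str) -> None:
    """Write the mapping as UTF-8 JSON and keep the Chinese labels readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, ensure_ascii=False)


label_mapping = build_label_mapping(LABEL_ENUMERATE_VALUES)
```

To compare against the file produced by training, load it with json.load and check that it equals label_mapping. A difference usually means that the label order passed to labelEnumerateValues changed between runs.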
The following figure shows the files that are packaged.
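The packaging step can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes that you have gathered saved_model.pb, the variables directory, vocab.txt, and label_mapping.json into one directory, and that the archive should store them at its top level. Follow the layout in the figure if it differs. package_model and REQUIRED_ENTRIES are hypothetical names and are not part of EasyTransfer.

```python
import zipfile
from pathlib import Path
from typing import List

# Assumption: these four entries sit side by side in one directory.
REQUIRED_ENTRIES = ("saved_model.pb", "variables", "vocab.txt", "label_mapping.json")


def package_model(model_dir: str, zip_path: str) -> List[str]:
    """Zip the required entries and return the archive member names."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_ENTRIES if not (root / name).exists()]
    if missing:
        raise FileNotFoundError(f"missing from {root}: {missing}")
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in REQUIRED_ENTRIES:
            entry = root / name
            if entry.is_dir():
                # Include every file below a directory such as variables/.
                files = sorted(p for p in entry.rglob("*") if p.is_file())
            else:
                files = [entry]
            for path in files:
                # Store paths relative to model_dir, with forward slashes.
                arcname = path.relative_to(root).as_posix()
                archive.write(str(path), arcname)
                written.append(arcname)
    return written
```

For example, package_model("./export_model/<your export subdirectory>", "your_model.zip") would create the archive to upload, where the subdirectory name is a placeholder for wherever you collected the files. The function fails early if any required entry is missing, which is cheaper than discovering the problem at deployment time.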
Upload the package to OSS and record the OSS path of the package. Example: oss://xxx/your_model.zip.
Deploy the model. For more information, see EasyTransfer Processor.