Platform For AI: Use EasyTransfer to develop a text classification model

Last Updated: May 21, 2024

EasyTransfer is designed to help developers develop transfer learning models in natural language processing (NLP) scenarios. This topic uses text classification as an example to describe how to use EasyTransfer to train models, evaluate models, use models to make predictions, export model files, and deploy models in Data Science Workshop (DSW) of Machine Learning Platform for AI (PAI).

Prerequisites

A DSW instance is created and the software version requirements are met. For more information, see Create a DSW instance and Limits.

Note

We recommend that you use a GPU-accelerated DSW instance.

Background information

Transfer learning is a machine learning method that applies knowledge acquired from solving one problem to a different problem. Demand for transfer learning in industrial NLP applications is growing, because adopting conventional machine learning in an emerging industry requires a large investment of manpower and resources to accumulate enough training data. To address this issue, developers can reuse the training data of an existing task to improve learning performance on a new task. PAI provides EasyTransfer, a deep learning framework, to help developers build transfer learning models for NLP applications.

Limits

EasyTransfer supports the following Python and TensorFlow versions:

  • Python: Python 2.7, or Python 3.4 or later.

  • Image: the official image tensorflow:1.12PAI-gpu-py36-cu101-ubuntu18.04.
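
The Python version rule above can be checked before you install or run anything. The following is a minimal sketch, not part of EasyTransfer; the helper name is_supported_python is hypothetical, and it assumes that "later than Python 3.4" includes every later release.

```python
import sys


def is_supported_python(version=None):
    """Return True for Python 2.7, or for Python 3.4 and later.

    `version` is a (major, minor, ...) tuple; it defaults to the running
    interpreter. This helper is illustrative and is not an EasyTransfer API.
    """
    major, minor = tuple(version or sys.version_info)[:2]
    if major == 2:
        return minor == 7
    return (major, minor) >= (3, 4)


if __name__ == "__main__":
    print("Supported interpreter:", is_supported_python())
```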

Step 1: Prepare data

  1. Go to the development environment of Data Science Workshop (DSW).

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the upper-left corner of the page, select the region where you want to use PAI.

    4. In the left-side navigation pane, choose Model Development and Training > Interactive Modeling (DSW).

    5. (Optional) On the Interactive Modeling (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.

    6. Find the DSW instance and click Launch in the Actions column.

  2. In the development environment of DSW, click Terminal in the top navigation bar and follow the on-screen instructions to launch Terminal.

  3. Run the following commands in the terminal to download the sample datasets:

    wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/tutorial/ez_text_classify/zqkd_sample/train.csv
    wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/tutorial/ez_text_classify/zqkd_sample/dev.csv

    Note

    The datasets used in this example are for demonstration only. You may need more data when you train a news classification model.
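
Before training, it can help to look at how many samples each label has. The following Python 3 sketch counts labels in a delimited file. The file layout is an assumption, not something this topic states: the sketch assumes one sample per line, a tab delimiter, no header row, and the label in the second column. Adjust delimiter and label_index if your files differ. The function name count_labels is hypothetical.

```python
import csv
from collections import Counter


def count_labels(lines, delimiter="\t", label_index=1):
    """Count how often each label occurs.

    `lines` is any iterable of text lines, for example an open file.
    The delimiter and column position are assumptions about the data layout.
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Skip blank or short rows instead of failing on them.
        if len(row) > label_index:
            counts[row[label_index]] += 1
    return counts


if __name__ == "__main__":
    with open("train.csv", encoding="utf-8") as f:
        for label, n in count_labels(f).most_common():
            print(label, n)
```

A strongly skewed distribution is one sign that the sample dataset is too small for the labels you care about.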

Step 2: Start a training task in the current directory

Run the following command to start a training task:

easy_transfer_app \
  --mode=train \
  --modelName=text_classify_bert \
  --inputTable="./train.csv,./dev.csv" \
  --inputSchema=content:str:1,label:str:1 \
  --firstSequence=content \
  --labelName=label \
  --labelEnumerateValues="教育,三农,娱乐,健康,美文,搞笑,美食,财经,科技,旅游,汽车,时尚,科学,文化,房产,热点,母婴,家居,体育,国际,育儿,宠物,游戏,健身,职场,读书,艺术,动漫" \
  --sequenceLength=128 \
  --checkpointDir=./classify_models \
  --batchSize=64 \
  --numEpochs=3 \
  --optimizerType=adam \
  --learningRate=3e-5 \
  --advancedParameters='\
    pretrain_model_name_or_path=pai-bert-base-zh \
    '

The following table describes the parameters.

| Parameter | Required | Description | Default value | Type |
| --- | --- | --- | --- | --- |
| mode | Yes | The mode that is used. Valid values: train, evaluate, predict, export. | None | STRING |
| modelName | No | The name of the model. Valid values: text_classify_bert (BERT model for text classification); text_classify_dgcnn (DGCNN model for text classification); text_match_bert (BERT model for text matching); text_match_bert_two_tower (two-tower BERT model for text matching); text_match_bicnn (BiCNN model, a two-tower CNN model); text_match_hcnn (HCNN model); text_match_dam (DAM model); text_match_damplus (DAM+ model); text_classify_cnn (TextCNN model); text_comprehension_bert (BERT model for reading comprehension); text_comprehension_bert_hae (BERT-HAE model); sequence_labeling_bert (BERT model for sequence labeling). | text_match_bert | STRING |
| inputTable | Yes | The input tables for model training. Separate multiple tables with commas (,). Example: ./train.csv,./dev.csv. | None | STRING |
| inputSchema | Yes | The schema of the columns in the input table, in the format Column name:Type:Length. Valid values of Type: int, str, and float. In most cases, Length is 1. If the column is a comma-separated array, Length equals the length of the array. | None | STRING |
| firstSequence | Yes | The column that corresponds to the first text sequence in the input table. | None | STRING |
| labelName | No | The name of the label column in the input table. | Empty string "" | STRING |
| labelEnumerateValues | No | The enumeration values of the labels. Specify them in one of the following ways: list the values directly, separated by commas (,); or specify the path of a TXT file that contains one value per line. | Empty string "" | STRING |
| sequenceLength | No | The maximum sequence length. Valid values: 1 to 512. | 128 | INT |
| checkpointDir | Yes | The directory of the model. Example: ./classify_models. | None | STRING |
| batchSize | No | The size of each training batch. If multiple GPUs are used for model training, this parameter specifies the size of each batch scheduled to each GPU. | 32 | INT |
| numEpochs | No | The number of epochs for model training. | 1 | INT |
| optimizerType | No | The type of optimizer. Valid values: adam, lamb, adagrad, adadelta. | adam | STRING |
| learningRate | No | The learning rate. | 2e-5 | FLOAT |
| advancedParameters | No | Other advanced parameters, which are described in the advanced parameters table. | None | STRING |
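
The inputSchema format (Column name:Type:Length, with columns separated by commas) can be validated before a long training run starts. The following is a minimal sketch of such a check, based only on the format described in this topic. The names Column and parse_input_schema are hypothetical and are not part of EasyTransfer.

```python
from collections import namedtuple

# One parsed column of an inputSchema string.
Column = namedtuple("Column", ["name", "dtype", "length"])

VALID_TYPES = ("int", "str", "float")


def parse_input_schema(schema):
    """Parse a string such as 'content:str:1,label:str:1' into Column tuples.

    Raises ValueError if a field is not in the Column name:Type:Length
    format or uses a type other than int, str, or float.
    """
    columns = []
    for field in schema.split(","):
        parts = field.strip().split(":")
        if len(parts) != 3:
            raise ValueError("expected Column name:Type:Length, got %r" % field)
        name, dtype, length = parts
        if dtype not in VALID_TYPES:
            raise ValueError("unsupported type %r in %r" % (dtype, field))
        columns.append(Column(name, dtype, int(length)))
    return columns


if __name__ == "__main__":
    for column in parse_input_schema("content:str:1,label:str:1"):
        print(column)
```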

The following table describes the advanced parameters.

| Parameter | Required | Description | Default value | Type |
| --- | --- | --- | --- | --- |
| pretrain_model_name_or_path | No | The pre-trained model. You can specify a pre-trained model provided by EasyTransfer or specify the Object Storage Service (OSS) path of a custom pre-trained model. | pai-bert-base-zh | STRING |
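
If you launch training from a script or notebook rather than from the terminal, the command can be assembled as an argument list. The following Python 3.7+ sketch only builds the list; it uses the flags shown in this step and nothing else. The helper name build_command is hypothetical, and joining several advanced parameters with spaces is an assumption, because this topic shows only one advanced parameter.

```python
def build_command(params, advanced=None):
    """Build an argument list for easy_transfer_app.

    `params` maps flag names (without the leading --) to values.
    `advanced` maps advanced parameter names to values; they are joined
    with spaces, which is an assumption about the expected format.
    """
    argv = ["easy_transfer_app"]
    for name, value in params.items():
        argv.append("--{}={}".format(name, value))
    if advanced:
        joined = " ".join("{}={}".format(k, v) for k, v in advanced.items())
        argv.append("--advancedParameters=" + joined)
    return argv


if __name__ == "__main__":
    argv = build_command(
        {
            "mode": "train",
            "modelName": "text_classify_bert",
            "inputTable": "./train.csv,./dev.csv",
            "inputSchema": "content:str:1,label:str:1",
            "firstSequence": "content",
            "labelName": "label",
            "checkpointDir": "./classify_models",
        },
        advanced={"pretrain_model_name_or_path": "pai-bert-base-zh"},
    )
    print(argv)
```

Passing the list to subprocess.run(argv, check=True) avoids shell quoting issues with values such as the comma-separated label list.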

Step 3: Evaluate the model

After you train the model, run the following command to test or evaluate the training result:

easy_transfer_app \
  --mode=evaluate \
  --inputTable=./dev.csv \
  --checkpointPath=./classify_models/model.ckpt-64 \
  --batchSize=10

The following table describes the parameters.

| Parameter | Required | Description | Default value | Type |
| --- | --- | --- | --- | --- |
| mode | Yes | The mode that is used. Valid values: train, evaluate, predict, export. | None | STRING |
| inputTable | Yes | The input table for model evaluation. Separate multiple tables with commas (,). Example: ./dev.csv. Important: The column schemas of the datasets for model training and model evaluation must be the same. | None | STRING |
| checkpointPath | Yes | The path of the checkpoint (CKPT) file for the model. Example: ./classify_models/model.ckpt-32. | None | STRING |
| batchSize | No | The size of each evaluation batch. If multiple GPUs are used, this parameter specifies the size of each batch scheduled to each GPU. | 32 | INT |
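
The checkpointPath value ends in a step number (for example, model.ckpt-64), and that number depends on your data size and training settings. The following sketch picks the checkpoint with the highest step from a directory listing. It assumes the standard TensorFlow layout in which each checkpoint has a file named model.ckpt-<step>.index, and the helper name latest_checkpoint is hypothetical.

```python
import os
import re

# Matches index files such as model.ckpt-64.index and captures the prefix and step.
_CKPT_INDEX = re.compile(r"^(model\.ckpt-(\d+))\.index$")


def latest_checkpoint(filenames):
    """Return the checkpoint prefix with the highest step, or None.

    Steps are compared numerically, so model.ckpt-100 ranks above
    model.ckpt-64.
    """
    best_step, best_prefix = -1, None
    for name in filenames:
        match = _CKPT_INDEX.match(name)
        if match and int(match.group(2)) > best_step:
            best_step, best_prefix = int(match.group(2)), match.group(1)
    return best_prefix


if __name__ == "__main__":
    checkpoint_dir = "./classify_models"
    prefix = latest_checkpoint(os.listdir(checkpoint_dir))
    if prefix:
        print("--checkpointPath=" + os.path.join(checkpoint_dir, prefix))
```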

Step 4: Use the model to make predictions

After you train the model, run the following command to use the model to process a file. The file can be unlabeled.

easy_transfer_app \
  --mode=predict \
  --inputSchema=content:str:1,label:str:1 \
  --inputTable=dev.csv \
  --outputTable=dev.pred.csv \
  --firstSequence=content \
  --appendCols=label \
  --outputSchema=predictions,probabilities,logits \
  --checkpointPath=./classify_models/ \
  --batchSize=100

The following table describes the parameters.

| Parameter | Required | Description | Default value | Type |
| --- | --- | --- | --- | --- |
| mode | Yes | The mode that is used. Valid values: train, evaluate, predict, export. | None | STRING |
| inputTable | Yes | The input table to be processed by the model. Example: ./dev.csv. | None | STRING |
| outputTable | Yes | The output table that stores the prediction result. Example: ./dev.pred.csv. | None | STRING |
| inputSchema | Yes | The schema of the columns in the input table, in the format Column name:Type:Length. Valid values of Type: int, str, and float. In most cases, Length is 1. If the column is a comma-separated array, Length equals the length of the array. | None | STRING |
| firstSequence | Yes | The column that corresponds to the first text sequence in the input table. | None | STRING |
| appendCols | No | The columns to be appended from the input table to the output table. | Empty string "" | STRING |
| outputSchema | No | The types of predicted values that you want the model to output. Separate multiple types with commas (,). Supported types: predictions (for a single-label classification model, the model outputs category IDs, which follow the same order as the enumeration values specified in the labelEnumerateValues parameter; for a multi-label classification model, the model outputs multi-hot vectors whose elements are separated by commas); probabilities (the probabilities of all categories, separated by commas); logits (the logit values of all categories, separated by commas). | predictions | STRING |
| checkpointPath | Yes | The directory of the model. Example: ./classify_models/. | None | STRING |
| batchSize | No | The size of each prediction batch. If multiple GPUs are used, this parameter specifies the size of each batch scheduled to each GPU. | 32 | INT |
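
Because category IDs follow the order of the labelEnumerateValues parameter, a comma-separated probabilities value can be mapped back to a label name. The following is a minimal sketch of that mapping for the single-label case. It works on one probabilities string at a time and makes no assumption about the layout of the output table; the helper name decode_prediction is hypothetical.

```python
def decode_prediction(probabilities, labels):
    """Return (label, probability) for the most likely category.

    `probabilities` is a comma-separated string with one value per category.
    `labels` must be in the same order as labelEnumerateValues.
    """
    probs = [float(p) for p in probabilities.split(",")]
    if len(probs) != len(labels):
        raise ValueError(
            "got %d probabilities for %d labels" % (len(probs), len(labels))
        )
    # Index of the largest probability is the predicted category ID.
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    labels = "教育,三农,娱乐".split(",")
    print(decode_prediction("0.1,0.7,0.2", labels))
```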

Step 5: Export the model files and deploy the model as an online Elastic Algorithm Service (EAS) service

  1. Export the model files.

    By default, the system automatically exports the variables and the saved_model.pb file of the last checkpoint after the model is trained. If you want to export the training results of other checkpoints, run the following command:

    easy_transfer_app \
      --mode=export \
      --exportType=app_model \
      --checkpointPath=./classify_models/model.ckpt-64 \
      --exportDirBase=./export_model \
      --batchSize=100

    The following table describes the parameters.

    | Parameter | Required | Description | Default value | Type |
    | --- | --- | --- | --- | --- |
    | mode | Yes | The mode that is used. Valid values: train, evaluate, predict, export. | None | STRING |
    | exportType | Yes | The type of model files that you want to export. Valid values: app_model (exports the fine-tuned model files); ez_bert_feat (exports the model files that are required by text vectorization components). | None | STRING |
    | checkpointPath | Yes | The path of the checkpoint (CKPT) file for the model. | None | STRING |
    | exportDirBase | Yes | The directory of the exported model files. | None | STRING |
    | batchSize | No | The size of each evaluation batch. If multiple GPUs are used, this parameter specifies the size of each batch scheduled to each GPU. | 32 | INT |

  2. Package the model files.

    Package the exported variables, saved_model.pb, and vocab.txt files and the label_mapping file that is used to customize input. For example, the label_mapping file of a news classification model is label_mapping.json. The label IDs in the file must be of the INT type. The label IDs must be sorted in the same order as the enumerate values specified in the labelEnumerateValues parameter. The following code block shows an example of the label_mapping.json file:

    {"教育": 0,
     "三农": 1,
     ...,
     "动漫": 27}

    You can find the label_mapping.json file in the directory specified in the checkpointDir parameter.

    The following figure shows the files that are packaged.

    [Figure: packaged model files]

  3. Upload the package to OSS and record the OSS path of the package. Example: oss://xxx/your_model.zip.

  4. Deploy the model. For more information, see EasyTransfer Processor.
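
The packaging step can be scripted. The following Python 3 sketch shows the shape of a label_mapping.json file built from a labelEnumerateValues string, and one way to zip a directory of model files. It is illustrative only: the function names are hypothetical, training already writes label_mapping.json to the checkpointDir directory, and the archive layout that the EasyTransfer processor expects is not stated in this topic, so placing the files at the root of the archive is an assumption.

```python
import json
import os
import zipfile


def write_label_mapping(label_values, path):
    """Write {label: INT id} in the order of a labelEnumerateValues string."""
    labels = [value.strip() for value in label_values.split(",")]
    mapping = {label: index for index, label in enumerate(labels)}
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese labels readable in the file.
        json.dump(mapping, f, ensure_ascii=False, indent=1)
    return mapping


def package_model(source_dir, zip_path):
    """Zip every file under source_dir, with paths relative to source_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, source_dir))


if __name__ == "__main__":
    # Assumes ./export_model already holds variables, saved_model.pb,
    # vocab.txt, and label_mapping.json.
    package_model("./export_model", "./your_model.zip")
```

Write the archive to a location outside source_dir so that the ZIP file does not include itself.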