The machine reading comprehension training component provided by Platform for AI (PAI) trains machine reading comprehension (MRC) models to read given text passages and answer questions about them. You can use the trained models to implement text-based intelligent conversation. This topic describes how to configure the component and provides an example of how to use it.
Limits
You can run the machine reading comprehension training component only on the computing resources of Deep Learning Containers (DLC).
Configure the component in Machine Learning Designer
Input ports
| Input port (from left to right) | Data type | Recommended upstream component | Required |
| --- | --- | --- | --- |
| Training data | OSS | Read File Data | Yes |
| Validation data | OSS | Read File Data | Yes |
Component parameters
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Language | The language of the input file. Valid values: zh and en. Default value: zh. |
| Fields Setting | Input Schema | The data schema of each column in the input file. Separate multiple columns with commas (,). Default value: qas_id:str:1,context_text:str:1,question_text:str:1,answer_text:str:1,start_position_character:str:1,title:str:1. |
| Fields Setting | Question Column | The name of the column that contains questions in the input file. Default value: question_text. |
| Fields Setting | Context Column | The name of the column that contains text passages in the input file. Default value: context_text. |
| Fields Setting | Answer Column | The name of the column that contains answers in the input file. Default value: answer_text. |
| Fields Setting | Id Column | The name of the ID column in the input file. Default value: qas_id. |
| Fields Setting | Start Position Column | The name of the column that contains the starting positions of answer spans in the input file. If the answer to a question can be found in the text passage, the starting character position of the answer span is recorded in this column. Default value: start_position_character. |
| Fields Setting | Model Save Path | The path of the Object Storage Service (OSS) bucket that stores the trained or fine-tuned MRC model. |
| Parameters Setting | Batch Size | The number of samples that are processed at a time. The value must be of the INT type. Default value: 4. If the model is trained on multiple workers with multiple GPUs, this parameter specifies the number of samples that are processed by each GPU per batch. |
| Parameters Setting | Max Context Length | The maximum length of a text passage that can be processed. The value must be of the INT type. Default value: 384. |
| Parameters Setting | Max Query Length | The maximum length of a question that can be processed. The value must be of the INT type. Default value: 64. |
| Parameters Setting | Doc Stride | The stride of the sliding window that is used to slice long text passages. The value must be of the INT type. Default value: 128. |
| Parameters Setting | Num Epochs | The total number of training epochs. The value must be of the INT type. Default value: 3. |
| Parameters Setting | Learning Rate | The learning rate that is used during model training. The value must be of the FLOAT type. Default value: 3.5e-5. |
| Parameters Setting | Save Checkpoint Steps | The number of training steps after which the system evaluates the model and saves the optimal model. The value must be of the INT type. Default value: 600. |
| Parameters Setting | Model selection | The name or path of the pre-trained model provided by the system. Valid values: User Defined, hfl/macbert-base-zh, hfl/macbert-large-zh, bert-base-uncased, and bert-large-uncased. Default value: hfl/macbert-base-zh. |
| Parameters Setting | Custom Model Paths | This parameter is available only if you set the Model selection parameter to User Defined. To use a custom pre-trained or fine-tuned model, specify the model parameters in the {A: xxx, B: xxx} format. Separate keys and values with colons (:), and separate multiple parameters with commas (,). |
| Tuning | GPU Machine Type | The instance type of the GPU-accelerated node that you want to use. Default value: gn5-c8g1.2xlarge, which specifies a node with 8 vCPUs, 80 GB of memory, and a single P100 GPU. |
| Tuning | num_GPU_worker | The number of GPUs for each worker. Default value: 1. |
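The Max Context Length and Doc Stride parameters work together: a passage longer than Max Context Length is sliced into overlapping windows, and Doc Stride controls how far each window advances. The following Python sketch illustrates this BERT-style sliding-window scheme at the token level. It is a simplified approximation for intuition only, not the component's actual preprocessing code:

```python
def split_context(tokens, max_context_len=384, doc_stride=128):
    """Slice a long token sequence into overlapping windows.

    Illustrative approximation of BERT-style MRC preprocessing:
    each window starts doc_stride tokens after the previous one,
    so adjacent windows overlap by max_context_len - doc_stride
    tokens and an answer span near a window boundary is still
    fully contained in at least one window.
    """
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_context_len])
        if start + max_context_len >= len(tokens):
            break
        start += doc_stride
    return windows

# With the default values, a 600-token passage yields three windows
# that start at tokens 0, 128, and 256.
windows = split_context(list(range(600)))
print(len(windows))              # 3
print([w[0] for w in windows])   # [0, 128, 256]
```

Increasing Doc Stride reduces the number of windows (and therefore compute) at the cost of less overlap between adjacent windows.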
Output ports
| Output port (from left to right) | Data type | Downstream component |
| --- | --- | --- |
| Model Storage Path | OSS | Machine reading comprehension predict |

The Model Storage Path port outputs the OSS path that you specified for the Model Save Path parameter on the Fields Setting tab. The trained model is stored in this path.
Example
The following figure shows a sample pipeline in which the machine reading comprehension training component is used.
Perform the following steps to configure the component:
Prepare a training dataset and an evaluation dataset and then upload the datasets to the OSS bucket. For more information, see the "Upload an object" section in the Get started by using the OSS console topic.
A dataset can be in the TSV or TEXT format and contains the following columns:
Training dataset: ID column, text column, question column, answer column, start position column, and title column (optional).
Evaluation dataset: ID column, text column, question column, answer column (optional), start position column (optional), and title column (optional).
In this example, a TSV file is used to show how to train a model.
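To make the expected layout concrete, the following Python snippet generates a one-row TSV file whose columns follow the default Input Schema order (qas_id, context_text, question_text, answer_text, start_position_character, title) and computes the start position as the character offset of the answer within the passage. This is an illustrative sketch: the file name and sample text are placeholders, and it assumes the file has no header row because the schema is declared through the Input Schema parameter.

```python
import csv

# Columns in the default Input Schema order.
columns = ["qas_id", "context_text", "question_text",
           "answer_text", "start_position_character", "title"]

context = "The capital of France is Paris."
answer = "Paris"
row = {
    "qas_id": "q0001",
    "context_text": context,
    "question_text": "What is the capital of France?",
    "answer_text": answer,
    # Character offset of the answer span within the passage.
    "start_position_character": str(context.find(answer)),
    "title": "France",
}

# Write a single tab-separated data row without a header.
with open("train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=columns, delimiter="\t")
    writer.writerow(row)

print(row["start_position_character"])  # 25
```

Your real training and evaluation files uploaded to OSS only need to follow the same column layout; the column names themselves are supplied through the Fields Setting tab.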
Use the Read File Data-1 and Read File Data-2 components to read the training dataset and the evaluation dataset. To do so, set the OSS Data Path parameter of the Read File Data-1 component to the OSS path of the training dataset, and set the OSS Data Path parameter of the Read File Data-2 component to the OSS path of the evaluation dataset.
Connect the Read File Data-1 and Read File Data-2 components to the machine reading comprehension training component as upstream nodes, and then configure the machine reading comprehension training component. For more information, see the "Component parameters" section of this topic.
References
The machine reading comprehension predict component allows you to perform batch predictions by using the models trained by the machine reading comprehension training component. For more information, see Machine Reading Comprehension Predict.
For more information about Machine Learning Designer components, see Overview of Machine Learning Designer.
Machine Learning Designer provides various preset algorithm components. You can select a component to process data based on your actual business scenario. For more information, see Component reference: Overview of all components.