Platform for AI (PAI) provides EasyASR, which is an enhanced algorithm framework for speech intelligence. EasyASR provides a variety of features for model training and prediction. You can use EasyASR to train and apply speech recognition models for your speech recognition applications. This topic describes how to use EasyASR for speech recognition in Data Science Workshop (DSW).
Prerequisites
A DSW instance is created and the requirements for software versions are met. For more information, see Create and manage DSW instances and Limits.
We recommend that you use a GPU-accelerated DSW instance.
Background information
In this example, the pre-trained wav2letter-small model is used. PAI also provides the pre-trained wav2letter-base, transformer-small, and transformer-base models for automatic speech recognition (ASR). To use a specific pre-trained model, click the corresponding file names in the following table to download the model files and adjust the code provided in this topic as needed.
| Model | Vocabulary | Configuration file | Model file | Description |
| --- | --- | --- | --- | --- |
| wav2letter-small | | | | The wav2letter series is suitable for scenarios in which low precision is acceptable but high inference speed is required. The wav2letter-base model has more parameters than the wav2letter-small model. |
| wav2letter-base | | | | |
| transformer-small | | | | The transformer series is suitable for scenarios in which low inference speed is acceptable but high precision is required. The transformer-base model has more parameters than the transformer-small model. |
| transformer-base | | | | |
Limits
Take note of the following items that are related to software versions:
Python 3.6 is supported.
TensorFlow 1.12 and PAI-TensorFlow V1.15 are supported.
PyTorch is not supported.
We recommend that you use the tensorflow:1.12PAI-gpu-py36-cu101-ubuntu18.04 or tensorflow:1.15-gpu-py36-cu100-ubuntu18.04 image of DSW.
Procedure
To use EasyASR for speech recognition in DSW, perform the following steps:
Step 1: Prepare data
Download the training data for speech recognition.
Step 2: Build a dataset and train the ASR model
Convert the training data to TFRecord files and train an ASR model.
Step 3: Evaluate and export the ASR model
After the training is complete, evaluate the recognition precision of the model. If you are satisfied with the model, export the model as a SavedModel file and use the file to perform distributed batch predictions.
Step 4: Perform predictions
Use the exported SavedModel file to perform predictions.
Step 1: Prepare data
In this example, the wav2letter-small model, a pre-trained ASR model from the EasyASR public model zoo, is fine-tuned on a subset of THCHS-30, a public Chinese speech dataset. We recommend that you use your own data to train models.
Go to the development environment of Data Science Workshop (DSW).
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspace list page, click the name of the workspace that you want to manage.
In the upper-left corner of the page, select the region where you want to use the service.
In the left-side navigation pane, choose .
Optional: On the Interactive Modeling (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.
Find the DSW instance and click Launch in the Actions column.
In the development environment of DSW, click Notebook in the top navigation bar.
Download data.
In the toolbar in the upper-left corner, click the icon to create a project folder. In this example, the folder is named asr_test.
In the DSW development environment, click Terminal in the top navigation bar. On the Terminal tab, click Create Terminal.
Run the following commands in Terminal. The cd command goes to the project folder that you created, and the wget commands download the demo dataset that is used to train the ASR model:
cd asr_test
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/demo_data.tar.gz
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/sample_asr_data.csv
Run the following commands in Terminal to create a subfolder named data and decompress the demo dataset to the subfolder:
mkdir data
tar xvzf demo_data.tar.gz -C data
Download an ASR model.
Four pre-trained ASR models, wav2letter-small, wav2letter-base, transformer-small, and transformer-base, are provided in the EasyASR public model zoo. The two wav2letter models provide higher inference speed, whereas the two transformer models provide higher precision. In this example, a wav2letter model is used. Run the following commands in Terminal to download the wav2letter-small model:
mkdir wav2letter-small
wget -P wav2letter-small https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/model.ckpt.index
wget -P wav2letter-small https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/model.ckpt.meta
wget -P wav2letter-small https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/model.ckpt.data-00000-of-00001
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/alphabet4k.txt
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/w2lplus-small.py
View the subfolders and files in the project folder asr_test.
The project folder contains the following subfolders and files:
data: the subfolder that stores the speech files that are used for model training. Generally, a speech file for model training is a mono WAV file with a length of up to 15 seconds and a sampling rate of 16,000 Hz.
wav2letter-small: the subfolder that stores the pre-training checkpoints of the model.
alphabet4k.txt: the file that stores the 4K Chinese character vocabulary for the model.
sample_asr_data.csv: the file that stores the paths and annotations of all WAV files. If you want to use custom data, you must separate characters with spaces and sentences with semicolons (;) in an annotation. The characters specified must be in the vocabulary. If a character is not in the vocabulary, replace the character with an asterisk (*).
w2lplus-small.py: the configuration file of the model.
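The annotation style described for sample_asr_data.csv above can be sketched in a few lines of Python. The helper name and the toy vocabulary below are illustrative only, not part of EasyASR:

```python
def format_annotation(sentences, vocab):
    """Format transcript sentences into the annotation style described above:
    characters separated by spaces, sentences separated by semicolons (;),
    and characters that are not in the vocabulary replaced with asterisks (*)."""
    formatted = []
    for sentence in sentences:
        chars = [c if c in vocab else "*" for c in sentence]
        formatted.append(" ".join(chars))
    return ";".join(formatted)

# Toy vocabulary for illustration; a real run would load alphabet4k.txt.
vocab = set("今天天气很好")
print(format_annotation(["今天天气", "很好吗"], vocab))  # 今 天 天 气;很 好 *
```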
You can go to the wav2letter-small folder to view the pre-training checkpoints of the model.
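Before training on custom data, it can help to verify that each speech file meets the requirements described above: a mono WAV file sampled at 16,000 Hz and no longer than 15 seconds. A minimal sketch using Python's standard wave module; the function name and the generated sample file are illustrative only:

```python
import wave

def check_training_wav(path):
    """Check that a WAV file is mono, sampled at 16,000 Hz,
    and at most 15 seconds long."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        duration = w.getnframes() / float(rate)
    return channels == 1 and rate == 16000 and duration <= 15.0

# Example: write a one-second silent mono 16 kHz WAV and validate it.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                    # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)   # one second of silence

print(check_training_wav("sample.wav"))  # True
```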
Step 2: Build a dataset and train the ASR model
Convert the data that you prepare to TFRecord files by using the data conversion feature of EasyASR. To do so, run the following command in the asr_test folder:
easyasr_create_dataset --input_path='sample_asr_data.csv' --output_prefix='tfrecords/'
The command contains the following parameters:
input_path: the name of the CSV file that specifies the training data. The file contains the paths and annotations of all WAV files to be used for the training.
output_prefix: the prefix of the path of the output TFRecord files. In this example, all TFRecord files are exported in the tfrecords folder. You can modify this parameter as required.
Important: Do not omit the forward slash (/) at the end of the path.
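The trailing slash matters because, presumably, output_prefix is prepended to the generated shard file names as a plain string prefix (an assumption about the naming; the shard name below is hypothetical):

```python
# Hypothetical shard name produced by the data conversion step.
shard = "train_0.tfrecord"

# With the trailing slash, the shard lands inside the tfrecords folder.
with_slash = "tfrecords/" + shard
# Without it, the prefix fuses into the file name instead.
without_slash = "tfrecords" + shard

print(with_slash)     # tfrecords/train_0.tfrecord
print(without_slash)  # tfrecordstrain_0.tfrecord
```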
Run the following command in Terminal to train the ASR model:
easyasr_train --config_file='w2lplus-small.py' --log_dir='model_dir' --load_model_ckpt='wav2letter-small/model.ckpt' --vocab_file='alphabet4k.txt' --train_data='tfrecords/train_*.tfrecord'
The command contains the following parameters:
config_file: the configuration file of the model. In this example, w2lplus-small.py, the configuration file of the wav2letter-small model, is used. You can modify this parameter as required.
log_dir: the path of the output model checkpoints. You can modify this parameter as required.
load_model_ckpt: the pre-training checkpoints of the model. In this example, the pre-training checkpoints of the wav2letter-small model are loaded. If you do not specify this parameter, the model is trained from scratch.
vocab_file: the Chinese character vocabulary for the model. If you use a pre-trained wav2letter model, set this parameter to alphabet4k.txt and keep the TXT file unchanged. If you use a pre-trained transformer model, set this parameter to alphabet6k.txt and keep the TXT file unchanged.
train_data: the TFRecord files to be used for the training. The value of this parameter can be a wildcard pattern that matches multiple TFRecord files. You can modify this parameter as required.
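The train_* value shown above matches file names in the style of shell wildcards. Python's fnmatch module applies the same matching rules, which makes it easy to preview which shards a pattern selects. The shard names below are hypothetical examples of files the data conversion step might produce:

```python
import fnmatch

# Hypothetical shard names, as the data conversion step might produce.
shards = [
    "tfrecords/train_0.tfrecord",
    "tfrecords/train_1.tfrecord",
    "tfrecords/eval_0.tfrecord",
]

# Only the training shards match the wildcard pattern.
matched = fnmatch.filter(shards, "tfrecords/train_*.tfrecord")
print(matched)  # ['tfrecords/train_0.tfrecord', 'tfrecords/train_1.tfrecord']
```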
Step 3: Evaluate and export the ASR model
After the training is complete, you can evaluate the recognition precision of the model. You can divide a dataset into a training dataset and an evaluation dataset as needed. The following section provides an example on how to evaluate and export a model.
Run the following command in Terminal to evaluate the recognition precision of the model:
easyasr_eval --config_file='w2lplus-small.py' --checkpoint='model_dir/model.ckpt-1000' --vocab_file='alphabet4k.txt' --eval_data='tfrecords/train_*.tfrecord'
The command contains the following parameters:
config_file: the configuration file of the model. In this example, w2lplus-small.py, the configuration file of the wav2letter-small model, is used. You can modify this parameter as required.
checkpoint: the path of the checkpoints of the model to be evaluated and exported. Multiple checkpoints are saved during the training. You can modify this parameter as required.
vocab_file: the Chinese character vocabulary for the model.
Important: You must use the same vocabulary to train and evaluate a model.
eval_data: the TFRecord files to be used to evaluate the model. The value format of this parameter is the same as that of the train_data parameter.
Export the trained model as a SavedModel file and use the file to perform distributed batch predictions. To do so, run the following command in Terminal to export the model:
easyasr_export --config_file='w2lplus-small.py' --checkpoint='model_dir/model.ckpt-1000' --vocab_file='alphabet4k.txt' --mode='interactive_infer'
The command contains the following parameters:
config_file: the configuration file of the model. In this example, w2lplus-small.py, the configuration file of the wav2letter-small model, is used. You can modify this parameter as required.
checkpoint: the path of the checkpoints of the model to be evaluated and exported. Multiple checkpoints are saved during the training. You can modify this parameter as required.
vocab_file: the Chinese character vocabulary for the model.
mode: the mode in which the model is to be exported. The current version of EasyASR supports only the interactive_infer mode. You do not need to change the parameter value used in the sample command.
You can view the exported model in the asr_test folder. The exported SavedModel file is stored in the export_dir subfolder. Go to the export_dir subfolder to view the exported model files.
Step 4: Perform predictions
You can use the exported SavedModel file to perform predictions. If you use EasyASR in DSW, input and output data are stored in CSV files.
Run the following commands in Terminal to install FFmpeg that you can use for audio decoding:
sudo apt update
sudo apt install ffmpeg
Note: In this example, Ubuntu is used. If you use another operating system, install FFmpeg for that system instead. If FFmpeg is already installed, skip this step.
Run the following command in the asr_test folder in Terminal to download the sample input file:
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/input_predict.csv
Each row in the input file indicates the URL of an audio file.
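A custom input file can be generated with Python's csv module, one audio URL per row. The URLs below are placeholders, as is the output file name; replace them with the locations of your own audio files:

```python
import csv

# Placeholder URLs; replace them with your own publicly accessible audio files.
urls = [
    "https://example.com/audio/clip1.wav",
    "https://example.com/audio/clip2.wav",
]

# Write one URL per row, matching the format of input_predict.csv.
with open("my_input_predict.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for url in urls:
        writer.writerow([url])
```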
Run the following command in Terminal to perform predictions on the input file by using the ASR model that you have trained:
easyasr_predict --input_csv='input_predict.csv' --output_csv='output_predict.csv' --num_features=64 --use_model='w2l' --vocab_file='alphabet4k.txt' --export_dir='export_dir' --num_predict_process=3 --num_preproces=3
The command contains the following parameters:
input_csv: the name of the input file that contains the URLs of audio files. You can modify this parameter as required.
output_csv: the name of the output file to be generated for the predictions. You can specify a custom name; you do not need to create the file in advance.
num_features: the acoustic feature dimension of the model. If you use the pre-trained wav2letter-small or wav2letter-base model, set this parameter to 64. If you use the pre-trained transformer-small or transformer-base model, set this parameter to 80. You can modify this parameter as required.
use_model: the type of the model. Valid values:
w2l: a wav2letter model.
transformer: a transformer model.
In this example, this parameter is set to w2l because the wav2letter-small model is used to perform predictions.
vocab_file: the Chinese character vocabulary for the model.
export_dir: the path of the exported SavedModel file. You can modify this parameter as required.
num_predict_process: the number of threads to be used to perform predictions. You can modify this parameter as required.
num_preproces: the number of threads to be used to download and preprocess audio files. You can modify this parameter as required.