Platform for AI (PAI) provides EasyASR, which is an enhanced algorithm framework for speech intelligence. EasyASR provides a variety of features for model training and prediction. You can use EasyASR to train and apply speech recognition models for your speech recognition applications. This topic describes how to use EasyASR for speech recognition in Data Science Workshop (DSW).
Prerequisites
A DSW instance is created and the requirements for software versions are met. For more information, see Create and manage DSW instances and Limits.
We recommend that you use a GPU-accelerated DSW instance.
Background information
In this example, the pre-trained wav2letter-small model is used. PAI also provides the pre-trained wav2letter-base, transformer-small, and transformer-base models for automatic speech recognition (ASR). To use a specific pre-trained model, click the corresponding file names in the following table to download the model files and adjust the code provided in this topic as needed.
| Model | Vocabulary | Configuration file | Model file | Description |
| --- | --- | --- | --- | --- |
| wav2letter-small | | | | The wav2letter series is suitable for scenarios in which low precision is acceptable but high inference speed is required. The wav2letter-base model has more parameters than the wav2letter-small model. |
| wav2letter-base | | | | |
| transformer-small | | | | The transformer series is suitable for scenarios in which low inference speed is acceptable but high precision is required. The transformer-base model has more parameters than the transformer-small model. |
| transformer-base | | | | |
Limits
Take note of the following items that are related to software versions:
Python 3.6 is supported.
TensorFlow 1.12 and PAI-TensorFlow V1.15 are supported.
PyTorch is not supported.
We recommend that you use the tensorflow:1.12PAI-gpu-py36-cu101-ubuntu18.04 or tensorflow:1.15-gpu-py36-cu100-ubuntu18.04 image of DSW.
Procedure
To use EasyASR for speech recognition in DSW, perform the following steps:
Step 1: Prepare data
Download the training data for speech recognition.
Step 2: Build a dataset and train the ASR model
Convert the training data to TFRecord files and train an ASR model.
Step 3: Evaluate and export the ASR model
After the training is complete, evaluate the recognition precision of the model. If you are satisfied with the model, export the model as a SavedModel file and use the file to perform distributed batch predictions.
Step 4: Perform predictions
Use the exported SavedModel file to perform predictions.
Step 1: Prepare data
In this example, the wav2letter-small model, a pre-trained ASR model from the EasyASR public model zoo, is fine-tuned on a subset of THCHS-30, a public Chinese speech dataset. We recommend that you use your own data to train models.
Go to the development environment of Data Science Workshop (DSW).
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspace list page, click the name of the workspace that you want to manage.
In the upper-left corner of the page, select the region where you want to use the service.
In the left-side navigation pane, choose .
Optional: On the Interactive Modeling (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.
Find the DSW instance and click Launch in the Actions column.
In the development environment of DSW, click Notebook in the top navigation bar.
Download data.
In the toolbar in the upper-left corner, click the icon to create a project folder. In this example, the folder is named asr_test.
In the DSW development environment, click Terminal in the top navigation bar. On the Terminal tab, click Create Terminal.
Run the following commands in Terminal. The cd command goes to the project folder that you created, and the wget commands download the demo dataset that is used to train the ASR model:
cd asr_test
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/demo_data.tar.gz
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/sample_asr_data.csv
Run the following commands in Terminal to create a subfolder named data and decompress the demo dataset to the subfolder:
mkdir data
tar xvzf demo_data.tar.gz -C data
Download an ASR model.
Four pre-trained ASR models, wav2letter-small, wav2letter-base, transformer-small, and transformer-base, are provided in the EasyASR public model zoo. The two wav2letter models provide higher inference speed, whereas the two transformer models provide higher precision. In this example, a wav2letter model is used. Run the following commands in Terminal to download the wav2letter-small model:
mkdir wav2letter-small
wget -P wav2letter-small https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/model.ckpt.index
wget -P wav2letter-small https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/model.ckpt.meta
wget -P wav2letter-small https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/model.ckpt.data-00000-of-00001
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/alphabet4k.txt
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/public_model_zoo/w2lplus-small/w2lplus-small.py
View the subfolders and files in the project folder asr_test.
The project folder contains the following subfolders and files:
data: the subfolder that stores the speech files that are used for model training. Generally, a speech file for model training is a mono WAV file with a length of up to 15 seconds and a sampling rate of 16,000 Hz.
wav2letter-small: the subfolder that stores the pre-training checkpoints of the model.
alphabet4k.txt: the file that stores the 4K Chinese character vocabulary for the model.
sample_asr_data.csv: the file that stores the paths and annotations of all WAV files. If you want to use custom data, you must separate characters with spaces and sentences with semicolons (;) in an annotation. The characters specified must be in the vocabulary. If a character is not in the vocabulary, replace the character with an asterisk (*).
w2lplus-small.py: the configuration file of the model.
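The annotation style described for sample_asr_data.csv above can be sketched in a few lines of Python. The helper name and the toy vocabulary below are illustrative only, not part of EasyASR:

```python
def format_annotation(sentences, vocab):
    """Format transcript sentences into the annotation style described above:
    characters separated by spaces, sentences separated by semicolons (;),
    and characters that are not in the vocabulary replaced with asterisks (*)."""
    formatted = []
    for sentence in sentences:
        chars = [c if c in vocab else "*" for c in sentence]
        formatted.append(" ".join(chars))
    return ";".join(formatted)

# Toy vocabulary for illustration; a real run would load alphabet4k.txt.
vocab = set("今天天气很好")
print(format_annotation(["今天天气", "很好吗"], vocab))  # 今 天 天 气;很 好 *
```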
You can go to the wav2letter-small folder to view the pre-training checkpoints of the model.
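Before training on custom data, it can help to verify that each speech file meets the requirements described above: a mono WAV file sampled at 16,000 Hz and no longer than 15 seconds. A minimal sketch using Python's standard wave module; the function name and the generated sample file are illustrative only:

```python
import wave

def check_training_wav(path):
    """Check that a WAV file is mono, sampled at 16,000 Hz,
    and at most 15 seconds long."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        duration = w.getnframes() / float(rate)
    return channels == 1 and rate == 16000 and duration <= 15.0

# Example: write a one-second silent mono 16 kHz WAV and validate it.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                    # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)   # one second of silence

print(check_training_wav("sample.wav"))  # True
```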
Step 2: Build a dataset and train the ASR model
Convert the data that you prepare to TFRecord files by using the data conversion feature of EasyASR. To do so, run the following command in the asr_test folder:
easyasr_create_dataset --input_path='sample_asr_data.csv' --output_prefix='tfrecords/'
The command contains the following parameters:
input_path: the name of the CSV file that specifies the training data. The file contains the paths and annotations of all WAV files to be used for the training.
output_prefix: the prefix of the path of the output TFRecord files. In this example, all TFRecord files are exported in the tfrecords folder. You can modify this parameter as required.
Important: Do not omit the forward slash (/) at the end of the path.
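The trailing slash matters because, presumably, output_prefix is prepended to the generated shard file names as a plain string prefix (an assumption about the naming; the shard name below is hypothetical):

```python
# Hypothetical shard name produced by the data conversion step.
shard = "train_0.tfrecord"

# With the trailing slash, the shard lands inside the tfrecords folder.
with_slash = "tfrecords/" + shard
# Without it, the prefix fuses into the file name instead.
without_slash = "tfrecords" + shard

print(with_slash)     # tfrecords/train_0.tfrecord
print(without_slash)  # tfrecordstrain_0.tfrecord
```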
Run the following command in Terminal to train the ASR model:
easyasr_train --config_file='w2lplus-small.py' --log_dir='model_dir' --load_model_ckpt='wav2letter-small/model.ckpt' --vocab_file='alphabet4k.txt' --train_data='tfrecords/train_*.tfrecord'
The command contains the following parameters:
config_file: the configuration file of the model. In this example, w2lplus-small.py, the configuration file of the wav2letter-small model, is used. You can modify this parameter as required.
log_dir: the path of the output model checkpoints. You can modify this parameter as required.
load_model_ckpt: the pre-training checkpoints of the model. In this example, the pre-training checkpoints of the wav2letter-small model are loaded. If you do not specify this parameter, the model is trained from scratch.
vocab_file: the Chinese character vocabulary for the model. If you use a pre-trained wav2letter model, set this parameter to alphabet4k.txt and keep the TXT file unchanged. If you use a pre-trained transformer model, set this parameter to alphabet6k.txt and keep the TXT file unchanged.
train_data: the TFRecord files to be used for the training. The value of this parameter can be a wildcard pattern that matches multiple TFRecord files. You can modify this parameter as required.
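The train_* value shown above matches file names in the style of shell wildcards. Python's fnmatch module applies the same matching rules, which makes it easy to preview which shards a pattern selects. The shard names below are hypothetical examples of files the data conversion step might produce:

```python
import fnmatch

# Hypothetical shard names, as the data conversion step might produce.
shards = [
    "tfrecords/train_0.tfrecord",
    "tfrecords/train_1.tfrecord",
    "tfrecords/eval_0.tfrecord",
]

# Only the training shards match the wildcard pattern.
matched = fnmatch.filter(shards, "tfrecords/train_*.tfrecord")
print(matched)  # ['tfrecords/train_0.tfrecord', 'tfrecords/train_1.tfrecord']
```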
Step 3: Evaluate and export the ASR model
After the training is complete, you can evaluate the recognition precision of the model. You can divide a dataset into a training dataset and an evaluation dataset as needed. The following section provides an example on how to evaluate and export a model.
Run the following command in Terminal to evaluate the recognition precision of the model:
easyasr_eval --config_file='w2lplus-small.py' --checkpoint='model_dir/model.ckpt-1000' --vocab_file='alphabet4k.txt' --eval_data='tfrecords/train_*.tfrecord'
The command contains the following parameters:
config_file: the configuration file of the model. In this example, w2lplus-small.py, the configuration file of the wav2letter-small model, is used. You can modify this parameter as required.
checkpoint: the path of the checkpoints of the model to be evaluated and exported. Multiple checkpoints are saved during the training. You can modify this parameter as required.
vocab_file: the Chinese character vocabulary for the model.
Important: You must use the same vocabulary to train and evaluate a model.
eval_data: the TFRecord files to be used to evaluate the model. The value format of this parameter is the same as that of the train_data parameter.
Export the trained model as a SavedModel file and use the file to perform distributed batch predictions. To do so, run the following command in Terminal to export the model:
easyasr_export --config_file='w2lplus-small.py' --checkpoint='model_dir/model.ckpt-1000' --vocab_file='alphabet4k.txt' --mode='interactive_infer'
The command contains the following parameters:
config_file: the configuration file of the model. In this example, w2lplus-small.py, the configuration file of the wav2letter-small model, is used. You can modify this parameter as required.
checkpoint: the path of the checkpoints of the model to be evaluated and exported. Multiple checkpoints are saved during the training. You can modify this parameter as required.
vocab_file: the Chinese character vocabulary for the model.
mode: the mode in which the model is to be exported. The current version of EasyASR supports only the interactive_infer mode. You do not need to change the parameter value used in the sample command.
You can view the exported model in the asr_test folder. The exported SavedModel file is stored in the export_dir subfolder. Go to the export_dir subfolder to view the exported model files.
Step 4: Perform predictions
You can use the exported SavedModel file to perform predictions. If you use EasyASR in DSW, input and output data are stored in CSV files.
Run the following commands in Terminal to install FFmpeg that you can use for audio decoding:
sudo apt update
sudo apt install ffmpeg
Note: In this example, Ubuntu is used. If you use another operating system, install FFmpeg for that system instead. If FFmpeg is already installed, skip this step.
Run the following command in the asr_test folder in Terminal to download the sample input file:
wget https://pai-audio-open-modelzoo.oss-cn-zhangjiakou.aliyuncs.com/dsw_sample_data/input_predict.csv
Each row in the input file indicates the URL of an audio file.
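A custom input file can be generated with Python's csv module, one audio URL per row. The URLs below are placeholders, as is the output file name; replace them with the locations of your own audio files:

```python
import csv

# Placeholder URLs; replace them with your own publicly accessible audio files.
urls = [
    "https://example.com/audio/clip1.wav",
    "https://example.com/audio/clip2.wav",
]

# Write one URL per row, matching the format of input_predict.csv.
with open("my_input_predict.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for url in urls:
        writer.writerow([url])
```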
Run the following command in Terminal to perform predictions on the input file by using the ASR model that you have trained:
easyasr_predict --input_csv='input_predict.csv' --output_csv='output_predict.csv' --num_features=64 --use_model='w2l' --vocab_file='alphabet4k.txt' --export_dir='export_dir' --num_predict_process=3 --num_preproces=3
The command contains the following parameters:
input_csv: the name of the input file that contains the URLs of audio files. You can modify this parameter as required.
output_csv: the name of the output file to be generated for the predictions. You can specify a custom name; you do not need to create the file in advance.
num_features: the acoustic feature dimension of the model. If you use the pre-trained wav2letter-small or wav2letter-base model, set this parameter to 64. If you use the pre-trained transformer-small or transformer-base model, set this parameter to 80. You can modify this parameter as required.
use_model: the type of the model. Valid values:
w2l: a wav2letter model.
transformer: a transformer model.
In this example, this parameter is set to w2l because the wav2letter-small model is used to perform predictions.
vocab_file: the Chinese character vocabulary for the model.
export_dir: the path of the exported SavedModel file. You can modify this parameter as required.
num_predict_process: the number of threads to be used to perform predictions. You can modify this parameter as required.
num_preproces: the number of threads to be used to download and preprocess audio files. You can modify this parameter as required.