API reference - Intelligent Speech Interaction - Alibaba Cloud Documentation Center

The recording file recognition service transcribes recorded audio files. It does not recognize recording files in real time. To recognize a recording file, you must submit an HTTP or HTTPS URL from which the service can download the file. Local file paths are not accepted.

Features

Recognizes single-track recording files in WAV and MP3 formats.
Supports two call methods: polling and callback.
Supports custom linguistic models and hotwords.
Recognizes multiple languages, such as Chinese Mandarin, Chinese dialects, and English.

Call limits

The access permissions on recording files that you want to recognize must be public. The URL of each recording file must use a domain name, not an IP address, and cannot contain spaces.
Valid URL	Invalid URL
https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav	http://127.0.0.1/sample.wav
	D:\files\sample.wav
The maximum file size is 512 MB.
If you use the free trial edition, the server completes the recognition task and returns the recognition result within 24 hours after you send a recording file recognition request. If you use Commercial Edition, the server completes the recognition task and returns the recognition result within 3 hours after you send a recording file recognition request. The server retains the recognition result for 72 hours.
Note
The preceding time limits do not apply if the recording files that you upload within 30 minutes exceed 500 hours in length. If you need to recognize a large amount of audio data, contact the Alibaba Cloud pre-sales staff.
You can use the free trial edition to recognize recording files that are up to 2 hours in length on each calendar day.
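
The URL rules above can be checked locally before a request is sent. The following Python sketch is illustrative only: the function name check_file_link is hypothetical, and the checks cover only the rules stated in this section (HTTP or HTTPS scheme, domain name instead of IP address, no spaces). It does not verify that the file is publicly readable or smaller than 512 MB.

```python
import ipaddress
from typing import List
from urllib.parse import urlsplit


def check_file_link(url: str) -> List[str]:
    """Return a list of problems; an empty list means the URL passes these local checks."""
    problems = []
    if " " in url:
        problems.append("URL contains spaces")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        # Local paths such as D:\files\sample.wav end up here.
        problems.append("URL must use http or https")
        return problems
    host = parts.hostname or ""
    if not host:
        problems.append("URL has no host")
        return problems
    try:
        ipaddress.ip_address(host)
        problems.append("host is an IP address, not a domain name")
    except ValueError:
        pass  # Not an IP literal, so it is treated as a domain name.
    return problems
```

An empty list only means that none of these local rules are violated. The server may still reject the URL for other reasons.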

Procedure

Check the format and audio sampling rate of your recording file. Select an appropriate scenario and model in the Intelligent Speech Interaction console based on your business scenario.
Store the recording file in Alibaba Cloud Object Storage Service (OSS).
If the access permissions on the recording file are public, directly obtain the OSS URL of the recording file. For more information, see Public read object. If the access permissions on the recording file are private, use the SDK to generate an OSS URL that has a validity period. For more information, see Private object.
Note
You can also build a file server and store the recording file on it. To download the recording file from the file server, make sure that the length indicated by the Content-Length field in the HTTP response header is the same as the length of data in the response body. Otherwise, the recording file fails to be downloaded.
Send a recording file recognition request from the client.
If the request is successful, the server returns the task ID. You can use the task ID to poll the recognition result.
Send a request from the client to query the recognition result.
The client queries the recognition result by using the task ID that the server returned for the recognition request. The server retains the recognition result for 72 hours.

API call methods

The recording file recognition service provides an Alibaba Cloud open platform (POP) API that is called in remote procedure call (RPC) style. To call an API operation, the client encapsulates parameters in a request and uses an HTTP method to send the request. The server returns the result in a response. You must store the recording files that you want to recognize on a server and make sure that each file can be accessed by using a URL. We recommend that you store recording files in Alibaba Cloud OSS.

The recording file recognition POP API supports two operations: use the POST method to send a recording file recognition request and use the GET method to query the recording file recognition result.

Operation to send a recording file recognition request:

If you use the polling method, you can send a recording file recognition request and obtain the task ID for subsequent recognition result polling.
If you use the callback method, you can send a recording file recognition request and a callback URL. If the request is successful, the server uses the POST method to send the recognition result to the callback URL. Make sure that the callback URL can receive a POST request.

Note

In earlier versions of the recording file recognition service (2.0 by default), the recognition result obtained by the callback method differs from that obtained by the polling method. The differences lie in the style and fields of the JSON string. In version 4.0, the recording file recognition service updates the recognition result obtained by the callback method to a camelCase JSON string. This produces the same recognition result as that obtained by the polling method.

If you have activated the recording file recognition service without setting the version to 4.0, its version is 2.0 by default. You can continue to use this version. If you are a new user, set the version of the recording file recognition service to 4.0.
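
For the callback method, the endpoint behind the callback URL only needs to accept a POST request and read the result from it. The following Python sketch uses the standard library and is not an official receiver. It assumes that the POST body is the JSON string itself, that replying with HTTP 200 is acceptable to the server, and that port 8080 is free. None of these details are stated in this topic. The names parse_callback, CallbackHandler, and serve are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback(raw: bytes) -> dict:
    """Decode the callback body (assumed to be a UTF-8 JSON string)."""
    return json.loads(raw.decode("utf-8"))


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly as many bytes as the Content-Length header announces.
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_callback(self.rfile.read(length))
        print(payload.get("TaskId"), payload.get("StatusText"))
        self.send_response(200)  # Assumption: a plain 200 acknowledges the callback.
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block forever and handle callbacks; the port is an arbitrary choice."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

Calling serve() starts the receiver. The host must then be published under a domain name, because the callback URL cannot contain an IP address.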

Request parameters:

When you send a recording file recognition request, you must set the request parameters and add them to the request body as a JSON string. The following example shows request parameters in JSON format. The optional valid_times parameter specifies the time periods in an audio track that actually require speech recognition.

{
    "appkey": "your-appkey",
    "file_link": "https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav",
    "auto_split": false,
    "version": "4.0",
    "enable_words": false,
    "enable_sample_rate_adaptive": true,
    "valid_times": [
        {
            "begin_time": 200,
            "end_time": 2000,
            "channel_id": 0
        }
    ]
}

Parameter	Type	Required	Description
appkey	String	Yes	The appkey of your project in the Intelligent Speech Interaction console.
file_link	String	Yes	The URL of the recording file. Make sure that the scenario and model of the project created in the Intelligent Speech Interaction console are suitable for the recording file.
version	String	Yes	The version of the recording file recognition service. Default value: 2.0. Set this parameter to 4.0.
enable_words	Boolean	No	Specifies whether to return the recognition results of words. Default value: false. This parameter takes effect only when the version parameter is set to 4.0.
enable_sample_rate_adaptive	Boolean	No	Specifies whether to automatically downsample an audio file with a sampling rate that is greater than 16,000 Hz. Default value: false. This parameter takes effect only when the version parameter is set to 4.0.
enable_callback	Boolean	No	Specifies whether to enable the callback method. Default value: false.
callback_url	String	No	The callback URL. You must specify this parameter if you set the enable_callback parameter to true. The callback URL can be an HTTP or HTTPS URL. It can contain the domain name, but not the IP address.
auto_split	Boolean	No	Specifies whether to enable automatic track splitting. If you enable automatic track splitting, the server can identify the speaker of each sentence in a conversation between two parties based on the ChannelId parameter in the recognition result of the sentence. Usually, the value of the ChannelId parameter is 1 for the first speaker in the conversation. Only mono audio files with a sampling rate of 8,000 Hz are supported.
enable_unify_post	Boolean	No	Specifies whether to enable post-processing. Default value: false. Note: The auto_split and enable_unify_post parameters cannot both be set to true.
enable_inverse_text_normalization	Boolean	No	Specifies whether to enable inverse text normalization (ITN). Valid values: true and false. Default value: false. If you set this parameter to true, Chinese numerals are converted to Arabic numerals. This parameter takes effect only when the version parameter is set to 4.0 and the enable_unify_post parameter is set to true. Note: ITN is not applied to words.
enable_disfluency	Boolean	No	Specifies whether to enable disfluency detection. Default value: false. This parameter takes effect only when the version parameter is set to 4.0 and the enable_unify_post parameter is set to true.
valid_times	List< ValidTime >	No	The valid time period that truly requires speech recognition in the total length of an audio track.
max_end_silence	Integer	No	The maximum duration of end silence. Default value: 450. Unit: milliseconds.
max_single_segment_time	Integer	No	The maximum duration of a single sentence. Minimum value: 10000. Default value: 20000. Unit: milliseconds.
customization_id	String	No	The ID of the custom linguistic model that is created by using the POP API. This parameter is not specified by default.
class_vocabulary_id	String	No	The ID of the created categorized hotword vocabulary. This parameter is not specified by default.
vocabulary_id	String	No	The ID of the created extensive hotword vocabulary. This parameter is not specified by default.

The following table describes the parameters in the ValidTime object.

Parameter	Type	Required	Description
begin_time	Int	Yes	The start time offset of the valid time period. Unit: milliseconds.
end_time	Int	Yes	The end time offset of the valid time period. Unit: milliseconds.
channel_id	Int	Yes	The sequence number of the audio track to which the setting of the valid time period applies. The value starts from 0.
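
Several constraints in the preceding tables can be enforced when the request body is assembled. The following Python sketch shows one way to do this. The function build_request_body is hypothetical and covers only a subset of the parameters. It rejects the combinations that the tables rule out and fixes the version at 4.0, as recommended above.

```python
import json
from typing import Any, Dict, List, Optional


def build_request_body(
    appkey: str,
    file_link: str,
    *,
    enable_callback: bool = False,
    callback_url: Optional[str] = None,
    auto_split: bool = False,
    enable_unify_post: bool = False,
    max_single_segment_time: Optional[int] = None,
    valid_times: Optional[List[Dict[str, int]]] = None,
) -> str:
    """Return the JSON request body, rejecting combinations the tables rule out."""
    if auto_split and enable_unify_post:
        raise ValueError("auto_split and enable_unify_post cannot both be true")
    if enable_callback and not callback_url:
        raise ValueError("callback_url is required when enable_callback is true")
    if max_single_segment_time is not None and max_single_segment_time < 10000:
        raise ValueError("max_single_segment_time must be at least 10000 ms")
    for period in valid_times or []:
        missing = {"begin_time", "end_time", "channel_id"} - set(period)
        if missing:
            raise ValueError(f"valid_times entry is missing {sorted(missing)}")

    body: Dict[str, Any] = {
        "appkey": appkey,
        "file_link": file_link,
        "version": "4.0",
        "auto_split": auto_split,
        "enable_unify_post": enable_unify_post,
        "enable_callback": enable_callback,
    }
    # Optional parameters are added only when the caller sets them.
    if callback_url:
        body["callback_url"] = callback_url
    if max_single_segment_time is not None:
        body["max_single_segment_time"] = max_single_segment_time
    if valid_times:
        body["valid_times"] = valid_times
    return json.dumps(body)
```

Failing early on the client side avoids spending a request on a body that the server would reject, for example with REQUEST_PARAMETER_INVALID.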

Response parameters:

The server returns a response to the recording file recognition request. The response includes response parameters in the format of a JSON string. For example, the server returns the following response:

{
        "TaskId": "4b56f0c4b7e611e88f34c33c2a60****",
        "RequestId": "E4B183CC-6CFE-411E-A547-D877F7BD****",
        "StatusText": "SUCCESS",
        "StatusCode": 21050000
}

HTTP status code 200 indicates that the request is successful. For more information, see HTTP status codes.

Parameter	Type	Required	Description
TaskId	String	Yes	The ID of the recognition task.
RequestId	String	Yes	The ID of the request. This parameter is used only for debugging.
StatusCode	Int	Yes	The status code.
StatusText	String	Yes	The status message.
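
The client needs the TaskId value from this response for all later queries. The following Python sketch extracts it and treats any status code other than 21050000 as a failure. The function name extract_task_id is hypothetical, and raising RuntimeError is only one possible way to report the failure.

```python
import json

STATUS_SUCCESS = 21050000  # SUCCESS status code listed in this topic


def extract_task_id(response_text: str) -> str:
    """Return the TaskId from a submission response, or raise if the request failed."""
    payload = json.loads(response_text)
    if payload.get("StatusCode") != STATUS_SUCCESS:
        raise RuntimeError(
            f"submission failed: {payload.get('StatusCode')} {payload.get('StatusText')}"
        )
    return payload["TaskId"]
```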

Operation to query the recording file recognition result:

If the recording file recognition request that you send is successful, the server returns the task ID. You can use the task ID to poll the recognition result.

Request parameters:

After the server returns the response to the recording file recognition request, you can use the task ID in the response as a parameter to query the recognition result. When you call the query operation, you must set a polling interval.

Important

The query operation supports up to 100 queries per second (QPS). If the QPS exceeds 100, the following error may be returned: Throttling.User : Request was denied due to user flow control. We recommend that you set a longer polling interval.

Parameter	Type	Required	Description
TaskId	String	Yes	The ID of the recognition task.
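
A polling loop along the following lines keeps the query rate well below the limit. This Python sketch is not tied to any SDK: the query argument is a hypothetical caller-supplied function that sends the GET request for a task ID and returns the parsed JSON response. The 10-second interval and the attempt cap are arbitrary assumptions, not documented values.

```python
import time
from typing import Any, Callable, Dict

PENDING_STATUSES = {"QUEUEING", "RUNNING"}  # status messages that mean "ask again later"


def poll_result(
    query: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval_seconds: float = 10.0,
    max_attempts: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Query until the task leaves the queueing/running states, then return the response."""
    for _ in range(max_attempts):
        response = query(task_id)
        if response.get("StatusText") not in PENDING_STATUSES:
            return response  # SUCCESS or an error status; the caller decides what to do.
        sleep(interval_seconds)
    raise TimeoutError(f"task {task_id} still pending after {max_attempts} queries")
```

The sleep function is injectable so that the loop can be tested without waiting. In production code, the default time.sleep is used.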

Response parameters:

The server returns a response to the query request for the recording file recognition result. The response includes response parameters in the format of a JSON string.

The following sample success response shows the recognition result of the single-track recording file nls-sample-16k.wav:

{
        "TaskId": "d429dd7dd75711e89305ab6170fe****",
        "RequestId": "9240D669-6485-4DCC-896A-F8B31F94****",
        "StatusText": "SUCCESS",
        "BizDuration": 2956,
        "SolveTime": 1540363288472,
        "StatusCode": 21050000,
        "Result": {
                "Sentences": [{
                        "EndTime": 2365,
                        "SilenceDuration": 0,
                        "BeginTime": 340,
                        "Text": "Weather in Beijing",
                        "ChannelId": 0,
                        "SpeechRate": 177,
                        "EmotionValue": 5.0
                }]
        }
}

Assume that you set the enable_callback parameter to true, specify the callback_url parameter, and set the version parameter to 4.0. The following response shows the recognition result that is obtained by the callback method:

{
        "Result": {
                "Sentences": [{
                        "EndTime": 2365,
                        "SilenceDuration": 0,
                        "BeginTime": 340,
                        "Text": "Weather in Beijing",
                        "ChannelId": 0,
                        "SpeechRate": 177,
                        "EmotionValue": 5.0
                }]
        },
        "TaskId": "36d01b244ad811e9952db7bb7ed2****",
        "StatusCode": 21050000,
        "StatusText": "SUCCESS",
        "RequestTime": 1553062810452,
        "SolveTime": 1553062810831,
        "BizDuration": 2956
}

Note

The value of the RequestTime parameter is a timestamp that indicates when the recording file recognition request is sent, in milliseconds. For example, a value of 1553062810452 indicates 14:20:10 on March 20, 2019, UTC+8.
The value of the SolveTime parameter is a timestamp that indicates when the recording file recognition task is completed, in milliseconds.
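
Both timestamps are milliseconds since the Unix epoch, so they can be converted with standard date functions. The following Python sketch reproduces the example above and computes the time between the request and the completion of the task. The helper names ms_to_datetime and processing_ms are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UTC_PLUS_8 = timezone(timedelta(hours=8))


def ms_to_datetime(ms: int, tz: timezone = UTC_PLUS_8) -> datetime:
    """Convert a millisecond Unix timestamp to an aware datetime in the given zone."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=tz) + timedelta(milliseconds=millis)


def processing_ms(payload: dict) -> int:
    """Milliseconds between RequestTime and SolveTime in a callback result."""
    return payload["SolveTime"] - payload["RequestTime"]
```

For the callback sample above, ms_to_datetime(1553062810452) corresponds to 14:20:10 on March 20, 2019 in UTC+8, and processing_ms returns 379.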

The following response shows that the task is queuing:

{
        "TaskId": "c7274235b7e611e88f34c33c2a60****",
        "RequestId": "981AD922-0655-46B0-8C6A-5C836822****",
        "StatusText": "QUEUEING",
        "StatusCode": 21050002
}

The following response shows that the task is running:

{
        "TaskId": "c7274235b7e611e88f34c33c2a60****",
        "RequestId": "8E908ED2-867F-457E-82BF-4756194A****",
        "StatusText": "RUNNING",
        "BizDuration": 0,
        "StatusCode": 21050001
}

The following sample error response shows that the recording file fails to be downloaded:

{
        "TaskId": "4cf25b7eb7e711e88f34c33c2a60****",
        "RequestId": "098BF27C-4CBA-45FF-BD11-3F532F26****",
        "StatusText": "FILE_DOWNLOAD_FAILED",
        "BizDuration": 0,
        "SolveTime": 1536906469146,
        "StatusCode": 41050002
}

Note

For more information, see the error codes and solutions in the "Service status codes" section of this topic.

HTTP status code 200 indicates that the request is successful. For more information, see HTTP status codes.

Parameter	Type	Required	Description
TaskId	String	Yes	The ID of the recognition task.
StatusCode	Int	Yes	The status code.
StatusText	String	Yes	The status message.
RequestId	String	Yes	The ID of the request. This parameter is used for debugging.
Result	Object	Yes	The recognition result object.
Sentences	List< SentenceResult >	Yes	The recognition results of sentences. This parameter is returned only when the value of the StatusText parameter is SUCCESS.
Words	List< WordResult >	No	The recognition results of words. This parameter is returned only when the enable_words parameter is set to true and the version parameter is set to 4.0.
BizDuration	Long	Yes	The total duration of the recording file that is recognized. Unit: milliseconds.
SolveTime	Long	Yes	The timestamp that indicates when the recording file recognition task is completed. Unit: milliseconds.

The following table describes the parameters in the recognition result of each sentence.

Parameter	Type	Required	Description
ChannelId	Int	Yes	The ID of the audio track to which the sentence belongs.
BeginTime	Int	Yes	The start time offset of the sentence. Unit: milliseconds.
EndTime	Int	Yes	The end time offset of the sentence. Unit: milliseconds.
Text	String	Yes	The recognition result of the sentence.
EmotionValue	Int	Yes	The emotion value. The value is equal to the volume decibel value divided by 10. Valid values: [1,10]. A greater value indicates a stronger emotion.
SilenceDuration	Int	Yes	The silence duration between the current and the previous sentences. Unit: seconds.
SpeechRate	Int	Yes	The average speech rate of the sentence. Unit: words per minute.
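
The sentence fields are enough to print a simple time-stamped transcript. The following Python sketch sorts the sentences by BeginTime and labels each line with its ChannelId value. The output layout and the helper names ms_to_clock and format_transcript are illustrative choices, not part of the API.

```python
from typing import Any, Dict, List


def ms_to_clock(ms: int) -> str:
    """Format a millisecond offset as MM:SS.mmm."""
    minutes, remainder = divmod(ms, 60000)
    seconds, millis = divmod(remainder, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def format_transcript(result: Dict[str, Any]) -> List[str]:
    """Return one transcript line per sentence in the Result object."""
    lines = []
    for sentence in sorted(result.get("Sentences", []), key=lambda s: s["BeginTime"]):
        lines.append(
            f"[{ms_to_clock(sentence['BeginTime'])} - {ms_to_clock(sentence['EndTime'])}] "
            f"ch{sentence['ChannelId']}: {sentence['Text']}"
        )
    return lines
```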

Recognition results of words

If the enable_words parameter is set to true and the version parameter is set to 4.0, the server returns the recognition results of words in the response. The recognition results of words obtained by the polling method are the same as those obtained by the callback method. The following response shows the recognition results that are obtained by the polling method:

{
        "StatusCode": 21050000,
        "Result": {
                "Sentences": [{
                        "SilenceDuration": 0,
                        "EmotionValue": 5.0,
                        "ChannelId": 0,
                        "Text": "Weather in Beijing",
                        "BeginTime": 340,
                        "EndTime": 2365,
                        "SpeechRate": 177
                }],
                "Words": [{
                        "ChannelId": 0,
                        "Word": "Weather",
                        "BeginTime": 640,
                        "EndTime": 940
                }, {
                        "ChannelId": 0,
                        "Word": "in",
                        "BeginTime": 940,
                        "EndTime": 1120
                }, {
                        "ChannelId": 0,
                        "Word": "Beijing",
                        "BeginTime": 1120,
                        "EndTime": 2020
                }]
        },
        "SolveTime": 1553236968873,
        "StatusText": "SUCCESS",
        "RequestId": "027B126B-4AC8-4C98-9FEC-A031158F****",
        "TaskId": "b505e78c4c6d11e9a213e11db149****",
        "BizDuration": 2956
}

The following table describes the parameters in the recognition result of each word.

Parameter	Type	Required	Description
BeginTime	Int	Yes	The start time of the word. Unit: milliseconds.
EndTime	Int	Yes	The end time of the word. Unit: milliseconds.
ChannelId	Int	Yes	The ID of the audio track to which the word belongs.
Word	String	Yes	The recognition result of the word.
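
Because words and sentences are returned as two separate lists, a client that needs the words of each sentence has to match them itself. The following Python sketch does this by ChannelId and time range. It assumes that every word lies within the time range of its sentence, which holds for the sample above but is not stated as a guarantee in this topic. The function name words_by_sentence is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def words_by_sentence(result: Dict[str, Any]) -> List[Tuple[str, List[str]]]:
    """Pair each sentence text with the words on the same channel inside its time range."""
    grouped = []
    for sentence in result.get("Sentences", []):
        words = [
            word["Word"]
            for word in result.get("Words", [])
            if word["ChannelId"] == sentence["ChannelId"]
            and sentence["BeginTime"] <= word["BeginTime"]
            and word["EndTime"] <= sentence["EndTime"]
        ]
        grouped.append((sentence["Text"], words))
    return grouped
```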

Service status codes

The following table describes the normal status codes.

Status code	Status message	Description	Solution
21050000	SUCCESS	The request is successful after you use the POST method to send a recording file recognition request or the GET method to query the recording file recognition result.	No solution is required.
21050001	RUNNING	The recording file recognition task is running.	Use the GET method to send the query request for the recognition result later.
21050002	QUEUEING	The recording file recognition task is queuing.	Use the GET method to send the query request for the recognition result later.
21050003	SUCCESS_WITH_NO_VALID_FRAGMENT	The query request for the recognition result is successful, but the server does not detect any speech data.	Check whether the recording file contains speech data or the duration of speech data is too short.

The following table describes the error codes.

Note

Status codes that start with 4 indicate client errors, whereas those that start with 5 indicate server errors.

Status code	Status message	Description	Solution
41050001	USER_BIZDURATION_QUOTA_EXCEED	The total duration of the recording files that you want to recognize exceeds the quota for the day.	If you need to recognize a large amount of audio data, send an email to nls_support@service.aliyun.com.
41050002	FILE_DOWNLOAD_FAILED	The recording file fails to be downloaded.	Check whether the URL of the recording file is correct or whether the recording file can be accessed and downloaded over the Internet.
41050003	FILE_CHECK_FAILED	The format of the recording file is invalid.	Check whether the recording file is a single-track or dual-track file in WAV or MP3 format.
41050004	FILE_TOO_LARGE	The recording file is too large.	Check whether the recording file is larger than 512 MB in size.
41050005	FILE_NORMALIZE_FAILED	The recording file fails to be normalized.	Check whether the recording file is damaged or cannot be played.
41050006	FILE_PARSE_FAILED	The recording file fails to be parsed.	Check whether the recording file is damaged or cannot be played.
41050007	MKV_PARSE_FAILED	The MKV parsing fails.	Check whether the recording file is damaged or cannot be played.
41050008	UNSUPPORTED_SAMPLE_RATE	The audio sampling rate is not supported.	Check whether the audio sampling rate of the recording file is the same as the sampling rate in the automatic speech recognition (ASR) model that is bound to the appkey of your project in the Intelligent Speech Interaction console.
41050009	UNSUPPORTED_ASR_GROUP	The ASR group is not supported.	Check whether the appkey belongs to the same Alibaba Cloud account as the AccessKey pair.
41050010	FILE_TRANS_TASK_EXPIRED	The recording file recognition task expires.	Check whether the task ID exists or expires.
41050011	REQUEST_INVALID_FILE_URL_VALUE	The specified file_link parameter is invalid.	Check whether the file_link parameter is specified in a correct format.
41050012	REQUEST_INVALID_CALLBACK_VALUE	The specified callback_url parameter is invalid.	Check whether the callback_url parameter is specified in a correct format.
41050013	REQUEST_PARAMETER_INVALID	The request parameters are invalid.	Check whether the request body is a valid JSON string.
41050014	REQUEST_EMPTY_APPKEY_VALUE	The appkey parameter is not specified.	Check whether the appkey parameter is specified.
41050015	REQUEST_APPKEY_UNREGISTERED	The specified appkey parameter is invalid.	Check whether the appkey that is indicated by the appkey parameter is valid or whether the appkey belongs to the same Alibaba Cloud account as the specified AccessKey ID.
41050021	RAM_CHECK_FAILED	The RAM user authentication fails.	Check whether the RAM user is authorized to call the Intelligent Speech Interaction API.
41050023	CONTENT_LENGTH_CHECK_FAILED	The specified content-length field is invalid.	When you download the recording file, check whether the length that is indicated by the content-length field in the HTTP response header is the same as the actual length of the recording file.
41050024	FILE_404_NOT_FOUND	The recording file that you want to download does not exist.	Check whether the recording file that you want to download exists.
41050025	FILE_403_FORBIDDEN	You are not authorized to download the recording file.	Check whether you are authorized to download the recording file.
41050026	FILE_SERVER_ERROR	A file server error occurs.	Check whether the server where the recording file is stored works properly.
51050000	INTERNAL_ERROR	An internal error occurs.	If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket.
51050001	VAD_FAILED	The voice activity detection (VAD) fails.	If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket.
51050002	RECOGNIZE_FAILED	The ASR fails.	If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket.
51050003	RECOGNIZE_INTERRUPT	The ASR is interrupted.	If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket.
51050004	OFFER_INTERRUPT	The recognition task is prevented from being written to the queue.	If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket.
51050005	FILE_TRANS_TIMEOUT	The recognition task fails due to a timeout.	If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket.
51050006	FRAGMENT_FAILED	The multi-channel audio data fails to be converted to mono audio data.	If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket.
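
A client can use the status code tables to decide whether to keep polling, process the result, or report an error. The following Python sketch classifies a status code by using the codes listed above and the leading-digit rule for errors. The category names and the function classify_status are hypothetical.

```python
PENDING_CODES = {21050001, 21050002}   # RUNNING, QUEUEING
FINISHED_CODES = {21050000, 21050003}  # SUCCESS, SUCCESS_WITH_NO_VALID_FRAGMENT


def classify_status(code: int) -> str:
    """Map a service status code to a coarse category for client-side handling."""
    if code in PENDING_CODES:
        return "pending"
    if code in FINISHED_CODES:
        return "finished"
    first_digit = str(code)[0]
    if first_digit == "4":
        return "client_error"  # Fix the request or the file before sending again.
    if first_digit == "5":
        return "server_error"  # Occasional occurrences can be ignored per the table.
    return "unknown"
```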

Earlier versions

If you have activated the recording file recognition service without setting the version to 4.0, its version is 2.0 by default. In version 2.0, the recognition result obtained by the callback method differs from that obtained by the polling method. The differences lie in the style and fields of the JSON string. Assume that you set the enable_callback parameter to true and specify the callback_url parameter. The following response shows the recognition result that is obtained by the callback method:

{
        "result": [{
                "begin_time": 340,
                "channel_id": 0,
                "emotion_value": 5.0,
                "end_time": 2365,
                "silence_duration": 0,
                "speech_rate": 177,
                "text": "Weather in Beijing"
        }],
        "task_id": "3f5d4c0c399511e98dc025f34473****",
        "status_code": 21050000,
        "status_text": "SUCCESS",
        "request_time": 1551164878830,
        "solve_time": 1551164879230,
        "biz_duration": 2956
}
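
A client that must handle callbacks from both versions can convert a version 2.0 result into the version 4.0 layout before further processing. The following Python sketch is based only on the two samples in this topic: it renames the snake_case keys and moves the result list to Result.Sentences. Fields that do not appear in the samples may need different handling. The function names snake_to_pascal and upgrade_v2_callback are hypothetical.

```python
from typing import Any, Dict


def snake_to_pascal(name: str) -> str:
    """Convert a key such as begin_time to BeginTime."""
    return "".join(part.capitalize() for part in name.split("_"))


def upgrade_v2_callback(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Reshape a version 2.0 callback result into the version 4.0 layout."""
    upgraded = {snake_to_pascal(key): value for key, value in payload.items() if key != "result"}
    upgraded["Result"] = {
        "Sentences": [
            {snake_to_pascal(key): value for key, value in sentence.items()}
            for sentence in payload.get("result", [])
        ]
    }
    return upgraded
```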
