You can use the recording file recognition service to recognize recording files. The service does not recognize recording files in real time. To recognize a recording file, you must submit a reachable HTTP or HTTPS URL of the file. A local file path is not accepted.
Features
Recognizes single-track recording files in WAV and MP3 formats.
Supports two call methods: polling and callback.
Supports custom linguistic models and hotwords.
Recognizes multiple languages, such as Chinese Mandarin, Chinese dialects, and English.
Call limits
The access permissions on recording files that you want to recognize must be public. The URL of each recording file can contain the domain name, but not the IP address. In addition, the URL cannot contain spaces.
Valid URL | Invalid URL |
https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav | http://127.0.0.1/sample.wav |
 | D:\files\sample.wav |
The maximum file size is 512 MB.
If you use the free trial edition, the server completes the recognition task and returns the recognition result within 24 hours after you send a recording file recognition request. If you use Commercial Edition, the server completes the recognition task and returns the recognition result within 3 hours after you send a recording file recognition request. The server retains the recognition result for 72 hours.
Note: The preceding time limits do not apply if the recording files that you upload within 30 minutes exceed 500 hours in length. If you need to recognize a large amount of audio data, contact the Alibaba Cloud pre-sales staff.
You can use the free trial edition to recognize recording files that are up to 2 hours in length on each calendar day.
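The URL rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the service: the function name check_file_link and the wording of its messages are assumptions made for illustration. It checks only the scheme, the host, and whitespace. It does not check whether the file is publicly readable or within the size limit.

```python
import ipaddress
from urllib.parse import urlsplit


def check_file_link(url: str) -> list[str]:
    """Return the problems found in url; an empty list means these checks pass."""
    problems = []
    if any(ch.isspace() for ch in url):
        problems.append("URL contains whitespace")
    try:
        parts = urlsplit(url)
    except ValueError:
        return problems + ["URL cannot be parsed"]
    if parts.scheme not in ("http", "https"):
        problems.append("URL must use http or https")
    host = parts.hostname or ""
    if not host:
        problems.append("URL has no host")
    else:
        try:
            ipaddress.ip_address(host)
        except ValueError:
            pass  # not an IP literal, so it is treated as a domain name
        else:
            problems.append("host is an IP address, not a domain name")
    return problems


print(check_file_link("http://127.0.0.1/sample.wav"))
```

A local path such as D:\files\sample.wav fails the scheme and host checks, and the sample OSS URL passes all of them.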
Procedure
a. Check the format and audio sampling rate of your recording file. Select an appropriate scenario and model in the Intelligent Speech Interaction console based on your business scenario.
b. Store the recording file in Alibaba Cloud Object Storage Service (OSS).
If the access permissions on the recording file are public, directly obtain the OSS URL of the recording file. For more information, see Public read object. If the access permissions on the recording file are private, use the SDK to generate an OSS URL that has a validity period. For more information, see Private object.
Note: You can also build a file server and store the recording file on it. To download the recording file from the file server, make sure that the length indicated by the Content-Length field in the HTTP response header is the same as the length of data in the response body. Otherwise, the recording file fails to be downloaded.
c. Send a recording file recognition request from the client.
If the request is successful, the server returns the task ID. You can use the task ID to poll the recognition result.
d. Send a request from the client to query the recognition result.
The client queries the recognition result based on the task ID that is obtained in Step c. The server retains the recognition result for 72 hours.
API call methods
The recording file recognition service provides the Alibaba Cloud pctowap open platform (POP) API that can be called in a remote procedure call (RPC) style. To call an API operation, the client encapsulates parameters in a request and uses an HTTP method to send the request. The server returns the result in a response. You must store recording files that you want to recognize on a server and make sure that each file can be accessed by using a URL. We recommend that you store recording files in Alibaba Cloud OSS.
The recording file recognition POP API supports two operations: use the POST method to send a recording file recognition request and use the GET method to query the recording file recognition result.
Operation to send a recording file recognition request:
If you use the polling method, you can send a recording file recognition request and obtain the task ID for subsequent recognition result polling.
If you use the callback method, you can send a recording file recognition request and a callback URL. If the request is successful, the server uses the POST method to send the recognition result to the callback URL. Make sure that the callback URL can receive a POST request.
Note: In earlier versions of the recording file recognition service (2.0 by default), the recognition result obtained by the callback method differs from that obtained by the polling method. The differences lie in the style and fields of the JSON string. In version 4.0, the recording file recognition service updates the recognition result obtained by the callback method to a camelCase JSON string. This produces the same recognition result as that obtained by the polling method.
If you have activated the recording file recognition service without setting the version to 4.0, its version is 2.0 by default. You can continue to use this version. If you are a new user, set the version of the recording file recognition service to 4.0.
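For the callback method, the callback URL must accept a POST request whose body is the recognition result. The following is a minimal sketch of such a receiver that uses only the Python standard library. The names parse_callback_body, CallbackHandler, and serve are hypothetical, the check for a TaskId field assumes version 4.0 results, and the HTTP status that the service expects in reply is not stated in this topic, so replying with 200 is an assumption.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional, Tuple


def parse_callback_body(body: bytes) -> Tuple[int, Optional[dict]]:
    """Return (HTTP status to reply with, parsed result or None)."""
    try:
        payload = json.loads(body)
    except ValueError:  # malformed JSON or undecodable bytes
        return 400, None
    if not isinstance(payload, dict) or "TaskId" not in payload:
        return 400, None
    return 200, payload


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = parse_callback_body(self.rfile.read(length))
        if payload is not None:
            # Hand the result to your own storage or queue here.
            print(payload["TaskId"], payload.get("StatusText"))
        self.send_response(status)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block and serve callbacks; the port number is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), CallbackHandler).serve_forever()
```

Calling serve() starts the receiver. In a real deployment, the receiver must be reachable at a public HTTP or HTTPS URL that uses a domain name, because the callback_url parameter does not accept an IP address.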
Request parameters:
When you send a recording file recognition request, you must set request parameters and add these parameters in the format of a JSON string to the request body. The following example shows request parameters in JSON format:
{
  "appkey": "your-appkey",
  "file_link": "https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav",
  "auto_split": false,
  "version": "4.0",
  "enable_words": false,
  "enable_sample_rate_adaptive": true,
  // The valid_times parameter specifies the valid time period that truly requires speech recognition in the total length of an audio track. This parameter is optional.
  "valid_times": [
    {
      "begin_time": 200,
      "end_time": 2000,
      "channel_id": 0
    }
  ]
}
Parameter | Type | Required | Description |
appkey | String | Yes | The appkey of your project in the Intelligent Speech Interaction console. |
file_link | String | Yes | The URL of the recording file. Make sure that the scenario and model of the project created in the Intelligent Speech Interaction console are suitable for the recording file. |
version | String | Yes | The version of the recording file recognition service. Default value: 2.0. Set this parameter to 4.0. |
enable_words | Boolean | No | Specifies whether to return the recognition results of words. Default value: false. This parameter takes effect only when the version parameter is set to 4.0. |
enable_sample_rate_adaptive | Boolean | No | Specifies whether to automatically downsample an audio file with a sampling rate that is greater than 16,000 Hz. Default value: false. This parameter takes effect only when the version parameter is set to 4.0. |
enable_callback | Boolean | No | Specifies whether to enable the callback method. Default value: false. |
callback_url | String | No | The callback URL. You must specify this parameter if you set the enable_callback parameter to true. The callback URL can be an HTTP or HTTPS URL. It can contain the domain name, but not the IP address. |
auto_split | Boolean | No | Specifies whether to enable automatic track splitting. If you enable automatic track splitting, the server can identify the speaker of each sentence in a conversation between two parties based on the ChannelId parameter in the recognition result of the sentence. Usually, the value of the ChannelId parameter is 1 for the first speaker in the conversation. Only mono audio files with a sampling rate of 8,000 Hz are supported. |
enable_unify_post | Boolean | No | Specifies whether to enable post-processing. Default value: false. Note: The auto_split and enable_unify_post parameters cannot be both set to true. |
enable_inverse_text_normalization | Boolean | No | Specifies whether to enable inverse text normalization (ITN). Valid values: true and false. Default value: false. If you set this parameter to true, Chinese numerals are converted to Arabic numerals. This parameter takes effect only when the version parameter is set to 4.0 and the enable_unify_post parameter is set to true. Note: ITN is not implemented on words. |
enable_disfluency | Boolean | No | Specifies whether to enable disfluency detection. Default value: false. This parameter takes effect only when the version parameter is set to 4.0 and the enable_unify_post parameter is set to true. |
valid_times | List<ValidTime> | No | The valid time period that truly requires speech recognition in the total length of an audio track. |
max_end_silence | Integer | No | The maximum duration of end silence. Default value: 450. Unit: milliseconds. |
max_single_segment_time | Integer | No | The maximum duration of a single sentence. Minimum value: 10000. Default value: 20000. Unit: milliseconds. |
customization_id | String | No | The ID of the custom linguistic model that is created by using the POP API. This parameter is not specified by default. |
class_vocabulary_id | String | No | The ID of the created categorized hotword vocabulary. This parameter is not specified by default. |
vocabulary_id | String | No | The ID of the created extensive hotword vocabulary. This parameter is not specified by default. |
The following table describes the parameters in the ValidTime object.
Parameter | Type | Required | Description |
begin_time | Int | Yes | The start time offset of the valid time period. Unit: milliseconds. |
end_time | Int | Yes | The end time offset of the valid time period. Unit: milliseconds. |
channel_id | Int | Yes | The sequence number of the audio track to which the setting of the valid time period applies. The value starts from 0. |
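The request parameters carry a few constraints that are easy to violate: callback_url is required when the callback method is enabled, auto_split and enable_unify_post cannot both be true, max_single_segment_time has a minimum of 10000, and every ValidTime entry needs all three fields. The following is a minimal sketch that assembles the JSON request body and enforces those constraints. The function name build_request_body and its keyword arguments are hypothetical, and the sketch covers only a subset of the parameters. Sending the body to the POP API is not shown.

```python
import json
from typing import Optional

VALID_TIME_KEYS = {"begin_time", "end_time", "channel_id"}


def build_request_body(
    appkey: str,
    file_link: str,
    *,
    enable_words: bool = False,
    auto_split: bool = False,
    enable_unify_post: bool = False,
    callback_url: Optional[str] = None,
    max_single_segment_time: Optional[int] = None,
    valid_times: Optional[list] = None,
) -> str:
    """Return the request body as a JSON string, or raise ValueError."""
    if auto_split and enable_unify_post:
        raise ValueError("auto_split and enable_unify_post cannot both be true")
    if max_single_segment_time is not None and max_single_segment_time < 10000:
        raise ValueError("max_single_segment_time must be at least 10000 ms")
    body = {
        "appkey": appkey,
        "file_link": file_link,
        "version": "4.0",  # recommended for new users in this topic
        "enable_words": enable_words,
        "auto_split": auto_split,
        "enable_unify_post": enable_unify_post,
    }
    if callback_url is not None:
        # Passing a callback URL switches the request to the callback method.
        body["enable_callback"] = True
        body["callback_url"] = callback_url
    if max_single_segment_time is not None:
        body["max_single_segment_time"] = max_single_segment_time
    for entry in valid_times or []:
        missing = VALID_TIME_KEYS - entry.keys()
        if missing:
            raise ValueError(f"valid_times entry is missing {sorted(missing)}")
    if valid_times:
        body["valid_times"] = valid_times
    return json.dumps(body)
```

For example, build_request_body("your-appkey", file_link, valid_times=[{"begin_time": 200, "end_time": 2000, "channel_id": 0}]) produces a body equivalent to the sample request, apart from the optional flags that the sketch does not cover.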
Response parameters:
The server returns a response to the recording file recognition request. The response includes response parameters in the format of a JSON string. For example, the server returns the following response:
{
  "TaskId": "4b56f0c4b7e611e88f34c33c2a60****",
  "RequestId": "E4B183CC-6CFE-411E-A547-D877F7BD****",
  "StatusText": "SUCCESS",
  "StatusCode": 21050000
}
HTTP status code 200 indicates that the request is successful. For more information, see HTTP status codes.
Parameter | Type | Required | Description |
TaskId | String | Yes | The ID of the recognition task. |
RequestId | String | Yes | The ID of the request. This parameter is used only for debugging. |
StatusCode | Int | Yes | The status code. |
StatusText | String | Yes | The status message. |
Operation to query the recording file recognition result:
If the recording file recognition request that you send is successful, the server returns the task ID. You can use the task ID to poll the recognition result.
Request parameters:
After the server returns the response to the recording file recognition request, you can use the task ID in the response as a parameter to query the recognition result. When you call the query operation, you must set a polling interval.
Important: The query operation supports up to 100 queries per second (QPS). If the QPS exceeds 100, the following error may be returned:
Throttling.User : Request was denied due to user flow control.
We recommend that you set a longer polling interval.
Parameter | Type | Required | Description |
TaskId | String | Yes | The ID of the recognition task. |
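Polling with a fixed interval keeps the query rate well below the 100 QPS limit. The following is a minimal sketch of such a loop. The query argument is a hypothetical callable that wraps the GET operation and returns the parsed JSON response; the API call itself is not shown. The default interval of 10 seconds and the default timeout of 3 hours (the Commercial Edition completion window) are assumptions, not values required by the service.

```python
import time
from typing import Callable

PENDING = {"RUNNING", "QUEUEING"}  # status messages that mean "ask again later"


def poll_result(
    query: Callable[[str], dict],
    task_id: str,
    interval_s: float = 10.0,
    timeout_s: float = 3 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call query(task_id) until the task leaves the pending states."""
    deadline = time.monotonic() + timeout_s
    while True:
        response = query(task_id)
        if response.get("StatusText") not in PENDING:
            return response  # success or an error; the caller inspects StatusCode
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"task {task_id} is still pending after {timeout_s} s")
        sleep(interval_s)
```

The sleep parameter exists so that the loop can be exercised without waiting. In normal use, leave it at its default.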
Response parameters:
The server returns a response to the query request for the recording file recognition result. The response includes response parameters in the format of a JSON string.
The following sample success response shows the recognition result of the single-track recording file nls-sample-16k.wav:
{
  "TaskId": "d429dd7dd75711e89305ab6170fe****",
  "RequestId": "9240D669-6485-4DCC-896A-F8B31F94****",
  "StatusText": "SUCCESS",
  "BizDuration": 2956,
  "SolveTime": 1540363288472,
  "StatusCode": 21050000,
  "Result": {
    "Sentences": [{
      "EndTime": 2365,
      "SilenceDuration": 0,
      "BeginTime": 340,
      "Text": "Weather in Beijing",
      "ChannelId": 0,
      "SpeechRate": 177,
      "EmotionValue": 5.0
    }]
  }
}
Assume that you set the enable_callback parameter to true, specify the callback_url parameter, and set the version parameter to 4.0. The following response shows the recognition result that is obtained by the callback method:
{
  "Result": {
    "Sentences": [{
      "EndTime": 2365,
      "SilenceDuration": 0,
      "BeginTime": 340,
      "Text": "Weather in Beijing",
      "ChannelId": 0,
      "SpeechRate": 177,
      "EmotionValue": 5.0
    }]
  },
  "TaskId": "36d01b244ad811e9952db7bb7ed2****",
  "StatusCode": 21050000,
  "StatusText": "SUCCESS",
  "RequestTime": 1553062810452,
  "SolveTime": 1553062810831,
  "BizDuration": 2956
}
Note: The value of the RequestTime parameter is a timestamp that indicates when the recording file recognition request is sent, in milliseconds. For example, a value of 1553062810452 indicates 14:20:10 on March 20, 2019, UTC+8.
The value of the SolveTime parameter is a timestamp that indicates when the recording file recognition task is completed, in milliseconds.
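The millisecond timestamps can be converted to readable times with the Python standard library. The following sketch reproduces the example in the note; the helper name ms_to_datetime is hypothetical.

```python
from datetime import datetime, timedelta, timezone

UTC_PLUS_8 = timezone(timedelta(hours=8))


def ms_to_datetime(ms: int, tz: timezone = UTC_PLUS_8) -> datetime:
    """Convert a millisecond Unix timestamp to a timezone-aware datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=tz)


request_ms = 1553062810452  # RequestTime from the sample callback result
solve_ms = 1553062810831    # SolveTime from the same result

print(ms_to_datetime(request_ms).strftime("%Y-%m-%d %H:%M:%S"))  # 2019-03-20 14:20:10
print(solve_ms - request_ms)  # 379 (milliseconds between request and completion)
```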
The following response shows that the task is queuing:
{
  "TaskId": "c7274235b7e611e88f34c33c2a60****",
  "RequestId": "981AD922-0655-46B0-8C6A-5C836822****",
  "StatusText": "QUEUEING",
  "StatusCode": 21050002
}
The following response shows that the task is running:
{
  "TaskId": "c7274235b7e611e88f34c33c2a60****",
  "RequestId": "8E908ED2-867F-457E-82BF-4756194A****",
  "StatusText": "RUNNING",
  "BizDuration": 0,
  "StatusCode": 21050001
}
The following sample error response shows that the recording file fails to be downloaded:
{
  "TaskId": "4cf25b7eb7e711e88f34c33c2a60****",
  "RequestId": "098BF27C-4CBA-45FF-BD11-3F532F26****",
  "StatusText": "FILE_DOWNLOAD_FAILED",
  "BizDuration": 0,
  "SolveTime": 1536906469146,
  "StatusCode": 41050002
}
Note: For more information, see the error codes and solutions in the "Service status codes" section of this topic.
HTTP status code 200 indicates that the request is successful. For more information, see HTTP status codes.
Parameter | Type | Required | Description |
TaskId | String | Yes | The ID of the recognition task. |
StatusCode | Int | Yes | The status code. |
StatusText | String | Yes | The status message. |
RequestId | String | Yes | The ID of the request. This parameter is used for debugging. |
Result | Object | Yes | The recognition result object. |
Sentences | List<SentenceResult> | Yes | The recognition results of sentences. This parameter is returned only when the value of the StatusText parameter is SUCCESS. |
Words | List<WordResult> | No | The recognition results of words. This parameter is returned only when the enable_words parameter is set to true and the version parameter is set to 4.0. |
BizDuration | Long | Yes | The total duration of the recording file that is recognized. Unit: milliseconds. |
SolveTime | Long | Yes | The timestamp that indicates when the recording file recognition task is completed. Unit: milliseconds. |
The following table describes the parameters in the recognition result of each sentence.
Parameter | Type | Required | Description |
ChannelId | Int | Yes | The ID of the audio track to which the sentence belongs. |
BeginTime | Int | Yes | The start time offset of the sentence. Unit: milliseconds. |
EndTime | Int | Yes | The end time offset of the sentence. Unit: milliseconds. |
Text | String | Yes | The recognition result of the sentence. |
EmotionValue | Int | Yes | The emotion value. The value is equal to the volume decibel value divided by 10. Valid values: [1,10]. A greater value indicates a stronger emotion. |
SilenceDuration | Int | Yes | The silence duration between the current and the previous sentences. Unit: seconds. |
SpeechRate | Int | Yes | The average speech rate of the sentence. Unit: words per minute. |
Recognition results of words
If the enable_words parameter is set to true and the version parameter is set to 4.0, the server returns the recognition results of words in the response. The recognition results of words obtained by the polling method are the same as those obtained by the callback method. The following response shows the recognition results that are obtained by the polling method:
{
  "StatusCode": 21050000,
  "Result": {
    "Sentences": [{
      "SilenceDuration": 0,
      "EmotionValue": 5.0,
      "ChannelId": 0,
      "Text": "Weather in Beijing",
      "BeginTime": 340,
      "EndTime": 2365,
      "SpeechRate": 177
    }],
    "Words": [{
      "ChannelId": 0,
      "Word": "Weather",
      "BeginTime": 640,
      "EndTime": 940
    }, {
      "ChannelId": 0,
      "Word": "in",
      "BeginTime": 940,
      "EndTime": 1120
    }, {
      "ChannelId": 0,
      "Word": "Beijing",
      "BeginTime": 1120,
      "EndTime": 2020
    }]
  },
  "SolveTime": 1553236968873,
  "StatusText": "SUCCESS",
  "RequestId": "027B126B-4AC8-4C98-9FEC-A031158F****",
  "TaskId": "b505e78c4c6d11e9a213e11db149****",
  "BizDuration": 2956
}
The following table describes the parameters in the recognition result of each word.
Parameter | Type | Required | Description |
BeginTime | Int | Yes | The start time of the word. Unit: milliseconds. |
EndTime | Int | Yes | The end time of the word. Unit: milliseconds. |
ChannelId | Int | Yes | The ID of the audio track to which the word belongs. |
Word | String | Yes | The recognition result of the word. |
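Because sentences and words are returned as two separate lists, a client that needs word timings per sentence has to join them. The following is a minimal sketch that matches words to sentences by ChannelId and time range. It assumes that each word falls entirely within the time range of its sentence, which holds for the sample response but is not stated in this topic. The function name words_by_sentence is hypothetical.

```python
def words_by_sentence(result: dict) -> list:
    """Pair each sentence text with the words inside its time range and track."""
    grouped = []
    for sentence in result.get("Sentences", []):
        words = [
            word["Word"]
            for word in result.get("Words", [])
            if word["ChannelId"] == sentence["ChannelId"]
            and sentence["BeginTime"] <= word["BeginTime"]
            and word["EndTime"] <= sentence["EndTime"]
        ]
        grouped.append((sentence["Text"], words))
    return grouped


# The Result object from the sample response, reduced to the fields that are used.
sample_result = {
    "Sentences": [
        {"ChannelId": 0, "Text": "Weather in Beijing", "BeginTime": 340, "EndTime": 2365}
    ],
    "Words": [
        {"ChannelId": 0, "Word": "Weather", "BeginTime": 640, "EndTime": 940},
        {"ChannelId": 0, "Word": "in", "BeginTime": 940, "EndTime": 1120},
        {"ChannelId": 0, "Word": "Beijing", "BeginTime": 1120, "EndTime": 2020},
    ],
}

print(words_by_sentence(sample_result))
# [('Weather in Beijing', ['Weather', 'in', 'Beijing'])]
```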
Service status codes
The following table describes the normal status codes.
Status code | Status message | Description | Solution |
21050000 | SUCCESS | The request is successful after you use the POST method to send a recording file recognition request or the GET method to query the recording file recognition result. | No solution is required. |
21050001 | RUNNING | The recording file recognition task is running. | Use the GET method to send the query request for the recognition result later. |
21050002 | QUEUEING | The recording file recognition task is queuing. | Use the GET method to send the query request for the recognition result later. |
21050003 | SUCCESS_WITH_NO_VALID_FRAGMENT | The query request for the recognition result is successful, but the server does not detect any speech data. | Check whether the recording file contains speech data or the duration of speech data is too short. |
The following table describes the error codes. Status codes that start with 4 indicate client errors, whereas those that start with 5 indicate server errors.
Status code | Status message | Description | Solution |
41050001 | USER_BIZDURATION_QUOTA_EXCEED | The total duration of the recording files that you want to recognize exceeds the quota for the day. | If you need to recognize a large amount of audio data, send an email to nls_support@service.aliyun.com. |
41050002 | FILE_DOWNLOAD_FAILED | The recording file fails to be downloaded. | Check whether the URL of the recording file is correct or whether the recording file can be accessed and downloaded over the Internet. |
41050003 | FILE_CHECK_FAILED | The format of the recording file is invalid. | Check whether the recording file is a single-track or dual-track file in WAV or MP3 format. |
41050004 | FILE_TOO_LARGE | The recording file is too large. | Check whether the recording file is larger than 512 MB in size. |
41050005 | FILE_NORMALIZE_FAILED | The recording file fails to be normalized. | Check whether the recording file is damaged or cannot be played. |
41050006 | FILE_PARSE_FAILED | The recording file fails to be parsed. | Check whether the recording file is damaged or cannot be played. |
41050007 | MKV_PARSE_FAILED | The MKV parsing fails. | Check whether the recording file is damaged or cannot be played. |
41050008 | UNSUPPORTED_SAMPLE_RATE | The audio sampling rate is not supported. | Check whether the audio sampling rate of the recording file is the same as the sampling rate in the automatic speech recognition (ASR) model that is bound to the appkey of your project in the Intelligent Speech Interaction console. |
41050009 | UNSUPPORTED_ASR_GROUP | The ASR group is not supported. | Check whether the appkey belongs to the same Alibaba Cloud account as the AccessKey pair. |
41050010 | FILE_TRANS_TASK_EXPIRED | The recording file recognition task expires. | Check whether the task ID exists or expires. |
41050011 | REQUEST_INVALID_FILE_URL_VALUE | The specified file_link parameter is invalid. | Check whether the file_link parameter is specified in a correct format. |
41050012 | REQUEST_INVALID_CALLBACK_VALUE | The specified callback_url parameter is invalid. | Check whether the callback_url parameter is specified in a correct format. |
41050013 | REQUEST_PARAMETER_INVALID | The request parameters are invalid. | Check whether the request body is a valid JSON string. |
41050014 | REQUEST_EMPTY_APPKEY_VALUE | The appkey parameter is not specified. | Check whether the appkey parameter is specified. |
41050015 | REQUEST_APPKEY_UNREGISTERED | The specified appkey parameter is invalid. | Check whether the appkey that is indicated by the appkey parameter is valid or whether the appkey belongs to the same Alibaba Cloud account as the specified AccessKey ID. |
41050021 | RAM_CHECK_FAILED | The RAM user authentication fails. | Check whether the RAM user is authorized to call the Intelligent Speech Interaction API. |
41050023 | CONTENT_LENGTH_CHECK_FAILED | The specified content-length field is invalid. | When you download the recording file, check whether the length that is indicated by the content-length field in the HTTP response header is the same as the actual length of the recording file. |
41050024 | FILE_404_NOT_FOUND | The recording file that you want to download does not exist. | Check whether the recording file that you want to download exists. |
41050025 | FILE_403_FORBIDDEN | You are not authorized to download the recording file. | Check whether you are authorized to download the recording file. |
41050026 | FILE_SERVER_ERROR | A file server error occurs. | Check whether the server where the recording file is stored works properly. |
51050000 | INTERNAL_ERROR | An internal error occurs. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
51050001 | VAD_FAILED | The voice activity detection (VAD) fails. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
51050002 | RECOGNIZE_FAILED | The ASR fails. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
51050003 | RECOGNIZE_INTERRUPT | The ASR is interrupted. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
51050004 | OFFER_INTERRUPT | The recognition task is prevented from being written to the queue. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
51050005 | FILE_TRANS_TIMEOUT | The recognition task fails due to a timeout. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
51050006 | FRAGMENT_FAILED | The multi-channel audio data fails to be converted to mono audio data. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
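The status codes fall into a small number of groups that a client usually handles in the same way. The following is a minimal sketch that maps a status code to one of those groups. The function name classify_status and the group labels are hypothetical; only the code values and the meaning of the leading digit come from the tables above.

```python
def classify_status(status_code: int) -> str:
    """Map a service status code to a coarse handling category."""
    if status_code == 21050000:
        return "success"
    if status_code in (21050001, 21050002):
        return "pending"  # RUNNING or QUEUEING: query again later
    if status_code == 21050003:
        return "success_no_speech"  # the request succeeded but no speech was detected
    leading_digit = str(status_code)[0]
    if leading_digit == "4":
        return "client_error"  # fix the request or the recording file
    if leading_digit == "5":
        return "server_error"  # occasional occurrences can be ignored
    return "unknown"


print(classify_status(41050002))  # client_error
```

A client might resubmit a task once after a server error and stop immediately on a client error, but that policy is a design choice and not something that the service prescribes.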
Earlier versions
If you have activated the recording file recognition service without setting the version to 4.0, its version is 2.0 by default. In version 2.0, the recognition result obtained by the callback method differs from that obtained by the polling method. The differences lie in the style and fields of the JSON string. Assume that you set the enable_callback parameter to true and specify the callback_url parameter. The following response shows the recognition result that is obtained by the callback method:
{
"result": [{
"begin_time": 340,
"channel_id": 0,
"emotion_value": 5.0,
"end_time": 2365,
"silence_duration": 0,
"speech_rate": 177,
"text": "Weather in Beijing"
}],
"task_id": "3f5d4c0c399511e98dc025f34473****",
"status_code": 21050000,
"status_text": "SUCCESS",
"request_time": 1551164878830,
"solve_time": 1551164879230,
"biz_duration": 2956
}
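A client that still receives version 2.0 callback results can convert them to the version 4.0 layout so that one code path handles both. The following is a minimal sketch that is based only on a field-by-field comparison of the two sample callback results in this topic. The function names are hypothetical, and fields that do not appear in the samples are not covered.

```python
def to_upper_camel(name: str) -> str:
    """Convert a snake_case key such as begin_time to BeginTime."""
    return "".join(part.capitalize() for part in name.split("_"))


def upgrade_callback_payload(legacy: dict) -> dict:
    """Rewrite a version 2.0 callback result in the version 4.0 layout."""
    upgraded = {
        to_upper_camel(key): value
        for key, value in legacy.items()
        if key != "result"
    }
    # Version 2.0 puts the sentence list directly under "result";
    # version 4.0 nests it under Result.Sentences.
    upgraded["Result"] = {
        "Sentences": [
            {to_upper_camel(key): value for key, value in sentence.items()}
            for sentence in legacy.get("result", [])
        ]
    }
    return upgraded


legacy_sample = {
    "result": [{"begin_time": 340, "channel_id": 0, "text": "Weather in Beijing"}],
    "task_id": "3f5d4c0c399511e98dc025f34473****",
    "status_code": 21050000,
    "status_text": "SUCCESS",
}
print(upgrade_callback_payload(legacy_sample)["Result"]["Sentences"][0]["BeginTime"])  # 340
```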