The real-time speech recognition service provides the Natural User Interaction (NUI) SDK for mobile clients to recognize long-duration speech data streams. The NUI SDK applies to uninterrupted speech recognition scenarios such as conference speeches and live streaming.
Description
Compared with common SDKs, the NUI SDK is smaller in size and supports more comprehensive status management. It provides complete speech processing capabilities, can also serve as an atomic SDK to meet diverse user requirements, and uses a unified API.
Features
Supports pulse-code modulation (PCM) encoded 16-bit mono audio streams.
Supports the audio sampling rates of 8,000 Hz and 16,000 Hz.
Allows you to specify whether to return intermediate results, whether to add punctuation marks during post-processing, and whether to convert Chinese numerals to Arabic numerals.
Allows you to select linguistic models to recognize speech in different languages when you manage projects in the Intelligent Speech Interaction console. For more information, see Manage projects.
Endpoints
Access type | Description | URL |
External access from the Internet | This endpoint allows you to access the real-time speech recognition service from any host over the Internet. By default, the Internet access URL is built in the SDK. | wss://nls-gateway-ap-southeast-1.aliyuncs.com/ws/v1 |
Interaction process

Note
The server adds the task_id parameter to the response header for all responses to indicate the ID of the recognition task. You can record the value of this parameter. If an error occurs, you can submit a ticket to report the task ID and error message.
1. Authenticate the client and initialize the SDK
To establish a WebSocket connection with the server, the client must use a token for authentication. For more information about how to obtain the token, see Obtain an access token.
The following table describes the parameters used for authentication and initialization; a code sketch follows the table.
Parameter | Type | Required | Description |
workspace | String | Yes | The working directory from which the SDK reads the configuration file. |
app_key | String | Yes | The appkey of your project created in the Intelligent Speech Interaction console. |
token | String | Yes | The token provided as the credential for you to use Intelligent Speech Interaction. Make sure that the token is valid. You can set the token when you initialize the SDK and update the token when you set the request parameters. |
device_id | String | Yes | The unique identifier of the device, for example, the media access control (MAC) address, serial number, or pseudo unique ID of the device. |
debug_path | String | No | The directory where audio files generated during debugging are stored. If the save_log parameter is set to true when you initialize the SDK, intermediate results are stored in this directory. |
save_wav | String | No | This parameter is valid only if the save_log parameter is set to true when you initialize the SDK. It specifies whether to store audio files generated during debugging in the directory specified by the debug_path parameter. Make sure that the directory is writable. |
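For reference, the following is a minimal sketch, in Java for Android, of how these initialization parameters might be assembled before they are passed to the SDK. The org.json classes are part of the Android platform; all values shown are placeholders, and the way the resulting string is handed to the SDK depends on your SDK version.

import org.json.JSONException;
import org.json.JSONObject;

public class NuiInitParams {
    // Builds the authentication and initialization parameters described in the preceding table.
    // Every value below is a placeholder (an assumption); replace it with your own.
    public static String build() throws JSONException {
        JSONObject params = new JSONObject();
        params.put("workspace", "/sdcard/nui/workspace"); // directory that contains the SDK configuration file
        params.put("app_key", "yourAppkey");              // appkey of your project in the console
        params.put("token", "yourToken");                 // valid access token; see Obtain an access token
        params.put("device_id", "yourDeviceId");          // unique device identifier, such as the MAC address
        params.put("debug_path", "/sdcard/nui/debug");    // optional: directory for debugging audio files
        params.put("save_wav", "true");                   // optional: takes effect only when save_log is true
        return params.toString();
    }
}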
2. Send a request to use the real-time speech recognition service
You must set the request parameters for the client to send a service request. You can set the request parameters in JSON format by using the setParams method of the SDK. The parameter configuration applies to all service requests. The following table describes the request parameters; a sketch that assembles them follows the two tables.
Parameter | Type | Required | Description |
appkey | String | No | The appkey of your project created in the Intelligent Speech Interaction console. This parameter is generally set when you initialize the SDK. |
token | String | No | The token provided as the credential for you to use Intelligent Speech Interaction. You can update the token as required by setting this parameter. |
service_type | Integer | Yes | The type of speech service to be requested. Set this parameter to 4, which indicates the real-time speech recognition service. |
direct_ip | String | No | The IP address that is resolved from the Domain Name System (DNS) domain name. The client completes the resolution and uses the obtained IP address to access the service. |
nls_config | JsonObject | No | The service parameters. |
The following table describes the parameters in the nls_config parameter.
Parameter | Type | Required | Description |
sr_format | String | No | The audio encoding format. The real-time speech recognition service supports the Opus and PCM formats. Default value: OPUS. Note: This parameter must be set to PCM if the sample_rate parameter is set to 8000. |
sample_rate | Integer | No | The audio sampling rate. Unit: Hz. Default value: 16000. After you set this parameter, you must specify a model or scene that is applicable to the audio sampling rate for your project in the Intelligent Speech Interaction console. |
enable_intermediate_result | Boolean | No | Specifies whether to return intermediate results. Default value: false. |
enable_punctuation_prediction | Boolean | No | Specifies whether to add punctuation marks during post-processing. Default value: false. |
enable_inverse_text_normalization | Boolean | No | Specifies whether to enable inverse text normalization (ITN) during post-processing. Valid values: true and false. Default value: false. If you set this parameter to true, Chinese numerals are converted to Arabic numerals. Note: ITN is not implemented on words. |
customization_id | String | No | The ID of the custom speech training model. |
vocabulary_id | String | No | The vocabulary ID of custom extensive hotwords. |
max_sentence_silence | Integer | No | The threshold for detecting the end of a sentence. If the silence duration exceeds the specified threshold, the system determines the end of a sentence. Unit: milliseconds. Valid values: 200 to 2000. Default value: 800. |
enable_words | Boolean | No | Specifies whether to return information about words. Default value: false. |
enable_ignore_sentence_timeout | Boolean | No | Specifies whether to ignore the recognition timeout of a single sentence in real-time speech recognition. Default value: false. |
disfluency | Boolean | No | Specifies whether to enable disfluency detection. Default value: false. |
vad_model | String | No | The ID of the voice activity detection (VAD) model used by the server. |
speech_noise_threshold | Float | No | The threshold for recognizing audio streams as noise. Valid values: -1 to 1. The closer the value is to -1, the more likely an audio stream is recognized as normal speech, so more noise is passed to the service and processed. The closer the value is to 1, the more likely an audio stream is recognized as noise, so more normal speech is ignored by the service. Note: This is an advanced parameter. Adjust its value with caution and run a test after each adjustment. |
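As a sketch, the request parameters described above might be assembled and passed to the setParams method as follows. Only setParams is named in this topic; the nui object in the usage comment and the parameter values are assumptions.

import org.json.JSONException;
import org.json.JSONObject;

public class TranscriberParams {
    // Builds the request parameters for the real-time speech recognition service.
    public static String build(String token) throws JSONException {
        JSONObject nlsConfig = new JSONObject();
        nlsConfig.put("sr_format", "opus");                       // audio encoding format: Opus or PCM
        nlsConfig.put("sample_rate", 16000);                      // 8000 requires sr_format to be set to PCM
        nlsConfig.put("enable_intermediate_result", true);        // report EVENT_ASR_PARTIAL_RESULT events
        nlsConfig.put("enable_punctuation_prediction", true);     // add punctuation marks during post-processing
        nlsConfig.put("enable_inverse_text_normalization", true); // convert Chinese numerals to Arabic numerals
        nlsConfig.put("enable_words", true);                      // return word-level information
        nlsConfig.put("max_sentence_silence", 800);               // end-of-sentence silence threshold in milliseconds

        JSONObject params = new JSONObject();
        params.put("service_type", 4);                            // 4 indicates real-time speech recognition
        params.put("token", token);                               // update the token here if required
        params.put("nls_config", nlsConfig);
        return params.toString();
    }
}

// Usage (hypothetical): nui.setParams(TranscriberParams.build(token));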
3. Send audio data from the client
The client cyclically sends audio data to the server and continuously receives recognition results from the server.
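The following is a minimal sketch of this send loop on Android, assuming the audio source is the microphone. AudioRecord is a standard Android API, but the NuiSdk interface and its feedAudio method are hypothetical stand-ins for the send method of your SDK version; the RECORD_AUDIO permission must be granted.

import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

public class AudioFeeder {
    // Hypothetical stand-in for the SDK method that sends audio data to the server.
    public interface NuiSdk {
        void feedAudio(byte[] data, int length);
    }

    private static final int SAMPLE_RATE = 16000; // must match the sample_rate request parameter
    private volatile boolean running = true;

    public void stop() {
        running = false;
    }

    // Cyclically captures 16-bit mono PCM audio and hands each chunk to the SDK.
    public void run(NuiSdk nui) {
        int minBuf = AudioRecord.getMinBufferSize(SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
                SAMPLE_RATE, AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT, minBuf * 2);
        byte[] chunk = new byte[640]; // 20 ms of 16,000 Hz 16-bit mono audio
        recorder.startRecording();
        while (running) {
            int n = recorder.read(chunk, 0, chunk.length);
            if (n > 0) {
                nui.feedAudio(chunk, n);
            }
        }
        recorder.stop();
        recorder.release();
    }
}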
If an EVENT_SENTENCE_START event is reported, the server has detected the beginning of a sentence. Real-time speech recognition uses VAD to determine the beginning and end of each sentence. For example, the server returns the following response:
{
    "header": {
        "namespace": "SpeechTranscriber",
        "name": "SentenceBegin",
        "status": 20000000,
        "message_id": "a426f3d4618447519c9d85d1a0d1****",
        "task_id": "5ec521b5aa104e3abccf3d361822****",
        "status_text": "Gateway:SUCCESS:Success."
    },
    "payload": {
        "index": 1,
        "time": 0
    }
}
The following table describes the parameters in the header object.
Parameter | Type | Description |
namespace | String | The namespace of the message. |
name | String | The name of the message. The SentenceBegin message indicates that the server detects the beginning of a sentence. |
status | Integer | The status code. It indicates whether the request is successful. For more information, see the "Error codes" section of this topic. |
message_id | String | The ID of the message, which is automatically generated by the SDK. |
task_id | String | The GUID of the task. Record the value of this parameter to facilitate troubleshooting. |
status_text | String | The status message. |
The following table describes the parameters in the payload object.
Parameter | Type | Description |
index | Integer | The sequence number of the sentence, which starts from 1. |
time | Integer | The duration of the processed audio stream. Unit: milliseconds. |
If the enable_intermediate_result parameter is set to true, the SDK reports multiple EVENT_ASR_PARTIAL_RESULT events by calling the onNuiEventCallback method to return intermediate results of a sentence. For example, the server returns the following response:
{
    "header": {
        "namespace": "SpeechTranscriber",
        "name": "TranscriptionResultChanged",
        "status": 20000000,
        "message_id": "dc21193fada84380a3b6137875ab****",
        "task_id": "5ec521b5aa104e3abccf3d361822****",
        "status_text": "Gateway:SUCCESS:Success."
    },
    "payload": {
        "index": 1,
        "time": 1835,
        "result": "Sky in Beijing",
        "confidence": 1.0,
        "words": [{
            "text": "Sky",
            "startTime": 630,
            "endTime": 930
        }, {
            "text": "in",
            "startTime": 930,
            "endTime": 1110
        }, {
            "text": "Beijing",
            "startTime": 1110,
            "endTime": 1140
        }]
    }
}
Note
As shown in the header object, the value of the name parameter is TranscriptionResultChanged, which indicates that an intermediate result is obtained. For more information about other parameters in the header object, see the preceding table.
The following table describes the parameters in the payload object.
Parameter | Type | Description |
index | Integer | The sequence number of the sentence, which starts from 1. |
time | Integer | The duration of the processed audio stream. Unit: milliseconds. |
result | String | The recognition result of the sentence. |
words | List<Word> | The word information of the sentence. The word information is returned only when the enable_words parameter is set to true. |
confidence | Double | The confidence level of the recognition result of the sentence. Valid values: 0.0 to 1.0. A larger value indicates a higher confidence level. |
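As a sketch, an intermediate result might be extracted from such a message as follows. The assumption here is that the raw JSON message is available as a string inside the onNuiEventCallback method; the exact callback signature depends on your SDK version.

import org.json.JSONException;
import org.json.JSONObject;

public class PartialResultParser {
    // Returns the intermediate result of a TranscriptionResultChanged message,
    // or null for other message types.
    public static String parse(String message) throws JSONException {
        JSONObject root = new JSONObject(message);
        JSONObject header = root.getJSONObject("header");
        if (!"TranscriptionResultChanged".equals(header.getString("name"))) {
            return null;
        }
        // payload also carries index (sentence number) and time (processed audio duration in ms)
        return root.getJSONObject("payload").getString("result");
    }
}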
If an EVENT_SENTENCE_END event is reported, the server has detected the end of a sentence and returns the recognition result of the sentence. For example, the server returns the following response:
{
    "header": {
        "namespace": "SpeechTranscriber",
        "name": "SentenceEnd",
        "status": 20000000,
        "message_id": "c3a9ae4b231649d5ae05d4af36fd****",
        "task_id": "5ec521b5aa104e3abccf3d361822****",
        "status_text": "Gateway:SUCCESS:Success."
    },
    "payload": {
        "index": 1,
        "time": 1820,
        "begin_time": 0,
        "result": "Weather in Beijing.",
        "confidence": 1.0,
        "words": [{
            "text": "Weather",
            "startTime": 630,
            "endTime": 930
        }, {
            "text": "in",
            "startTime": 930,
            "endTime": 1110
        }, {
            "text": "Beijing",
            "startTime": 1110,
            "endTime": 1380
        }]
    }
}
Note
As shown in the header object, the value of the name parameter is SentenceEnd, which indicates that the server detects the end of the sentence. For more information about other parameters in the header object, see the preceding table.
The following table describes the parameters in the payload object.
Parameter | Type | Description |
index | Integer | The sequence number of the sentence, which starts from 1. |
time | Integer | The duration of the processed audio stream. Unit: milliseconds. |
begin_time | Integer | The time when the server returns the SentenceBegin message for the sentence. Unit: milliseconds. |
result | String | The recognition result of the sentence. |
words | List<Word> | The word information of the sentence. The word information is returned only when the enable_words parameter is set to true. |
confidence | Double | The confidence level of the recognition result of the sentence. Valid values: 0.0 to 1.0. A larger value indicates a higher confidence level. |
The following table describes the parameters in the words object.
Parameter | Type | Description |
text | String | The text of the word. |
startTime | Integer | The time when the word appears in the sentence. Unit: milliseconds. |
endTime | Integer | The time when the word ends in the sentence. Unit: milliseconds. |
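The following sketch shows how the final result and word timings might be read from a SentenceEnd message, under the same assumption that the raw JSON is available as a string.

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public class SentenceEndParser {
    // Prints the final result of a sentence and, if present, its word-level timings.
    public static void handle(String message) throws JSONException {
        JSONObject root = new JSONObject(message);
        if (!"SentenceEnd".equals(root.getJSONObject("header").getString("name"))) {
            return;
        }
        JSONObject payload = root.getJSONObject("payload");
        System.out.println("sentence " + payload.getInt("index") + ": "
                + payload.getString("result")
                + " (confidence " + payload.getDouble("confidence") + ")");
        JSONArray words = payload.optJSONArray("words"); // present only when enable_words is true
        if (words == null) {
            return;
        }
        for (int i = 0; i < words.length(); i++) {
            JSONObject w = words.getJSONObject(i);
            System.out.println(w.getString("text") + ": "
                    + w.getInt("startTime") + "-" + w.getInt("endTime") + " ms");
        }
    }
}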
4. Complete the recognition task
The client notifies the server that all audio data is sent. The server completes the recognition task and notifies the client that the task is completed.
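A minimal sketch of this step, reusing the hypothetical names from the earlier sketches: stopTranscription stands in for whichever method of your SDK version signals that all audio has been sent.

public class Finisher {
    // Hypothetical stand-in for the SDK method that ends the audio stream.
    public interface NuiSdk {
        int stopTranscription();
    }

    // Stops the capture loop, then tells the server that all audio data is sent;
    // the server then completes the task and notifies the client.
    public static void finish(AudioFeeder feeder, NuiSdk nui) {
        feeder.stop();
        int ret = nui.stopTranscription();
        if (ret != 0) {
            System.err.println("stop failed with code " + ret);
        }
    }
}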
Error codes
For more information about the error codes that the real-time speech recognition service may return, see Error codes.