Intelligent Speech Interaction is developed based on state-of-the-art technologies such as speech recognition, speech synthesis, and natural language understanding. You can integrate Intelligent Speech Interaction into your enterprise applications so that they can listen to, understand, and speak with users, giving users an immersive human-computer interaction experience. Intelligent Speech Interaction is suitable for various scenarios, including intelligent Q&A, intelligent quality inspection, real-time recording for court trials, real-time subtitling for speeches, and transcription of audio recordings. Intelligent Speech Interaction has been applied in many fields, such as finance, justice, and e-commerce.
Intelligent Speech Interaction V2.0 has been released. The new version provides easy-to-use SDKs and a feature-rich console where you can use features such as the self-learning platform to improve speech recognition performance. You are welcome to activate Intelligent Speech Interaction.
Short sentence recognition
Short sentence recognition recognizes short speech clips of up to one minute. The service applies to short speech interaction scenarios, such as voice search, voice command control, and short voice messages. It can also be integrated into various mobile apps, smart home appliances, and smart voice assistants. For more information, see Overview of short sentence recognition.
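As a rough sketch of how a client might prepare a one-shot request for a short clip, the snippet below assembles the request parameters and validates the audio length limit. The parameter names (`appkey`, `token`, `format`, `sample_rate`) and the supported sample rates are assumptions for illustration, not the official API contract:

```python
# Hypothetical sketch: assemble parameters for a single short-audio
# recognition request. All field names here are assumed, not official.

def build_short_recognition_request(appkey: str, token: str,
                                    duration_s: float,
                                    audio_format: str = "pcm",
                                    sample_rate: int = 16000) -> dict:
    """Validate the clip and build the query-parameter dict."""
    if duration_s > 60:
        # The service targets clips of up to one minute; longer audio
        # should go through real-time or file recognition instead.
        raise ValueError("clip exceeds the 60-second limit")
    if sample_rate not in (8000, 16000):          # assumed supported rates
        raise ValueError("unsupported sample rate")
    return {
        "appkey": appkey,        # project identifier (assumed name)
        "token": token,          # access token (assumed name)
        "format": audio_format,
        "sample_rate": sample_rate,
    }

params = build_short_recognition_request("my-appkey", "my-token", 12.5)
```

A real integration would attach the audio bytes to an HTTP request or SDK call with these parameters; see the official SDK documentation for the actual interface.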
Real-time speech recognition
Real-time speech recognition recognizes audio streams of any length in real time, so that text is output while the speaker is still talking. The built-in intelligent sentence breaking feature detects the start time and end time of each sentence. Real-time speech recognition applies to scenarios where you need to create subtitles for live videos, record meetings and court trials, and use smart voice assistants in real time. For more information, see API reference of real-time speech recognition.
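Client code for a streaming service like this typically reacts to per-sentence events: a begin event, replaceable partial hypotheses, and an end event carrying the final text. The event names and fields below are illustrative assumptions, not the documented protocol; the sketch only shows how sentence start/end times could be collected on the client side:

```python
# Sketch of client-side handling of streaming recognition events,
# assuming begin/partial/end events per sentence (names are illustrative).

class SentenceCollector:
    def __init__(self):
        self.sentences = []      # finished sentences: (text, begin_ms, end_ms)
        self._begin_ms = None
        self._partial = ""

    def on_event(self, event: dict) -> None:
        kind = event["type"]
        if kind == "sentence_begin":
            self._begin_ms = event["time_ms"]
            self._partial = ""
        elif kind == "partial_result":
            # Each partial hypothesis replaces the previous one.
            self._partial = event["text"]
        elif kind == "sentence_end":
            self.sentences.append(
                (event["text"], self._begin_ms, event["time_ms"]))

collector = SentenceCollector()
for ev in [
    {"type": "sentence_begin", "time_ms": 0},
    {"type": "partial_result", "text": "hello"},
    {"type": "sentence_end", "time_ms": 1200, "text": "hello world"},
]:
    collector.on_event(ev)
```

After the loop, `collector.sentences` holds one finished sentence with its begin and end timestamps, which is the information you would use to render subtitles or meeting minutes.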
Recording file recognition
Recording file recognition recognizes recording files that you upload. This service applies to scenarios where you need to check the audio quality of call centers, record court trials in databases, summarize meeting minutes, and file medical records. For more information, see API reference of recording file recognition.
If you are using the free trial edition, the system completes the recognition and returns the text converted from the speech data within 24 hours after you upload a recording file. If you are using a paid edition, the system completes the recognition and returns the text within 3 hours after you upload a recording file. However, if you upload a large amount of speech data within a short period of time, for example, more than 500 hours of recordings in half an hour, the recognition takes longer. If you need to convert a large amount of speech data to text at a time, contact the Alibaba Cloud pre-sales staff.
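Because file recognition is asynchronous, a client typically submits a task and then polls for its result rather than waiting on one request. The sketch below shows that submit-and-poll pattern with a stubbed status function; the task states (`QUEUED`, `RUNNING`, `SUCCESS`, `FAILED`) are assumptions, and a real integration would call the service's task-query API instead:

```python
# Illustrative polling loop for an asynchronous transcription task.
# State names and the status-dict shape are assumptions for this sketch.

def poll_until_done(get_status, max_polls: int = 10) -> dict:
    """Poll a task until it reaches a terminal state; return the final status."""
    for _ in range(max_polls):
        status = get_status()
        if status["state"] in ("SUCCESS", "FAILED"):
            return status
        # A real client would sleep between polls, e.g. with backoff.
    raise TimeoutError("task did not finish within the polling budget")

# Stubbed task that finishes on the third poll.
states = iter([
    {"state": "QUEUED"},
    {"state": "RUNNING"},
    {"state": "SUCCESS", "text": "transcript here"},
])
result = poll_until_done(lambda: next(states))
```

Given the 3-hour (paid) or 24-hour (trial) turnaround described above, a production client would poll at minute-scale intervals with backoff rather than in a tight loop.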
Speech synthesis
Speech synthesis is developed based on deep learning to convert text to natural-sounding, fluent speech. The service provides multiple speakers and allows you to adjust the speed, intonation, and volume of the generated speech. Speech synthesis applies to scenarios such as intelligent customer service, speech interaction, audio book reading, and accessible broadcasting. For more information, see Overview of speech synthesis.
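The tunable knobs mentioned above (speaker, speed, intonation, volume) can be pictured as fields of a synthesis request. The builder below clamps each value to a plausible range; the parameter names and ranges are illustrative assumptions, not the documented API contract:

```python
# Hypothetical TTS request builder showing the tunable synthesis knobs.
# Field names and value ranges are assumptions for this sketch.

def build_tts_request(text: str, voice: str = "default",
                      speed: int = 0, pitch: int = 0,
                      volume: int = 50) -> dict:
    def clamp(v, lo, hi):
        return max(lo, min(hi, v))
    return {
        "text": text,
        "voice": voice,                           # speaker name (assumed)
        "speech_rate": clamp(speed, -500, 500),   # assumed range
        "pitch_rate": clamp(pitch, -500, 500),    # assumed range
        "volume": clamp(volume, 0, 100),          # assumed range
    }

req = build_tts_request("Hello, world.", speed=9999)  # speed clamps to 500
```

Clamping client-side keeps out-of-range values from producing a server-side error, but the authoritative ranges are those in the service documentation.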
CosyVoice foundation model for speech synthesis
The CosyVoice foundation model for speech synthesis is a new speech synthesis technology that integrates text understanding and speech generation based on pre-trained large language models (LLMs). It can accurately parse and interpret text content and convert the text into natural-sounding speech.
Speech synthesis speaker customization (Enterprise Edition)
Based on deep learning, the speech synthesis speaker customization service allows you to quickly customize high-performance text-to-speech (TTS) speakers by using a small amount of training data. You can use the custom speakers for speech synthesis both in the Intelligent Speech Interaction console and on your smart devices.
If you need to customize speakers or want to learn more about the customization process, send an email to nls_support@service.aliyun.com.
Self-learning platform
You can use the self-learning platform to improve speech recognition performance through hotword training and custom linguistic models. You can add terms as hotwords and upload business-specific corpora to train linguistic models. In fields such as justice and finance, you can customize and optimize linguistic models to improve the accuracy of speech recognition in industry-specific scenarios.
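To make the hotword idea concrete, the sketch below builds a weighted hotword vocabulary payload of the kind a client might upload, where a higher weight biases recognition more strongly toward a term. The field names and the 1-5 weight range are assumptions for illustration, not the platform's actual schema:

```python
# Illustrative hotword vocabulary payload. Field names and the weight
# range (1-5, higher = stronger bias) are assumptions for this sketch.

def build_hotword_vocabulary(name: str, words: dict) -> dict:
    """Validate weights and build an uploadable vocabulary dict."""
    for word, weight in words.items():
        if not 1 <= weight <= 5:
            raise ValueError(f"weight for {word!r} out of range")
    return {
        "name": name,
        "words": [{"word": w, "weight": wt}
                  for w, wt in sorted(words.items())],
    }

vocab = build_hotword_vocabulary(
    "finance-terms", {"margin call": 4, "derivative": 3})
```

Domain terms like these are exactly what generic acoustic and language models tend to miss, which is why the platform lets you bias recognition toward them per business scenario.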
References
Getting Started: describes how to get started with Intelligent Speech Interaction.
Pricing: describes the billing of Intelligent Speech Interaction.
Developer Guide: describes the terms related to Intelligent Speech Interaction and explains how to use the service, for example, how to obtain an access token.
Console User Guide: describes the features provided in the Intelligent Speech Interaction console.
Documentation of a speech service: describes how to use a specific speech service, such as short sentence recognition, real-time speech recognition, recording file recognition, or speech synthesis.
Self-learning Platform: describes how to improve the performance of speech recognition by using the hotword training and custom linguistic model features provided by the self-learning platform.
Best Practices: provides the best practices for using Intelligent Speech Interaction.
FAQ: provides answers to frequently asked questions about Intelligent Speech Interaction.