Timestamp feature - Intelligent Speech Interaction - Alibaba Cloud Documentation Center

The speech synthesis service generates a timestamp for each word in a sentence. Each timestamp marks the position of that word in the synthesized audio stream. This feature is also called phoneme boundary detection for each word. Timestamps can be used to drive virtual speakers and to time video subtitles.

Notice

This feature can be used only for speakers that support phoneme boundary detection for each word.

Request parameters

To enable the timestamp feature, set the request parameter enable_subtitle to true when you initiate a request on the client.

Assume that you are using the SDK for Java. You can use the following configuration:

// Enable the timestamp feature so that the server returns timestamps for the text that is sent. This feature is disabled by default.
synthesizer.addCustomedParam("enable_subtitle", true);

Server response

If you set the enable_subtitle parameter to true in a request, the server returns a MetaInfo event that contains the timestamps corresponding to the text that is sent.

Parameter	Type	Description
subtitles	List	The information about the timestamps.

The following table describes the parameters contained in subtitles.

Parameter	Type	Description
text	String	The word in the text that is sent.
begin_time	Integer	The start timestamp of the word in the synthesized audio data, in milliseconds.
end_time	Integer	The end timestamp of the word in the synthesized audio data, in milliseconds.

Sample output

{
    "header": {
        "message_id": "05450bf69c53413f8d88aed1ee60****",
        "task_id": "640bc797bb684bd6960185651307****",
        "namespace": "SpeechSynthesizer",
        "name": "MetaInfo",
        "status": 20000000,
        "status_message": "GATEWAY|SUCCESS|Success."
    },
    "payload": {
        "subtitles": [
            {
                "text": "xx",
                "begin_time": 130,
                "end_time": 260
            },
            {
                "text": "xx",
                "begin_time": 260,
                "end_time": 370
            }
        ]
    }
}
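The payload above can be turned into typed objects before it is used. The following is a minimal sketch only, not part of the SDK: the class names SubtitleParser and Subtitle are hypothetical, and the regular expression assumes that the three keys appear in the order shown in the sample output and that text contains no escaped quotation marks. In production code, a real JSON library is the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the SDK: extracts subtitle entries from a MetaInfo message.
final class SubtitleParser {

    /** One entry of the subtitles list: a word and its time range in milliseconds. */
    static final class Subtitle {
        final String text;
        final int beginTime;
        final int endTime;

        Subtitle(String text, int beginTime, int endTime) {
            this.text = text;
            this.beginTime = beginTime;
            this.endTime = endTime;
        }
    }

    // Assumption: keys appear as text, begin_time, end_time, in that order, as in the sample output.
    private static final Pattern ENTRY = Pattern.compile(
            "\"text\"\\s*:\\s*\"([^\"]*)\"\\s*,\\s*"
            + "\"begin_time\"\\s*:\\s*(\\d+)\\s*,\\s*"
            + "\"end_time\"\\s*:\\s*(\\d+)");

    /** Returns the subtitle entries found in the JSON text, in the order they appear. */
    static List<Subtitle> parse(String json) {
        List<Subtitle> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(json);
        while (m.find()) {
            result.add(new Subtitle(
                    m.group(1),
                    Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3))));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened copy of the sample output, payload only.
        String json = "{\"payload\":{\"subtitles\":["
                + "{\"text\":\"xx\",\"begin_time\":130,\"end_time\":260},"
                + "{\"text\":\"xx\",\"begin_time\":260,\"end_time\":370}]}}";
        for (Subtitle s : parse(json)) {
            System.out.println(s.text + " " + s.beginTime + "-" + s.endTime + " ms");
        }
    }
}
```

Each parsed entry carries the word and its begin_time and end_time in milliseconds, so the duration of a word is simply endTime minus beginTime.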

Note

The speech synthesis service returns the text in subtitles based on how the original text is read, so the returned text may differ from the original text. Do not display the returned text as on-screen subtitles. Display the original text instead.
If you use this feature to generate video subtitles, you can derive the start and end timestamps of each sentence from the begin_time and end_time values in the response.
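As an illustration of deriving sentence-level timing, the sketch below takes the smallest begin_time and the largest end_time among the returned words and pairs that range with the original text for display. It assumes that one request carries exactly one sentence, so every returned word belongs to it. The names SentenceSpan and Word are hypothetical and are not part of the SDK.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the SDK: computes the time range of a sentence.
final class SentenceSpan {

    /** A word-level timestamp as returned in subtitles, in milliseconds. */
    static final class Word {
        final String text;
        final int beginTime;
        final int endTime;

        Word(String text, int beginTime, int endTime) {
            this.text = text;
            this.beginTime = beginTime;
            this.endTime = endTime;
        }
    }

    /**
     * Returns {begin, end} in milliseconds for the whole sentence.
     * Assumption: all words in the list belong to the same sentence.
     */
    static int[] span(List<Word> words) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("no word timestamps");
        }
        int begin = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (Word w : words) {
            begin = Math.min(begin, w.beginTime);
            end = Math.max(end, w.endTime);
        }
        return new int[] {begin, end};
    }

    public static void main(String[] args) {
        // Values taken from the sample output.
        List<Word> words = Arrays.asList(
                new Word("xx", 130, 260),
                new Word("xx", 260, 370));
        // Display the original text that was sent, not the returned text fields.
        String originalText = "<original sentence>";
        int[] s = span(words);
        System.out.println(s[0] + "-" + s[1] + " ms: " + originalText);
        // prints "130-370 ms: <original sentence>"
    }
}
```

If a request contains several sentences, the words would first need to be grouped by sentence. How to do that grouping is not described here and depends on the application.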