The speech synthesis service can generate a timestamp for each word in a sentence. Each timestamp marks the position of the word in the synthesized audio stream. This timestamp feature is also called word-level phoneme boundary detection. You can use the timestamps for virtual speakers and video subtitles.
This feature can be used only for speakers that support phoneme boundary detection for each word.
Request parameters
To enable the timestamp feature, set the request parameter enable_subtitle to true when you initiate a request on the client.
For example, if you use the SDK for Java, add the following configuration:
// Specify whether to enable the timestamp feature to return the corresponding timestamps of the text to be sent. By default, this feature is not enabled.
synthesizer.addCustomedParam("enable_subtitle", true);
Server response
If you set the enable_subtitle parameter to true in a request, the server returns a MetaInfo event that contains the timestamps corresponding to the text that is sent.
Parameter | Type | Description |
---|---|---|
subtitles | List | The information about the timestamps. |
The following table describes the parameters contained in subtitles.
Parameter | Type | Description |
---|---|---|
text | String | The word in the text that is sent. |
begin_time | Integer | The start timestamp of the word in the synthesized audio data, in milliseconds. |
end_time | Integer | The end timestamp of the word in the synthesized audio data, in milliseconds. |
Sample output
{
    "header": {
        "message_id": "05450bf69c53413f8d88aed1ee60****",
        "task_id": "640bc797bb684bd6960185651307****",
        "namespace": "SpeechSynthesizer",
        "name": "MetaInfo",
        "status": 20000000,
        "status_message": "GATEWAY|SUCCESS|Success."
    },
    "payload": {
        "subtitles": [
            {
                "text": "xx",
                "begin_time": 130,
                "end_time": 260
            },
            {
                "text": "xx",
                "begin_time": 260,
                "end_time": 370
            }
        ]
    }
}
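The sample output can be read into word-level entries on the client. The following is a minimal sketch only, not part of the SDK: the class names SubtitleParser and WordTiming are hypothetical, and the regular expression assumes that each entry lists text, begin_time, and end_time in that order and that text contains no escaped quotation marks. In production code, a JSON library is the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the speech synthesis SDK.
class SubtitleParser {

    // One entry of the "subtitles" list in a MetaInfo event.
    static final class WordTiming {
        final String text;
        final int beginTime; // milliseconds
        final int endTime;   // milliseconds

        WordTiming(String text, int beginTime, int endTime) {
            this.text = text;
            this.beginTime = beginTime;
            this.endTime = endTime;
        }
    }

    // Assumption: fields appear in the order text, begin_time, end_time,
    // as in the sample output, and text has no escaped quotation marks.
    private static final Pattern ENTRY = Pattern.compile(
        "\\{\\s*\"text\"\\s*:\\s*\"([^\"]*)\"\\s*,"
        + "\\s*\"begin_time\"\\s*:\\s*(\\d+)\\s*,"
        + "\\s*\"end_time\"\\s*:\\s*(\\d+)\\s*\\}");

    // Extracts every subtitle entry found in the JSON text of a MetaInfo event.
    static List<WordTiming> parse(String json) {
        List<WordTiming> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(json);
        while (m.find()) {
            result.add(new WordTiming(
                m.group(1),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3))));
        }
        return result;
    }

    public static void main(String[] args) {
        String json = "{\"payload\":{\"subtitles\":["
            + "{\"text\":\"xx\",\"begin_time\":130,\"end_time\":260},"
            + "{\"text\":\"xx\",\"begin_time\":260,\"end_time\":370}]}}";
        for (WordTiming w : parse(json)) {
            System.out.println(w.text + " " + w.beginTime + " " + w.endTime);
        }
    }
}
```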
Note
The text values that the speech synthesis service returns in subtitles reflect how the original text is read aloud, so they can differ from the original text. Therefore, do not display the returned text as on-screen subtitles. Display the original text instead.
If you use this feature to generate video subtitles, you can derive the start and end timestamps of each sentence from the word-level timestamps in the response.
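One way to obtain sentence-level timestamps is sketched below. It is an illustration under stated assumptions, not the service's own method: the class SentenceSpan and its methods are hypothetical, the word list passed in is assumed to cover exactly one sentence (for example, because one sentence was sent per request), and the HH:MM:SS,mmm format is only one common subtitle timestamp layout.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the speech synthesis SDK.
class SentenceSpan {

    // Word-level timing taken from one entry of the "subtitles" list.
    static final class Word {
        final String text;
        final int beginTime; // milliseconds
        final int endTime;   // milliseconds

        Word(String text, int beginTime, int endTime) {
            this.text = text;
            this.beginTime = beginTime;
            this.endTime = endTime;
        }
    }

    // Returns {begin, end} in milliseconds for the sentence that these words form:
    // the earliest begin_time and the latest end_time among the words.
    static int[] span(List<Word> words) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("no words");
        }
        int begin = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (Word w : words) {
            begin = Math.min(begin, w.beginTime);
            end = Math.max(end, w.endTime);
        }
        return new int[] {begin, end};
    }

    // Formats a millisecond offset as HH:MM:SS,mmm.
    static String formatMillis(int ms) {
        int hours = ms / 3_600_000;
        int minutes = ms / 60_000 % 60;
        int seconds = ms / 1_000 % 60;
        int millis = ms % 1_000;
        return String.format("%02d:%02d:%02d,%03d", hours, minutes, seconds, millis);
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(
            new Word("xx", 130, 260),
            new Word("xx", 260, 370));
        int[] s = span(words);
        // Pair this time range with the original sentence text, not the returned text.
        System.out.println(formatMillis(s[0]) + " --> " + formatMillis(s[1]));
    }
}
```

With the two entries from the sample output, the sentence spans 130 ms to 370 ms.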