Support for timing / alignment data / SRT format

Hume Operations

This feature request is to provide timing data similar to what ElevenLabs provides:
* Timing information: timestamps for when the speech reaches a given point in the audio, at the character level.
* Alignments, i.e. "lipsync" data that indicates what shape the mouth would take at each timestamp if somebody were speaking this audio.
* Possibly support for subtitle formats like SRT.

March 1, 2025

Ilya

I would love to see this feature too. Some short-form videos today use subtitles where each word is highlighted separately.
Here is an example of the format that I'd love to get together with the response:
"words": [
  {
    "word": "ready",
    "start": 0.08,
    "end": 0.39999998,
    "confidence": 0.9988054,
    "punctuated_word": "Ready"
  },
  {
    "word": "to",
    "start": 0.39999998,
    "end": 0.48,
    "confidence": 0.99993455,
    "punctuated_word": "to"
  },
  {
    "word": "have",
    "start": 0.48,
    "end": 0.71999997,
    "confidence": 0.99994206,
    "punctuated_word": "have"
  },
  {
    "word": "your",
    "start": 0.71999997,
    "end": 0.88,
    "confidence": 0.9997631,
    "punctuated_word": "your"
  },
  {
    "word": "mind",
    "start": 0.88,
    "end": 1.28,
    "confidence": 0.99993217,
    "punctuated_word": "mind"
  },
  {
    "word": "blown",
    "start": 1.28,
    "end": 1.68,
    "confidence": 0.99995804,
    "punctuated_word": "blown"
  }
]
It's a pretty universal format that gives me the flexibility to decide whether to show a whole line using the timestamp range or to highlight each word separately.
Latency is not an issue, since I'd use this TTS offline via the API. Maybe you guys could add a flag to the request that adds timing data to the API response.
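To illustrate the flexibility of that format, here is a minimal Python sketch that groups a word list like the one above into SRT cues. The function names (`format_srt_time`, `words_to_srt`) and the `max_words_per_cue` parameter are made up for this example and are not part of any Hume API. It assumes each entry has `word`, `start`, and `end` in seconds, with `punctuated_word` optional.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words_per_cue=3):
    """Group word timings into SRT cues of at most max_words_per_cue words."""
    cues = []
    for i in range(0, len(words), max_words_per_cue):
        group = words[i:i + max_words_per_cue]
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        # Prefer the punctuated form when the entry provides one.
        text = " ".join(w.get("punctuated_word", w["word"]) for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Sample data copied from the word list in this comment (confidence omitted).
sample_words = [
    {"word": "ready", "start": 0.08, "end": 0.39999998, "punctuated_word": "Ready"},
    {"word": "to", "start": 0.39999998, "end": 0.48, "punctuated_word": "to"},
    {"word": "have", "start": 0.48, "end": 0.71999997, "punctuated_word": "have"},
    {"word": "your", "start": 0.71999997, "end": 0.88, "punctuated_word": "your"},
    {"word": "mind", "start": 0.88, "end": 1.28, "punctuated_word": "mind"},
    {"word": "blown", "start": 1.28, "end": 1.68, "punctuated_word": "blown"},
]

if __name__ == "__main__":
    print(words_to_srt(sample_words))
```

With the default of three words per cue this yields two cues, "Ready to have" (00:00:00,080 to 00:00:00,720) and "your mind blown" (00:00:00,720 to 00:00:01,680). Passing `max_words_per_cue=1` gives one cue per word, which is roughly what per-word highlighting needs.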

Richard Marmorstein

Merged in a post:

TTS timing and alignments

Richard Marmorstein

March 21, 2025

Richard Marmorstein

This is related to https://hume.canny.io/feature-requests/p/tts-timing-and-alignments, a broader feature request for the API to return timing data.

Richard Marmorstein

One thing to try: generate the audio with Octave's text-to-speech, then feed that audio into a third-party speech-to-text API that returns word alignments to get the timing data.

This is not as good as a complete solution from Hume, since it adds cost and latency. But for now, if you really like an Octave voice and require timing data, this is a way to have your cake and eat it too.

Sean Ray

I am a power user of this ElevenLabs API method with roboedit.app. I would love to see Hume introduce this. I've done some work on normalizing these alignments in my backend.
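For readers wondering what normalizing character-level alignments can involve, here is a minimal Python sketch that collapses per-character timings into per-word timings by splitting on whitespace. It is not Sean's backend code, which isn't shown in this thread. The input shape (three parallel lists of characters, start times, and end times in seconds) is an assumption about what a character-level alignment payload might look like, and `chars_to_words` is a name made up for this example.

```python
def chars_to_words(characters, starts, ends):
    """Collapse per-character timings into per-word timings.

    Assumes three parallel lists: one character per entry, with the start
    and end time of each character in seconds. Whitespace separates words.
    """
    if not (len(characters) == len(starts) == len(ends)):
        raise ValueError("characters, starts, and ends must have equal length")

    words = []
    current = None  # the word being built, or None between words
    for ch, start, end in zip(characters, starts, ends):
        if ch.isspace():
            # Whitespace closes the current word, if any.
            if current is not None:
                words.append(current)
                current = None
            continue
        if current is None:
            current = {"word": ch, "start": start, "end": end}
        else:
            current["word"] += ch
            current["end"] = end
    if current is not None:
        words.append(current)
    return words


if __name__ == "__main__":
    chars = list("Hi you")
    starts = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    ends = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    print(chars_to_words(chars, starts, ends))
```

The output uses the same `word` / `start` / `end` keys as the word list requested earlier in this thread, so character-level data could feed the same subtitle tooling. Punctuation stays attached to the word it touches, which may or may not be what a given renderer wants.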