This feature request is to provide timing data similar to what ElevenLabs provides:
* Character-level timing information, i.e. the timestamp in the audio at which the speech reaches each character of the input text.
* Alignments, i.e. "lipsync" data that indicates what the shape of the mouth would be at each timestamp if somebody were to speak this audio.
* Possible support for formats like SRT?
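
To illustrate how the character-level timing could feed SRT output, here is a minimal sketch. It assumes the timing data arrives as three parallel lists (characters, start times, and end times in seconds). That shape, and the names `chars_to_words`, `words_to_srt`, and `max_words`, are hypothetical and only for illustration; they are not an existing API.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chars_to_words(chars: List[str], starts: List[float], ends: List[float]) -> List[Word]:
    """Group per-character timings into words, splitting on whitespace."""
    words: List[Word] = []
    current, word_start, word_end = "", 0.0, 0.0
    for ch, start, end in zip(chars, starts, ends):
        if ch.isspace():
            if current:
                words.append((current, word_start, word_end))
                current = ""
            continue
        if not current:
            word_start = start  # first character of a new word
        current += ch
        word_end = end
    if current:
        words.append((current, word_start, word_end))
    return words


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Emit one SRT cue per group of up to max_words words (hypothetical policy)."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w for w, _, _ in group)
        start, end = group[0][1], group[-1][2]
        cues.append(
            f"{len(cues) + 1}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    chars = list("Hi all")
    starts = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    ends = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    print(words_to_srt(chars_to_words(chars, starts, ends)))
```

Grouping by a fixed word count is only one possible cue policy; splitting on punctuation or on a maximum cue duration would work just as well with the same timing data.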