Please add word boundary timing information (in milliseconds) to `assistant_message` and/or `audio_output`. This is to support real-time video avatar animation.
Nice-to-have: alongside the word boundary timings, supply Oculus visemes and their timings, as Azure TTS does.
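To illustrate how a client might consume this, here is a rough sketch in Python. Everything in it is hypothetical: the `TimedEvent` shape, the `label` and `start_ms` field names, and the sample values are placeholders for illustration, not part of any existing Hume payload. It assumes each word or viseme arrives with a start offset in milliseconds from the beginning of the audio clip, sorted by time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedEvent:
    """Hypothetical timing entry; not an existing Hume type."""
    label: str     # a word, or a viseme name
    start_ms: int  # assumed offset from the start of the audio clip


def active_event(events: Sequence[TimedEvent], playback_ms: int) -> Optional[TimedEvent]:
    """Return the latest event that has started by playback_ms, or None.

    Assumes events are sorted by start_ms in ascending order.
    """
    starts = [e.start_ms for e in events]
    i = bisect_right(starts, playback_ms) - 1
    return events[i] if i >= 0 else None


# Made-up sample data, purely for illustration.
words = [
    TimedEvent("Hello", 0),
    TimedEvent("there", 420),
    TimedEvent("friend", 810),
]

current = active_event(words, 500)
print(current.label if current else None)
```

An avatar renderer could call `active_event` once per frame with the current audio playback position, and run the same lookup over a viseme list to pick the mouth shape.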
Hume EVI2 voices with word boundary timing plus the existing prosody output would put Hume a LONG way ahead of the opposition in this space.