The use case is to use this endpoint as the single way to pipe user mic data in and get everything you need back, including a raw transcription.
The output should include the transcribed text, word- or character-level timing boundaries, and an emotion dictionary for each word.
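As a rough illustration, the response could look something like the sketch below. All field names here (`transcript`, `words`, `start_ms`, `emotions`, and so on) are assumptions for discussion, not a confirmed schema:

```python
# Hypothetical response shape (field names are illustrative assumptions):
# the transcribed text, word-level timing boundaries in milliseconds,
# and an emotion dictionary of label -> score for each word.
example_response = {
    "transcript": "hello there",
    "words": [
        {"word": "hello", "start_ms": 120, "end_ms": 480,
         "emotions": {"joy": 0.72, "neutral": 0.28}},
        {"word": "there", "start_ms": 510, "end_ms": 830,
         "emotions": {"joy": 0.55, "neutral": 0.45}},
    ],
}

# Joining the word list should round-trip back to the transcript,
# so clients only need the "words" array for most rendering.
joined = " ".join(w["word"] for w in example_response["words"])
```

Keeping the transcript derivable from the word list means a client can render either a plain transcription or a fully annotated timeline from the same payload.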
It would also be good to support headerless PCM data sent as raw bytes as they arrive, without having to add headers to each chunk.
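A minimal sketch of what headerless chunking could look like on the client side, assuming the audio format (sample rate, bit depth, channels) is negotiated once at connection setup rather than carried per chunk. The chunk size and format here are assumptions for illustration:

```python
from typing import Iterator

def pcm_chunks(pcm: bytes, chunk_size: int = 3200) -> Iterator[bytes]:
    """Yield raw PCM bytes in fixed-size chunks with no per-chunk headers.

    3200 bytes = 100 ms of 16-bit mono audio at 16 kHz (assumed format,
    agreed on once when the stream is opened, not repeated per chunk).
    """
    for offset in range(0, len(pcm), chunk_size):
        yield pcm[offset:offset + chunk_size]

# Simulate a short mic capture: 0.5 s of 16-bit mono silence at 16 kHz.
mic_audio = bytes(int(0.5 * 16000) * 2)  # 16000 zero bytes
chunks = list(pcm_chunks(mic_audio))
# Each chunk is sent to the endpoint as-is; the server reassembles
# the stream from the byte sequence alone.
```

The point of the design is that the hot path stays trivial: the client just forwards whatever the mic buffer hands it, with no per-chunk framing or re-encoding.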
It should also detect non-verbal emotion between spoken words, such as crying or laughing.
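Non-verbal sounds could be reported as events alongside the word list, sharing the same timing fields so a client can merge them onto one timeline. Again, the names (`events`, `type`) are hypothetical, not a confirmed schema:

```python
# Hypothetical shape for non-verbal events detected between words:
# each event has a type label and the same millisecond timing
# boundaries as words in the response.
events = [
    {"type": "crying", "start_ms": 2050, "end_ms": 3100},
    {"type": "laughter", "start_ms": 860, "end_ms": 1420},
]

# Because events and words share timing fields, a client can sort
# everything by start time to interleave them with the transcript.
timeline = sorted(events, key=lambda e: e["start_ms"])
```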