Gradium (https://gradium.ai/), a commercial offshoot of Kyutai (an open-source lab), are focusing on emotion: both recognising emotion and working out which emotion to use depending on context. I don't think any of their existing public models does that yet, but they demoed it pretty impressively at the ai-Pulse conference.
Chatterbox does something like that. For example, if the input is
"so and so," he <verb>
and the verb is not just "said" but "chuckled", "whispered", or "said shakily", the output is modified accordingly, and if there's an indication that a woman is speaking, it may pitch up during the quotation. It also tries to guess the emotional tone from the text itself, so if a passage reads as angry, it may try to make it sound angry. That's more hit-and-miss, but when it hits, it hits really well. A very common failure case: imagine someone is trying to psych themselves up and says internally, "come on, Steve, stand up and keep going". It'll read that in a deeper voice, as if it were being spoken by a WW2 sergeant to a soldier.