
Okay can I ask a question that has been bothering me for a long time?

Why do seemingly all these text-to-speech programs attempt to produce spoken voice based solely on raw text? Why don't they consume a MIDI-like text-markup language where you can write phonetic pronunciations along with markup about the emotion, volume, speed, etc.? I feel like this is a huge unnecessary roadblock holding back this kind of technology. It'd be like if every music composition program rendered a wave file not by MIDI or VST, but by trying to visually read sheet music. I totally understand why TTS solutions that have to consume arbitrary content, like screen-readers, need to read purely raw text. But content creators don't need to be limited to raw text! Why is everyone doing it that way? Where is the TTS markup language for content creators?



I don’t know how many of the solutions offer this, but there is a markup language for TTS:

https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Langua...

Amazon Polly (which seems kind of ancient with all these new solutions showing up) has supported SSML for some time.

AWS Polly SSML docs: https://docs.aws.amazon.com/polly/latest/dg/ssml.html
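
For anyone who hasn't seen SSML, here is a rough Python sketch that assembles a small document using a few standard elements (`<phoneme>`, `<break>`, `<prosody>`). The helper name `phoneme` and the sample sentence are my own illustration, not taken from any provider's SDK, and which tags and attribute values a given engine actually honors varies, so check its docs.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def phoneme(text, ipa):
    """Wrap `text` in an SSML <phoneme> element carrying an IPA pronunciation."""
    # quoteattr() returns the value already wrapped in quotes and escaped.
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


# Illustrative content only: a pronunciation override, a pause, and a
# slower, quieter passage.
ssml = (
    "<speak>"
    "I say " + phoneme("pecan", "pɪˈkɑːn") + "."
    '<break time="500ms"/>'
    '<prosody rate="slow" volume="soft">'
    + escape("This part is slow & quiet.")
    + "</prosody>"
    "</speak>"
)

# Parsing is only a well-formedness check; it does not validate against
# the SSML schema or any engine's supported subset.
ET.fromstring(ssml)
print(ssml)
```

The resulting string is what you would hand to an SSML-capable engine in place of raw text.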


In practice they are next to useless; the expressions are not very... expressive (just try it in the AWS editor). I suspect an LLM would be able to infer the context, or we can use prompt engineering to generate the appropriate tokens encoding emotions for the intermediate neural codecs directly (Mel spectrograms are so passé now post Vall-E).


Something I always noticed is that they get Morgan Freeman to do voiceovers for science shows, but he’s not a scientist so he has a sort of generic inflection when he talks about the various ideas in the script. And then you watch Carl Sagan’s COSMOS, where he co-wrote the material, and there is so much depth and expression to his delivery. There’s a lifetime of public speaking, specifically delivering complex scientific topics to a general audience, that Sagan drew from when recording his show.

Sagan would have learned this through conversation with people, and careful updates to his expression and delivery as he matured.

I guess an LLM could improve upon previous methods but I would also say there is a gap that even humans struggle with, which requires really complex knowledge both of public speaking and of the material. It may be a long time before we can really master that with AI systems.


Maybe the only way to express speech precisely is the speech itself?


ElevenLabs dev here - we believe this is a 2 step process and agree it is needed!

First, we want the quality you get out of the box to already be brilliant by taking context into account. Granted, that sometimes gets you only 98% of the way there, and we are working to add manipulation options to get you to that 100%; for long texts, though, the quality you get is great.

For the second part, TTS providers currently give you complicated toggles that frequently don't affect the speech in the way you want. Initially we are adding basic SSML-like support, and we have a more robust language-based idea which we hope will come over the next few months!


Your context-aware TTS is already sounding very good. If I were using it to produce a narration that other people would be listening to, I would want to make at most a couple of minor adjustments every few sentences. Most of those adjustments would fall into a few categories: stronger or weaker stress on a particular word, rising or falling intonation on a phrase, longer or shorter pauses between words, and correction of the phonemes in a word. A half dozen toggles for those adjustments might be enough for most cases.

I wonder, though, how much training people would need to understand what adjustments need to be made. Experienced actors and narrators should have a good sense of what to fix, but many people might have trouble identifying what sounds strange in the initial TTS output and how it needs to be changed.


I feel like it would be much harder to create a set of hard controls, like MIDI, to affect the voice acting vs. trying to do a co-embedding space of voices and descriptions of the voices and just saying "Say this quietly and meanly". Thoughts?


Exactly! The only issue is having a well-labelled dataset with those types of cues. We have an idea on how to do it though!


> I feel like this is a huge unnecessary roadblock holding back this kind of technology.

There are speech synthesis markup languages, like SSML. And targeting an even lower level has always been possible with commercial speech engines.

Think about how tedious and time-consuming it is to mark up a large amount of copy. Unless we’re talking about little hints here and there (which is also doable), it rapidly becomes more cost-effective to just pay for voice talent. For this stuff to be appealing it really must be close to fire and forget.


I think there are two "sweet spots" here.

The first is being able to correct a few things that sound off, as another poster pointed out. "Hey, that's not actually how you pronounce 'synecdoche', it should be 'sɪˈnɛk.doʊ.k'." Or "Less emphasis on the first word, more on the second". Little corrections like that. I imagine a two-stage process where the first generates 'best guess' SSML (or whatever markup) based on the text. Then the content creator can modify it as necessary before it goes into the second step of actual voice synthesis.

The second sweet spot is when your text is dynamically generated. Marking up the entire copy might be a lot of work for pre-written text, but it's a great option for dynamically generated text.
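
The two-stage flow could look roughly like the Python sketch below. Everything in it is hypothetical: `draft_ssml` is a trivial stand-in for a model that would guess the markup (here it only wraps sentences in `<s>`), and `override_pronunciation` represents one manual fix by the content creator. The regex substitution is naive on purpose.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def draft_ssml(text):
    """Stage 1 (stand-in): produce 'best guess' SSML from raw text.

    A real system would infer emphasis and prosody here; this sketch only
    splits on sentence-ending punctuation and wraps each sentence in <s>.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    body = "".join(f"<s>{escape(s)}</s>" for s in sentences)
    return f"<speak>{body}</speak>"


def override_pronunciation(ssml, word, ipa):
    """Manual correction: wrap whole-word matches of `word` in <phoneme>.

    Naive by design: it would also match tag or attribute names that equal
    `word`, and running it twice for the same word nests the tags.
    """
    tag = f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'
    # A callable replacement avoids backslash/group interpretation in `tag`.
    return re.sub(rf"\b{re.escape(word)}\b", lambda m: tag, ssml)


draft = draft_ssml("I like pecan pie. Do you?")
edited = override_pronunciation(draft, "pecan", "pɪˈkɑːn")
print(edited)  # this string is what would go on to the synthesis step
```

The point of the split is that the creator edits a small, readable intermediate artifact instead of marking up the whole copy by hand.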


Just my 2 cents, but it seems to me that too little focus in the tech world has been spent on understanding what speech is. Tonality, mood, facial expression and body language are all ignored, or people pretend there is no such thing. I believe this is broadly true in western society by now: people went digital but do not yet realize why communication went to hell in the last decade.


I used to work in automotive navigation. Other colleagues handled our voice systems, but I do remember all our prompts were written in SSML[1] with varying amounts of specificity. We would use Lua to configure and customize the SSML, including some custom extensions for different voice renderers.

Even with the prompts marked up, there were huge differences between products. Some car OEMs would pay higher fees for better voices and some wouldn't. It's fairly tedious work and difficult to scale as the number of sentences grows. We basically built up a catalog over many years, and the prompts were always explicitly stated as part of our requirements docs. Of course the renderers could say anything you wanted, but letting it go free-form was a big risk from a product point of view.

[1] https://en.m.wikipedia.org/wiki/Speech_Synthesis_Markup_Lang...
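
A catalog of marked-up prompts with dynamic values filled in might be sketched like this. It is in Python rather than Lua, and the `PROMPTS` catalog, its keys, and `render_prompt` are invented for illustration; nothing here reflects the actual system described above.

```python
from xml.sax.saxutils import escape

# Hypothetical catalog: each prompt is fixed, reviewed SSML with named slots.
PROMPTS = {
    "turn": (
        "<speak>In {distance}, "
        '<emphasis level="strong">turn {direction}</emphasis> '
        "onto {street}.</speak>"
    ),
    "arrive": (
        '<speak>You have arrived.<break time="300ms"/> '
        "{destination} is on your {side}.</speak>"
    ),
}


def render_prompt(key, **values):
    """Fill a catalog prompt, XML-escaping every dynamic value first."""
    safe = {name: escape(str(value)) for name, value in values.items()}
    return PROMPTS[key].format(**safe)


print(render_prompt("turn", distance="300 meters", direction="left",
                    street="B&O Avenue"))
```

Keeping the markup in the catalog and only substituting escaped values is one way to avoid the free-form risk: the renderer only ever sees prompts whose structure someone has signed off on.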


There's lots of text and audio already out there without this markup, and that's probably the key practical factor. Similarly for use cases: converting text that already exists is much more approachable than creating new marked-up text.

Tortoise lets you add prompts into the text like [I am angry] which modifies the voice interestingly.


There's a pretty advanced Mac OS speech markup language, I wrote about it here: https://www.mattmontag.com/personal/mac-os-x-speech-synthesi...

Going back further, there was also a prosody markup for Sound Blaster speech synthesis (Dr. Sbaitso, anyone?).


I think markup would always be more work and less effective than using your own voice input to guide its tone.


But not nearly as manageable. Imagine saying the same thing about music, for example. Musical notation is clearly more work than just humming a tune, but there's still a need for it.


I remember that in the 1980s I had speech synthesis software on an 8-bit computer that accepted either normal text or phonetic notation with extra modifiers for basic things like "make this a question".


Do you remember what that was? DECtalk was around in the 80s and so might’ve been that, but it wasn’t a generally available thing. Dr. Sbaitso was common, but that wasn’t until 91/92.


Yes I do, it was a Commodore 64 cartridge called "Black Box 8". And it spoke Polish with the right accent, including all the sounds not present in English.

I read back then that it was made domestically in Poland, but there was no such thing as IP protection at the time, so it is very likely that it was based on the work of Dennis Klatt (same as DECtalk). When I heard some DECtalk recordings in a YouTube video not long ago, it immediately reminded me of the Commodore 64 Black Box 8. Although DECtalk spoke English and Black Box 8 spoke Polish, there is some similarity that can be heard in their voices (not pitch, which was a user setting, but more of a rhythm, if that makes sense).


There are solutions that let you use curves like in an audio program to define inflection and pitch, speed of speaking, etc. Some of the competitors of this post's service do that.


I wonder if it would be possible to automate this by pairing the speech synthesis with an ML model that understands the context of the text it is parsing.


As a note, there are indeed markup formats for writing phonetic pronunciations that also allow everything else you mentioned.

It's called SSML.


As the sibling comment notes, there is in fact markup for this, and the results are actually pretty great.



