Okay can I ask a question that has been bothering me for a long time? Why do see...

bredren · on Jan 13, 2023

I don’t know how many of the solutions offer this, but there is a markup language for TTS:

https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Langua...

Amazon Polly (which seems kind of ancient with all these new solutions showing up) has supported SSML for some time.

AWS Polly SSML docs: https://docs.aws.amazon.com/polly/latest/dg/ssml.html
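
For readers who have not seen SSML, here is a minimal sketch of what the markup looks like, built with Python's standard library. The sentence, the attribute values, and the `build_ssml` helper name are illustrative assumptions; the elements (`emphasis`, `break`, `prosody`, `phoneme`) come from the SSML spec, but which ones a given engine or voice actually honors varies, so check the provider docs before relying on any of them.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Build a small SSML document using a few standard elements."""
    speak = ET.Element("speak")
    speak.text = "I "

    # <emphasis>: ask for stronger stress on one word.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "really"
    emphasis.tail = " mean it."

    # <break>: insert an explicit pause.
    pause = ET.SubElement(speak, "break", {"time": "500ms"})
    pause.tail = " "

    # <prosody>: adjust rate and pitch for a phrase.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "low"})
    prosody.text = "Please listen carefully."
    prosody.tail = " You say "

    # <phoneme>: spell out the pronunciation in IPA (example transcription).
    phoneme = ET.SubElement(
        speak, "phoneme", {"alphabet": "ipa", "ph": "təˈmɑːtoʊ"}
    )
    phoneme.text = "tomato"
    phoneme.tail = "."

    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml()
print(ssml)
```

The resulting string is what you would paste into an SSML-aware editor or pass to a TTS API that accepts SSML input.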

KRAKRISMOTT · on Jan 13, 2023

In practice they are next to useless, the expressions are not very...expressive (just try it in the AWS editor). I suspect an LLM would be able to infer the context or we can use prompt engineering to generate the appropriate tokens encoding emotions for the intermediate neural codecs directly (Mel spectrograms are so passé now post Vall-E).

SequoiaHope · on Jan 13, 2023

Something I always noticed is that they get Morgan Freeman to do voiceovers for science shows, but he’s not a scientist so he has a sort of generic inflection when he talks about the various ideas in the script. And then you watch Carl Sagan’s COSMOS, where he co-wrote the material, and there is so much depth and expression to his delivery. There’s a lifetime of public speaking, specifically delivering complex scientific topics to a general audience, that Sagan drew from when recording his show.

Sagan would have learned this through conversation with people, and careful updates to his expression and delivery as he matured.

I guess an LLM could improve upon previous methods but I would also say there is a gap that even humans struggle with, which requires really complex knowledge both of public speaking and of the material. It may be a long time before we can really master that with AI systems.

slim · on Jan 13, 2023

Maybe the only way to express speech precisely is the speech itself?

matisqe · on Jan 13, 2023

ElevenLabs dev here - we believe this is a 2 step process and agree it is needed!

First, we want the quality you get out-of-the-box to already be brilliant by taking context into account. Granted, that sometimes gets you 98% there, and we are working to add manipulation options to get you to that 100%; for long texts, though, the quality you get is great.

For the second part, TTS providers currently give complicated toggles that frequently don't affect the speech in the way you want. Initially we are adding basic SSML-like support, and we have a more robust language-based idea which we hope will come over the next few months!

tkgally · on Jan 13, 2023

Your context-aware TTS is already sounding very good. If I were using it to produce a narration that other people would be listening to, I would want to make at most a couple of minor adjustments every few sentences. Most of those adjustments would fall into a few categories: stronger or weaker stress on a particular word, rising or falling intonation on a phrase, longer or shorter pauses between words, and correction of the phonemes in a word. A half dozen toggles for those adjustments might be enough for most cases.

I wonder, though, how much training people would need to understand what adjustments need to be made. Experienced actors and narrators should have a good sense of what to fix, but many people might have trouble identifying what sounds strange in the initial TTS output and how it needs to be changed.

spywaregorilla · on Jan 13, 2023

I feel like it would be much harder to create a set of hard controls, like MIDI, to affect the voice acting vs. trying to do a co-embedding space of voices and descriptions of the voices and just saying "Say this quietly and meanly". Thoughts?

matisqe · on Jan 13, 2023

Exactly! The only issue is having a well-labelled dataset with those types of cues. We have an idea on how to do it though!

riceart · on Jan 13, 2023

> I feel like this is a huge unnecessary roadblock holding back this kind of technology.

There are speech synthesis markup languages, like SSML. And targeting even lower level has always been possible with commercial speech engines.

Think about how tedious and time consuming it is to mark up a large amount of copy. Unless we’re talking about little hints here and there (which is also doable), it rapidly becomes more cost effective to just pay for voice talent. For this stuff to be appealing it really must be close to fire and forget.

feoren · on Jan 13, 2023

I think there are two "sweet spots" here.

The first is being able to correct a few things that sound off, as another poster pointed out. "Hey, that's not actually how you pronounce 'synecdoche', it should be 'sɪˈnɛk.də.ki'." Or "Less emphasis on the first word, more on the second". Little corrections like that. I imagine a two-stage process where the first generates 'best guess' SSML (or whatever markup) based on the text. Then the content creator can modify it as necessary before it goes into the second step of actual voice synthesis.

The second sweet spot is when your text is dynamically generated. Marking up the entire copy might be a lot of work for pre-written text, but it's a great option for dynamically generated text.
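
A rough sketch of the correct-then-synthesize flow described above, under heavy assumptions: `first_pass` stands in for whatever model would produce the "best guess" markup (here it guesses nothing and only tokenizes), `Token`, `correct`, and `render` are hypothetical names, and the actual synthesis step is left out entirely. The point is only that a handful of per-word overrides can be layered on before the markup goes to a voice engine.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Token:
    text: str
    ipa: Optional[str] = None       # pronunciation override, if any
    emphasis: Optional[str] = None  # e.g. "strong", "moderate", "reduced"


def first_pass(text: str) -> List[Token]:
    """Stage 1 stand-in: split into words, punctuation, and whitespace."""
    return [Token(piece) for piece in re.findall(r"\w+|[^\w\s]+|\s+", text)]


def correct(
    tokens: List[Token],
    pronunciations: Optional[Dict[str, str]] = None,
    emphasis: Optional[Dict[str, str]] = None,
) -> List[Token]:
    """Editor pass: apply per-word overrides (keys are lowercase words)."""
    pronunciations = pronunciations or {}
    emphasis = emphasis or {}
    for tok in tokens:
        key = tok.text.lower()
        if key in pronunciations:
            tok.ipa = pronunciations[key]
        if key in emphasis:
            tok.emphasis = emphasis[key]
    return tokens


def render(tokens: List[Token]) -> str:
    """Serialize tokens to SSML, escaping text and attribute values."""
    parts = []
    for tok in tokens:
        piece = escape(tok.text)
        if tok.ipa:
            piece = f"<phoneme alphabet=\"ipa\" ph={quoteattr(tok.ipa)}>{piece}</phoneme>"
        if tok.emphasis:
            piece = f"<emphasis level={quoteattr(tok.emphasis)}>{piece}</emphasis>"
        parts.append(piece)
    return "<speak>" + "".join(parts) + "</speak>"


tokens = first_pass("That is a synecdoche, not a metaphor.")
tokens = correct(
    tokens,
    pronunciations={"synecdoche": "sɪˈnɛkdəki"},  # assumed IPA transcription
    emphasis={"not": "strong"},
)
ssml = render(tokens)
print(ssml)
```

Keeping the overrides at the token level, rather than running find-and-replace on the finished markup, avoids accidentally rewriting tag names or attribute values.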

bdhcuidbebe · on Jan 13, 2023

Just my 2 cents, but it seems to me that too little focus in the tech world has been spent on understanding what speech is. Tonality, mood, facial expression and body language are all ignored, or people pretend there is no such thing. I believe this is broadly true in western society by now: people went digital but do not yet realize why communication went to hell in the last decade.

havnagiggle · on Jan 14, 2023

I used to work in automotive navigation. Other colleagues handled our voice systems, but I do remember all our prompts were written in SSML[1] with varying amounts of specificity. We would use Lua to configure and customize the SSML, including some custom extensions for different voice renderers.

Even with the prompts marked up, there were huge differences between products. Some car OEMs would pay higher fees for better voice and some wouldn't. It's fairly tedious work and difficult to scale as the number of sentences grows. We basically built up a catalog over many years and they were always explicitly stated as part of our requirements docs. Of course the renderers could say anything you wanted, but letting it be free-form was a big risk from a product point of view.

[1] https://en.m.wikipedia.org/wiki/Speech_Synthesis_Markup_Lang...
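
To make the "catalog of marked-up prompts" idea concrete, here is a hedged sketch of a templated navigation prompt. It is written in Python rather than the Lua mentioned above, and the `turn_prompt` function, the wording, and the tag choices are all invented for illustration; they are not taken from any real navigation product. Note that `say-as` `interpret-as` values are engine-specific, so `"cardinal"` is an assumption to verify against your renderer.

```python
from xml.sax.saxutils import escape

ALLOWED_DIRECTIONS = {"left", "right"}


def turn_prompt(distance_m: int, direction: str, street: str) -> str:
    """Fill a fixed SSML template with dynamic route data."""
    if direction not in ALLOWED_DIRECTIONS:
        # Only catalogued phrasings are allowed; nothing free-form.
        raise ValueError(f"unsupported direction: {direction!r}")
    return (
        "<speak>"
        f'In <say-as interpret-as="cardinal">{int(distance_m)}</say-as> meters, '
        f'<break time="200ms"/>turn <emphasis level="moderate">{direction}</emphasis> '
        # Street names come from map data, so escape XML special characters.
        f"onto {escape(street)}."
        "</speak>"
    )


print(turn_prompt(300, "left", "Smith & Jones Avenue"))
```

Because the markup is written once per template and the variable parts are filled in at runtime, the per-sentence markup cost is paid only when a new prompt type is added to the catalog.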

IanCal · on Jan 13, 2023

There's lots of text and audio already without this; that's probably the key practical factor. Similarly for use cases: converting text that already exists is much more approachable than creating new marked-up text.

Tortoise lets you add prompts into the text like [I am angry] which modifies the voice interestingly.

montag · on Jan 13, 2023

There's a pretty advanced Mac OS speech markup language, I wrote about it here: https://www.mattmontag.com/personal/mac-os-x-speech-synthesi...

Going back further, there was also a prosody markup for Sound Blaster speech synthesis (Dr. Sbaitso, anyone?).

IshKebab · on Jan 13, 2023

I think markup would always be more work and less effective than using your own voice input to guide its tone.

wpietri · on Jan 13, 2023

But not nearly as manageable. Imagine saying the same thing about music, for example. Musical notation is clearly more work than just humming a tune, but there's still a need for it.

Roark66 · on Jan 13, 2023

I remember in the 1980s there was speech synthesis software I had on an 8-bit computer that accepted either normal text, or phonetic notation that had extra modifiers for basic things like "make this a question" etc.

jsjohnst · on Jan 13, 2023

Do you remember what that was? DECtalk was around in the 80s and so might’ve been that, but it wasn’t a generally available thing. Dr Sbaitso was common, but that wasn’t until 91/92.

Roark66 · on Jan 14, 2023

Yes I do, it was a Commodore 64 cartridge called "Black Box 8". And it spoke Polish with the right accent with all the sounds not present in English etc.

I read back then that it was a domestic Polish make, but back then there was no such thing as IP protection, so it is very likely it was based on the work of Dennis Klatt (same as DECtalk). When I heard some DECtalk recordings in a YouTube video not long ago, it immediately reminded me of the Commodore 64 Black Box 8. Although DECtalk spoke English and Black Box 8 spoke Polish, there is some similarity that can be heard in their voices (not pitch, which was a user setting, but more the rhythm, if that makes sense).

jb1991 · on Jan 13, 2023

There are solutions that let you use curves like in an audio program to define inflection and pitch, speed of speaking, etc. Some of the competitors of this post's service do that.

altacc · on Jan 13, 2023

I wonder if it would be possible to automate this by pairing the speech synthesis with an ML model that understands the context of the text it is parsing.

lucasfcosta · on Jan 13, 2023

As a note, there are indeed markup formats for writing phonetic pronunciations that also allow everything you mentioned.

It's called SSML.

pmichaud · on Jan 13, 2023

As the sibling comment notes, there is in fact markup for this, and the results are actually pretty great.