
Speech Technology And Synthesis With Leigh Clark & Benjamin R. Cowan

Why does Siri sound like Siri, and why are we instantly able to recognize her voice and tell that she’s not human? Siri, Ivona, Google Home, and most speech synthesis systems have voices which are based on imitating a neutral citation style of speech and making it sound natural. But, in the real world, our voices convey more emotion and change.

In this article, we will talk about speech synthesis as performance, why the uncanny valley is a bankrupt concept, and how academics can escape from studying corporate speech technology as if it’s been bestowed by God.

Simon Cocking of Irish Tech News interviewed Dr Leigh Clark, a Postdoctoral Research Fellow, and Benjamin R. Cowan, Assistant Professor, both of University College Dublin, and Dr Matthew P. Aylett, CSO of CereProc Ltd., on their paper “Siri, Echo and Performance: You have to Suffer Darling”. Their work and argument about next-gen voice technology were presented at the Association for Computing Machinery’s leading conference on computer-human interaction, ACM CHI (pronounced ‘kai’), in Glasgow in May 2019.

You’re no fan of how speech technology is developed today – coming down hard on “the mimicry objective”. What is speech synthesis? How is it most commonly developed today?

Speech synthesis is taking text input and turning it into audio by getting a system to ‘speak’ the words. All computer systems that speak to you (such as Siri, Echo, Google Home etc.) use speech synthesis to convert text input into voice.

In the early days, this was all it had to do, but with computers entering the social domain, the requirement for voices to sound natural and to express themselves using emotion and emphasis has also become important. Often, to control non-text features (how you say something rather than what you say), text is “marked up” with commands which instruct the system to, for example, speak more slowly, with a higher pitch, or with a calm voice quality.
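As a rough illustration of what such markup can look like, the sketch below wraps text in a prosody element in the style of the W3C Speech Synthesis Markup Language (SSML). The interview does not name a markup language, so the choice of SSML is an assumption, and the helper name `mark_up` is hypothetical. The output is simplified (a full SSML document also carries version and namespace attributes), and voice-quality controls such as “calm” tend to be vendor-specific, so they are left out.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_up(text, rate=None, pitch=None):
    """Wrap text in a simplified SSML-style <prosody> element.

    `rate` and `pitch` are passed through as attribute values,
    e.g. rate="slow", pitch="high". This helper is hypothetical.
    """
    attrs = []
    if rate is not None:
        attrs.append(f"rate={quoteattr(rate)}")
    if pitch is not None:
        attrs.append(f"pitch={quoteattr(pitch)}")

    # Escape &, < and > so the text cannot break the markup.
    body = escape(text)
    if attrs:
        body = f"<prosody {' '.join(attrs)}>{body}</prosody>"
    return f"<speak>{body}</speak>"


print(mark_up("Please stay calm & wait.", rate="slow", pitch="high"))
```

The text says what to speak; the attributes say how to speak it, which is the separation the answer above describes.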

To make voices better, a common approach is to record a speaker then use their voice to build the system. The artificial system will sound like this source speaker because it has used that person’s speech data to create the models and build new utterances. To see how well you are doing you can simply compare your artificial utterance with the original. Does it sound as good? Does it sound as natural? Does it sound the same?

But this means that we stop thinking about what the voice is going to be used for and how we are expecting to interact with it. The artificial system is being designed without any concern for user experience. This is what we term the ‘mimicry objective’: if it sounds like the original speaker, we are done. We have finished our work, and how the voice is used and deployed is a separate problem.

This makes it easy for speech synthesis engineers to evaluate their work, it makes it easy to deploy machine learning approaches (we are trying to copy data we have already collected), and the human voice is something users understand immediately, so it can be used effectively in systems. However, it also means that, as engineers, we avoid the difficult questions about how this technology is used.
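To make the mimicry objective concrete, here is a toy sketch of “compare your artificial utterance with the original” as a single number. It assumes each utterance has already been reduced to a pitch contour, a list of pitch values in Hz sampled at the same time points; that representation, the function name `contour_rmse` and the sample values are all assumptions made for illustration. Real evaluations rely on listening tests or richer acoustic measures.

```python
import math


def contour_rmse(synth_hz, source_hz):
    """Root-mean-square difference between two equal-length pitch contours (Hz)."""
    if len(synth_hz) != len(source_hz) or not source_hz:
        raise ValueError("contours must be non-empty and the same length")
    squared = [(s - r) ** 2 for s, r in zip(synth_hz, source_hz)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up contours: the source speaker and a synthesized copy.
source = [120.0, 125.0, 130.0, 128.0]
synth = [121.0, 124.0, 133.0, 128.0]

# Lower means closer to the source speaker; 0.0 would be a perfect copy.
print(round(contour_rmse(synth, source), 2))
```

A score like this is easy to compute and easy to optimize, but nothing in it says whether the voice suits the application it will be used in.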

What are the limitations of the mimicry objective?

The ‘mimicry objective’ is easy to pursue but also presents some challenges. Without being concerned about the context a system is used in, it is hard to design it well. By pursuing naturalness ‘blindly’, we lose the benefits of “not being real”. For example, non-natural systems are perceived as non-judgmental, and a non-natural voice makes it clear we are communicating with a computer and not a person.

Finally, mimicry is creepy. The so-called “uncanny valley” is often quoted when an artificial system is close enough to a real system to make it feel weird and disconcerting. The question we must ask ourselves is whether mimicking human voices is a good design objective for creating a positive user experience.

In fact, there is not much evidence it is a good design approach. So, what should we do instead? Designing voices in context is a good starting point. Thinking about what the voice interaction is expected to achieve can be used to help design a speech synthesis system before a voice is created.

Finally, we can also consider how human speakers change their voice for specific purposes, such as in a dramatic performance. When Alec Baldwin satirizes Donald Trump, he doesn’t just copy his voice; in fact, his copy is not very close to Trump’s normal voice. Rather, it is about what you want to communicate and how to get there. Mimicry is an important element of this process, but naturalness is not necessarily the final objective.

Your paper states that mimicry in speech synthesis engines and dialogue systems can be used for evil. Couldn’t that be said of all synthesized speech?

Good artificial natural-sounding mimicked speech can be used to deceive people. But that isn’t necessarily the case; it has to be linked to a desire to deceive. For example, there is a big difference between Alec Baldwin mimicking Donald Trump, and someone impersonating a voice in order to call friends and colleagues and extract confidential information. It is not new either, as very good voice artists can deceive people listening to them if they wish to do so.

But as with “fake news”, there is a lot of demand for technologies that can be used to deceive. Speech synthesis can be used for unethical purposes, and always could be, but highly natural mimicked systems can also pretend to be human. For example, such a system could cold-call ten thousand people at once, pretending to be someone working at your bank and asking for PINs and login details. Sadly, the scope for unethical applications is greater with modern speech synthesis, but none of this is possible without an unethical human being behind it.

What is the difference between vocal performance and mimicry in speech synthesis? What will that mean for human-computer interaction?

If we consider a human actor’s vocal performance, we find a set of tools and techniques actors use to create a performance.

– Choosing a speaking style: For example, choosing a tense, aggressive speech style to create a tense and aggressive character.

– Expressivity: Being able to alter speech to reflect the content, using emphasis, emotion and timing to enhance the communicative nature of the basic text.

– Interpretation: Using speech style and expressiveness to creatively modify the underlying text to support the actor’s interpretation of a story and character.

The use of these techniques is often super-natural, in that they reflect natural speech but “more so”: for example, over-emphasis of important text, or pauses which are longer or shorter than you would find in natural conversations. For speech synthesis as performance, these features (voice quality and expressiveness) become required functionality, together with controls that allow an application to direct the interpretation.

With speech synthesis, we also have an additional technique sometimes applied to actors in films: the use of audio post-processing to give a voice a robotic sound. This is not the same as a poor speech synthesis system, and inspiration for this technique abounds in the science fiction and fantasy genres. But the key issue is the process of evaluation: rather than asking if the voice is natural, we should measure user engagement and the impact of the interaction.

In your experience and opinion, what is needed to move from designing speech synthesis systems for applications like Google and Siri to a performance-based system? What will the difference mean?

We are missing a whole generation of designers who are familiar and comfortable with designing speech-based applications and services. Until recently, design and user experience practitioners avoided speech technology, preferring to focus on visual and graphic interaction design.

This is changing. As designers become more familiar with the tropes and styles of voice user interaction, they will also begin disrupting the traditional design of personal assistants. As these systems become more applied and focused on specific services and applications, designers will start to demand more performance-ready voices and systems that can create a performance interpretation. Then rather than a system phoning a hairdresser and pretending to be human, we will have systems that are artificial and proud of it.


Jordan Hussain
