Create an AI that speaks human with these five essentials
The spoken word has been around for roughly 100,000 years; the written word, a little over 5,000. As humans, we’ve millennia more practice with—and perhaps a greater affinity to—what’s being said and how we say things than words on a page or screen.
There’s a reason important interactions are usually saved for face-to-face conversation: combining voice, tone of voice and body language creates a greater emotional connection and impact.
And so, there’s a huge difference between writing for the ears and writing for the eyes.
For example, if I were to speak this sentence, I wouldn’t start it with “for example”. I’m unlikely to break out a word like “affinity” in the opening of this article, either. I probably wouldn’t open my verbal conversation with anything other than “hi”, to be honest.
The differences between writing for the ear and writing for the eye aren’t usually an issue, and they’re not something most people will even recognise, unless you start writing for conversational AI, such as when you develop your chatbot into a fully fledged digital human.
My work at UneeQ has included writing natural language processing (NLP) which acts as a dialogue tree for digital humans. And “natural” is an important word there. If a digital human doesn’t sound natural, it’s unlikely to convey the emotional connection and conversational experience you want for your business.
So, how do you do that? Here are some key findings, a selection of expert analyses… some things to keep in mind.
What are the benefits of voice (when done well)?
It’s often more concise: normal conversation involves shorter sentences and simpler language.
It helps to simplify complexity. Or, as Deloitte puts it: “advanced voice capabilities allow interaction with complex systems in natural, nuanced conversations”.
It’s more personable and can create a more emotional connection than plain text (there’s more character in voice).
It’s better as a brand experience. Great companies have their AI voice as part of their brand (Siri is the biggest example). You can’t add as much “brand” with just text.
One of the biggest benefits of using voice communication instead of (or as well as) text comes from how the audience best takes in information. According to one widely cited estimate, when people express themselves, the actual words they say make up only 7% of the emotional impact. The tone in which those words are said (38%) and the facial expressions used when saying them (55%) matter far more.
It all goes towards recreating a more human experience, which many people want more of in a digital world. In a recent study on customer expectations, Gladly found that “[customers now] want the same warmth and seamless experience they expect with human support in their automated support, too.”
So, your communications with customers can’t be robotic and stilted, but many are. Gladly also found that 69% of customers say they’re being treated like a case number, not a human.
Having a human interaction is impossible when businesses write for voice but still use the same language they do for text.
Hopefully that explains the “why”, and you’re now all in and ready to hear the “how”. Fortunately, how you write for the ear is actually quite simple—in many ways, much simpler than writing for the eye.
How to build an AI that speaks human:
As someone who’s spent lots of time writing for digital human speech, and learning from customer conversations, here are five top tips:
1. Use short sentences and statements
In written language, studies show that when a sentence is 14 words in length, the average reader understands more than 90% of it. At 43 words, that drops to less than 10%. It’s even more difficult in speech, where you can’t just re-read what’s being said.
That’s why we naturally tend to use short sentences in speech. With regular micro-breaks in between sentences, the listener gets a couple of seconds to process and comprehend what’s being said.
That also goes for the response as a whole. Human-like conversation is rarely one way, and it’s less engaging when it is. So, if you’ve written a response that’s 300 words, you’ll likely find that your listener can’t process it all in one go.
I’d recommend keeping responses well under 100 words, when possible. If it’s not possible, be prepared for:
Your customers to ask the digital human to repeat certain parts of their dialogue, so you’ll have to have this as part of your scripting.
Your customers to interrupt the digital human mid-conversation at the cost of missing potentially important info.
Your customers to get disengaged and leave the conversation.
A Plan B: like using on-screen visuals to complement the conversation or links to parts of your website that explain more.
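If you’d like to check these lengths before a script goes anywhere near your NLP, here’s a minimal sketch in Python. It assumes a cap of 100 words per response and 14 words per sentence, both borrowed from the figures above. The function name `check_response` and the simple sentence splitting are my own illustrative assumptions, not part of any particular platform.

```python
import re

MAX_RESPONSE_WORDS = 100  # assumed cap: "well under 100 words" per response
MAX_SENTENCE_WORDS = 14   # assumed cap: the sentence length readers follow easily


def check_response(text, max_response=MAX_RESPONSE_WORDS,
                   max_sentence=MAX_SENTENCE_WORDS):
    """Return a list of warnings for one spoken response."""
    warnings = []
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    total_words = len(text.split())
    if total_words > max_response:
        warnings.append(
            f"response has {total_words} words (limit {max_response})")
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_sentence:
            warnings.append(
                f"sentence has {count} words: {sentence[:40]}...")
    return warnings
```

Treat the warnings as prompts to go back and trim, not as hard failures. Some responses will need to run long, and that’s when the Plan B above comes in.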
2. Use short, everyday words
Supercalifragilisticexpialidocious—just because the sound of it is something quite atrocious—is terrible for anything but song. Don’t take a leaf out of Mary Poppins’ book, trying to impress her peers with big words no one knows; instead aim for simple language everyone uses.
For one, you want your audience to spend as little effort as possible comprehending what’s being said. Short, everyday words are more accessible, and studies show that the effort required to comprehend speech is significantly greater for older people or those with a hearing impairment.
The use of common words is important, too. More research shows that knowing 2,000 words gives 80% coverage of written text but 96% coverage of informal speech.
When it comes to word length, bigger is rarely better. Words with three syllables or fewer are generally your best bet in spoken word. Four and more syllables are passable when you have no other options, like if you’re naming a certain product or location, or when no other word will do.
Perhaps the most famous and best speech of all time, by Martin Luther King Jr., demonstrates these rules to a tee. It’s likely more impactful because of the natural clarity of his everyday vocabulary and the low effort needed to process his shorter words.
“I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character...”
Long story short: be like Martin, not like Mary.
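The three-syllable guideline can be roughly checked in code, too. Here’s a small Python sketch that flags words likely to have four or more syllables. Be warned that the syllable count is a crude approximation (it counts groups of vowels and ignores a trailing silent “e”), so it will misjudge some words. The names `count_syllables` and `long_words` are hypothetical, chosen just for this example.

```python
import re


def count_syllables(word):
    """Rough estimate only: count groups of consecutive vowels."""
    word = re.sub(r"[^a-z]", "", word.lower())
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent, except in "-le" endings like "little".
    if count > 1 and word.endswith("e") and not word.endswith("le"):
        count -= 1
    return max(count, 1)


def long_words(text, max_syllables=3):
    """Return the words whose estimated syllable count exceeds the limit."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if count_syllables(w) > max_syllables]
```

Run over the quote from Martin, this returns an empty list. Run over Mary’s favourite word, it doesn’t. Product names and places will show up too, so read the flagged list with a human eye.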
3. Don’t fear contractions
I have always said that you should not fear contractions. Does that scan well to read in text? It’s even worse if you read it out loud.
Contractions like she’s (she is), there’s (there is) and wouldn’t (would not) help speech flow more naturally.
Whether you’re writing for a text-only chatbot or for a digital human who will speak to your customers, contractions stop your words coming across as robotic. There’s not much more to it than that.
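If you have a lot of existing text-first copy to convert, a simple find-and-replace pass can do some of the work. The Python sketch below is illustrative only: the `CONTRACTIONS` mapping is a short, hand-picked list I’ve assumed for the example, and a real script would need a much longer one.

```python
import re

# Small illustrative mapping (an assumption); extend it for real scripts.
CONTRACTIONS = {
    "she is": "she’s",
    "there is": "there’s",
    "it is": "it’s",
    "I have": "I’ve",
    "would not": "wouldn’t",
    "should not": "shouldn’t",
    "do not": "don’t",
}

LOOKUP = {key.lower(): value for key, value in CONTRACTIONS.items()}
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def contract(text):
    """Swap expanded forms for contractions, keeping a leading capital."""
    def swap(match):
        found = match.group(0)
        replacement = LOOKUP[found.lower()]
        if found[0].isupper():
            replacement = replacement[0].upper() + replacement[1:]
        return replacement
    return PATTERN.sub(swap, text)
```

A blunt replace like this will sometimes get it wrong (“Yes, it is.” shouldn’t become “Yes, it’s.”), so treat the output as a first draft and check it by ear.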
4. Get to the point
Remember at the start of this guide when I started talking about the histories of the spoken and the written word? I can get away with that in text, but if I were to start a conversation with my wife in the same way, she’d be more confused than engaged.
So get to the point. You don’t need a hook when you’re writing for the ears because voice is already more engaging. Just give people what they need.
Apparently, eight out of 10 people will read an article headline, but only two out of 10 will read the rest (so thanks if you’re reading this part). In speech, you have about a minute before people start tuning in or out.
But you shouldn’t take their engagement for granted. If a digital human helps a customer from start to finish within that minute, even better.
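To get a feel for whether a piece of dialogue fits inside that minute, you can estimate how long it takes to say. This Python sketch assumes a speaking rate of 150 words per minute, which is my own ballpark assumption and not a figure from the research above. Your digital human’s voice may be faster or slower, so adjust `words_per_minute` to suit. Both function names are hypothetical.

```python
def estimated_seconds(text, words_per_minute=150):
    """Estimate speaking time; 150 wpm is an assumed, adjustable rate."""
    words = len(text.split())
    return words * 60 / words_per_minute


def fits_in_window(text, limit_seconds=60, words_per_minute=150):
    """True if the text can likely be spoken within the time limit."""
    return estimated_seconds(text, words_per_minute) <= limit_seconds
```

At the assumed rate, the 100-word response limit from tip one comes out at about 40 seconds of speech.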
5. Read it out loud
Simple and effective. You often don’t know how something will sound until it’s read aloud. Here you’ll understand the rhythm and stresses of the words and sentences you’re saying. If something sounds a little robotic, you’ll soon see why.
For your most important interactions, you can even run through the ideal “happy path” dialogue with another person, to make sure it comes across as conversational as possible.
Simply script your ideal conversation from start to finish, as it will be put into your NLP, and read through it with another person.