I’ve seen a lot of buzz about Lyrebird, a potentially revolutionary AI startup that claims to analyze a person’s voice DNA and mimic them saying anything using only a 1-minute audio sample. The really cool demo below recognizably spoofs Barack Obama, Hillary Clinton, and Donald Trump:
Before we get too excited… (too late!)
So here’s a problem that no one is talking about: there is no such thing as “voice DNA.” I felt dirty using it in the opening paragraph, but it was to help you realize how easy it is to be led astray. “Voice DNA” is a completely made-up term. There are no scientific studies about voice DNA, and there doesn’t even appear to be a consensus on what the fundamental components of the human voice are. The problem with comparing what Lyrebird does to DNA is that we (i.e., regular folks) are prone to wrongly imbue the technology with the scientific rigor and hard-earned clout of decades’ worth of genetics research, when, in reality, voice DNA is not a thing.
Allow me to simultaneously coin and avoid using the diminutive term “Liarbird” (ooohhh, snap!). I don’t believe that Lyrebird explicitly intended for “voice DNA” to deceive anyone, but rather the term is an imperfect vehicle for explaining their process, which is roughly as follows:
- Analysis of a vocal sample identifies a set of distinctive features of the sample (unclear if this analysis uses AI, but probably).
- These features provide a set of parameters used by an AI-powered text-to-speech process to generate new sounds.
- Because the newly generated speech shares some features of the original sample, it will share some recognizable similarities with the voice from the sample.
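To make those three steps concrete, here is a deliberately crude toy sketch. It is not Lyrebird’s method (which isn’t public in any detail I can vouch for); every name in it (`toy_voice`, `extract_features`, `synthesize`) is hypothetical. The “voice” is just a tone with one overtone, the “analysis” is a zero-crossing pitch estimate plus a loudness measurement, and the “text-to-speech” is a plain sine wave built from those two numbers. The point is only that a generator fed a handful of features can match the sample on those features while still differing from it everywhere else.

```python
import math

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this toy)


def toy_voice(pitch_hz, seconds, amplitude=0.5):
    """Stand-in for a recorded voice: a tone plus one weaker overtone."""
    n = int(SAMPLE_RATE * seconds)
    out = []
    for i in range(n):
        phase = 2 * math.pi * pitch_hz * i / SAMPLE_RATE
        out.append(amplitude * (math.sin(phase) + 0.3 * math.sin(2 * phase)))
    return out


def extract_features(samples):
    """Step 1: boil the sample down to a few numbers (here, only two)."""
    # Count upward zero crossings to get a rough pitch estimate.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    duration = len(samples) / SAMPLE_RATE
    # Root-mean-square amplitude as a crude loudness measure.
    loudness = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {"pitch_hz": crossings / duration, "loudness": loudness}


def synthesize(features, seconds):
    """Step 2: generate brand-new audio using nothing but the features."""
    n = int(SAMPLE_RATE * seconds)
    peak = features["loudness"] * math.sqrt(2)  # sine peak for that RMS
    step = 2 * math.pi * features["pitch_hz"] / SAMPLE_RATE
    return [peak * math.sin(step * i) for i in range(n)]


# Step 3: compare the imitation with the original.
original = toy_voice(120, 1.0)
features = extract_features(original)
imitation = synthesize(features, 1.0)
imitation_features = extract_features(imitation)

# Largest sample-by-sample difference between the two waveforms.
largest_gap = max(abs(a - b) for a, b in zip(original, imitation))

print("original features: ", features)
print("imitation features:", imitation_features)
print("largest waveform gap:", largest_gap)
```

The two feature sets come out close to each other, yet the waveforms themselves are clearly different: the overtone is gone and the pitch estimate is slightly off. That gap between “matches on the features we measured” and “is the same sound” is exactly where the DNA analogy gets into trouble.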
The universality and completeness of these features are where the DNA analogy falls tragically short, and it bears a lesson in approximation that I’ll save for another day. In truth, our common concept of “voice” itself is a substantial simplification of reality, and there is a vast, cavernous gulf between a recognizable imitation and an indistinguishable one.
Voice uniqueness, human vocal capacity, and Carson Daly
We take for granted that each of us has a much larger vocal range than we use. Some people do realize this fact and leverage it into a career by doing hilarious impressions. While tech blogs suddenly seem very worried that a new AI might be used to hijack President Trump’s voice to generate #fakenews, they appear utterly unconcerned about the fact that there are thousands of human beings out there who can already do a better Trump impression than an AI ever will! They shouldn’t be concerned about the former for the same reasons that they aren’t concerned about the latter.
There are many cues that we listen for to recognize a voice. I’m an amateur vocal impressionist myself – just the other night I caught a mixed look of confusion and horror on my wife’s face when she noticed me repeating back every word Carson Daly was saying on The Voice under my breath. (By the way, I promise I wasn’t aware of the coincidence in the title of the show until LATE in the editing process!) I do this imitation exercise unconsciously whenever I hear something distinctive in a person’s speaking style that’s easy for me to copy.
Carson Daly, by the way, speaks out of the right side of his mouth, has an unusual squawky quality to his voice, and often delivers his practiced lines with a pattern of high pitch at the beginning which fades to a low pitch at the end, with the last few words being slow and staccato.
I carefully pointed out that this is how he delivers his practiced lines – these observations only describe his voice in a very limited context. Suffice it to say that the Carson Daly impression that I have refined from watching The Voice has only a limited utility and range because of the limited sample from which I’m taking it. This brings us to a basic principle of AI: you are only as good as your data.
The situational dependence of Carson’s voice applies to everyone. We all have what we like to think of as our “regular speaking voice,” but the truth is that this is an illusion that helps us understand the world. The way we speak is (1) not set in stone (unlike DNA) and (2) changes by context – where we are, who we are speaking to, and many other factors (again unlike DNA).
Neat facts about your voice
Let me tell you an awesome secret: You can change your regular speaking voice with practice, and there are whole books devoted to helping you do that. You can change your accent, a common (and commonly achieved) goal of people learning a foreign language. You already have different voices that you use for special situations, like a parent’s discipline voice or the heavily projected voice one uses when speaking in public. A good public speaker will purposefully vary their vocal patterns with intonations, pauses, and intensities that are wildly different than how they would talk to a friend on the phone. The entire career of professional singers is predicated on making their voice do new things that it couldn’t previously do!
This should not be surprising! Your voice is generated by moving muscles in your body to manipulate airflow. As with any other muscles, you can exert conscious control over them, to the point that training actually changes them. Your voice also undergoes natural variations as you age. Exactly what part of this incredibly broad spectrum of variations is actually your voice? In some sense, we are inclined to say all of them, while on the other hand the answer may just as well be none of them. (Mental Floss has a good piece on what makes your voice sound the way it does, by the way.)
Let me tell you another awesome secret about your voice: you cannot say the same phrase exactly the same way twice, even if you want to! You can get close enough to fool a human ear, but, in truth, the complexities of the physical universe dictate that you will never be capable of exactly replicating a specific pattern of your own voice. In this (admittedly weird) sense, even you can only approximately sound like yourself!
Circling back to Lyrebird
Let’s condense some of what we’ve covered:
- Your voice emerges from habit – it is not a fixed, unchanging thing like DNA.
- Your voice changes dramatically with social context.
- Your voice frequently varies by conscious choice.
- Your voice varies with age.
- It is not at all intuitive to believe that there is such a thing as “vocal DNA” that can be used to copy someone’s voice with the precision and definiteness of actual DNA.
With all of this swirling in your mind, feel free to listen back to Lyrebird’s advertisement again. Notice how much the Obama voice sounds like he is giving a campaign speech, and realize that the whole reason you recognize that as “sounding like Obama” is that he has a very distinctive way of delivering speeches. It is an easy style to copy because it is so deliberate and unusual – people don’t normally talk that way!
And voila, by getting a small number of factors right (timing, mainly), a perceived similarity is created between the artificial and the actual. While we’re at it, I might as well talk a bit about…
Perception and the boundaries between real and fake speech
We are naturally hyper-primed to recognize a familiar voice even through heavy distortion. Part of the magic of the human brain is that we are great pattern recognizers – once we perceive a meaningful signal, we can often find it even if it’s not there. Here is an AWESOME demonstration courtesy of Gizmodo:
There is a hard limit to how good an imitation of a specific human voice can get. The most accurate approximation of a human voice will come from a real human mouth and vocal cords. Even if you play back a recording of a voice, it will sound different from the actual voice because of information loss due to (1) the method used to encode/decode the audio and (2) the capabilities of the hardware it is played back on (think about theater surround sound vs. free airplane headphones).
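If “information loss” from encoding sounds abstract, here is a small sketch of one source of it. It assumes the simplest possible encoder, one that just rounds each sample to a fixed number of levels; real audio codecs are far more sophisticated. The names `quantize` and `max_error` are made up for this illustration.

```python
import math


def quantize(samples, bits):
    """Round samples in [-1, 1] to the nearest level a `bits`-bit encoder keeps."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 steps per side for 8 bits
    return [round(s * levels) / levels for s in samples]


def max_error(original, decoded):
    """Largest difference between what went in and what came back out."""
    return max(abs(a - b) for a, b in zip(original, decoded))


# A short 220 Hz tone sampled at 8000 Hz stands in for "the actual voice".
tone = [math.sin(2 * math.pi * 220 * i / 8000) for i in range(800)]

err_8bit = max_error(tone, quantize(tone, 8))
err_4bit = max_error(tone, quantize(tone, 4))

print("worst-case error at 8 bits:", err_8bit)
print("worst-case error at 4 bits:", err_4bit)
```

Neither error is zero, and the coarser encoding loses more. A recording is already an approximation of the voice before any imitation even starts.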
Our ears are not the only tools we have for validating the authenticity of a voice. Are we entering a new era where, just as we can’t trust our eyes because of Photoshop, we can’t trust our ears because of Lyrebird? Maybe, but the fact of the matter is that, despite just how good Photoshop is after 30 years of development, we still have ways to determine whether a picture has been edited at all, let alone completely faked. Any advance in technology leads to a commensurate advance in countermeasures, which are sometimes exceedingly simple.
What’s being reported
Fine, I’ll admit that my title was, ironically, deceptive. I don’t think that people are knowingly lying about Lyrebird. Rather, I graciously assume that they’ve innocently and unwittingly used imprecise language that exaggerates its capabilities. And maybe on top of that they’ve also reached false conclusions by reasoning from poor assumptions about what AI is and how this stuff works. And maybe also they’re so enraptured with the notion of a Skynet-like AI-pocalypse that they can’t separate fantasy from reality.
Whatever the reason, we deserve journalism that does better than leading with that scene from Terminator where Arnold uses perfect voice mimicking to gain someone’s trust so he can murder them. And also journalism that doesn’t claim that Lyrebird can “steal your voice” to make you say anything. And that doesn’t fear-monger you with the zeitgeist of #fakenews. And that doesn’t insist that we naive humans will be completely helpless to combat this emerging all-powerful technology. All of these AI-reporting sins (and more!) can be found in the top Google results for “Lyrebird AI.”