The AI releases of the past year suggest that it is not only low-skill jobs that AI is coming for. If you are an artist, you should be worried, especially if you are a voice artist. A recently published research paper from Microsoft details VALL-E, an AI model that can reproduce anyone's voice from just a three-second voice sample.
Previously, we reported that the Chinese company Tencent Music has been using AI voices to release songs in the voices of real artists. Tencent claims it mostly uses its AI engine to produce songs in the voices of legendary singers who have died, but it's quite possible the engine will become an alternative to human singers for Tencent in the future. After all, no record label in the world would want to spend millions of dollars on human singers if software can do the same job for free.
Apart from being a major software company, Microsoft is also one of the world's leading gaming companies, and it is in the process of acquiring Activision Blizzard for over $68 billion. If the deal closes, it will be the biggest video game acquisition in history. So what is the connection between Tencent Music's AI engine, Microsoft's gaming business, and VALL-E?
VALL-E will raise AI’s voice
Microsoft's revenue from gaming stood at a whopping $16.23 billion in 2022 alone. The company has released some of the biggest game franchises, including Gears of War and Halo, and it spends a lot of money on the actors who voice the characters in these games.
Unlike Tencent, Microsoft doesn't need to hire singers, but it does hire many voice actors. There is no official figure for how much it spends on them, but the number is surely large given the company's mammoth gaming revenue. It is only an assumption, but it seems possible that, like Tencent, Microsoft is planning to use AI to voice its games in the future.
There could be various other reasons why Microsoft is working on VALL-E. To understand them, let's first look at what VALL-E actually is.
VALL-E is a neural codec language model that can mimic a human voice along with the emotional tone that accompanies it. It is not ordinary voice-synthesis software: in addition to the voice itself, it captures the specific style in which the speaker talks, and all it needs to do so is a three-second voice sample of that speaker.
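Conceptually, the flow works like this: the short enrollment clip is converted into a sequence of discrete codec codes, the model then extends that code sequence conditioned on the target text, and a decoder turns the codes back into audio. The toy Python sketch below illustrates that pipeline only; every function is a made-up stand-in, not Microsoft's actual model.

```python
# Conceptual sketch of VALL-E's zero-shot flow. All three functions are
# toy stand-ins for illustration; the real system uses a neural codec
# (EnCodec) and a learned autoregressive language model over codes.

def encode_to_codes(audio_samples):
    """Stand-in for a neural codec encoder: raw audio -> discrete code IDs."""
    return [int(abs(s) * 10) % 1024 for s in audio_samples]

def synthesize(text, prompt_codes, steps=5):
    """Stand-in for the codec language model: extends the prompt's code
    sequence, conditioned on the target text."""
    codes = list(prompt_codes)
    for ch in text[:steps]:
        # A real model would sample each code from a learned distribution.
        codes.append((ord(ch) + codes[-1]) % 1024)
    return codes

def decode_to_audio(codes):
    """Stand-in for the codec decoder: codes back to a waveform."""
    return [c / 1024.0 for c in codes]

# The "three-second sample" is just a short list of amplitudes here.
prompt = encode_to_codes([0.1, -0.4, 0.7, 0.2])
out = decode_to_audio(synthesize("hello", prompt))
print(len(out))  # prompt codes plus one new code per synthesized step
```

The key idea this sketch captures is that the speaker's voice is never copied directly; it conditions the continuation of a token sequence, the same way a text prompt conditions a language model.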
For example, imagine you have a friend, Carlos, who always sounds angry when he speaks. You are an animator who makes short animated films, and you need Carlos to voice a character in one of them. Unfortunately, Carlos also happens to be the friend who drinks a lot and makes a scene wherever he goes.
You want Carlos's voice, but you can't bring him into a studio. With access to a model like VALL-E, you could voice your character from just a three-second sample of Carlos (recorded even in a car); he would never need to set foot in the studio.
Imagine what a company like Microsoft could do with VALL-E. The team at Microsoft suggests that, once fully developed, VALL-E could be used for voice editing and high-quality text-to-speech applications. Beyond imitating the voice and emotional tone, the model can also reproduce the acoustic environment of the sample in its output.
If the input voice sample was taken from a tape recorder, the output sample from VALL-E will have the ambiance of a tape recorder. The authors of the VALL-E research paper wrote:
“VALL-E significantly outperforms the state-of-the-art zero-shot TTS (text-to-speech) system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”
Microsoft’s VALL-E can disrupt everything
A report from Ars Technica notes that VALL-E builds on EnCodec, a deep-learning-based audio codec that Meta released last year. EnCodec breaks a voice sample down into streams of discrete audio codes (compact tokens that represent the sound and can be modeled and manipulated much like text), and it is these code sequences that VALL-E's language model is trained to predict.
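The technique EnCodec uses to produce those discrete codes is residual vector quantization (RVQ): each stage snaps the signal to the nearest entry in a codebook, and the next stage quantizes whatever error is left over. The toy one-dimensional sketch below shows the idea; the codebooks are made up for illustration and bear no relation to EnCodec's learned ones.

```python
# Toy sketch of residual vector quantization (RVQ), the technique at the
# heart of neural codecs like EnCodec. Codebooks here are invented.

def nearest(codebook, value):
    """Index of the codebook entry closest to `value`."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

def rvq_encode(value, codebooks):
    """Quantize `value` through successive codebooks; return code indices."""
    codes, residual = [], value
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual -= cb[idx]      # the next stage refines the leftover error
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected entries to reconstruct an approximation."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

# Coarse-to-fine codebooks: each stage covers a smaller range.
books = [[-1.0, 0.0, 1.0], [-0.5, 0.0, 0.5], [-0.25, 0.0, 0.25]]
codes = rvq_encode(0.8, books)
approx = rvq_decode(codes, books)
print(codes, approx)  # [2, 1, 0] 0.75
```

Because each stage adds finer detail, a handful of small codebooks can represent audio compactly while keeping the reconstruction close to the original.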
Moreover, VALL-E was trained on LibriLight, an open-source audio library curated by Meta containing 60,000 hours of English speech from more than 7,000 speakers (drawn largely from LibriVox audiobooks). For now, Microsoft's model mimics a voice well only when it closely resembles the voices in that training data.
You can read about VALL-E and listen to some of its audio samples on GitHub. However, unlike DALL-E mini and ChatGPT, the program is not available for public use, given the serious implications audio deepfakes could have. Some people would love to send each other messages in the voices of politicians and celebrities, but criminals and scammers could also use VALL-E to sow chaos.
Then there is Microsoft itself, which obviously wouldn't want competitors using its AI voice model for free. The company might even have its own plans to shake up the gaming industry by using VALL-E as a voice actor in its games.
In the future, Microsoft might use this technology to provide gamers with the choice to use any voice they want for their character. Who knows — maybe you’d be able to make a game character sound like you using VALL-E.
The time has also come for voice actors to think about legally protecting the rights to their voices, because with a program like VALL-E they could be replaced at any point in the future. Believe it or not, the AI revolution has begun.
The preprint paper is available on arXiv.