Microsoft Won't Let You Use Its New AI Voice Tool

It's no secret that AI is getting pretty darn realistic: Companies like OpenAI are making tools that can replicate images, audio, and videos in ways that are becoming increasingly difficult to identify as such on the fly. But while it's bad enough that some of these programs are available to the public already, it's concerning to hear about a tool that's so good, it's being kept from the rest of us.

Vall-E 2 can steal your voice

As reported by TechSpot, Microsoft has created a new version of its "neural codec language model," Vall-E, appropriately now called Vall-E 2. Microsoft detailed Vall-E 2's advances in a blog post, highlighting some key milestones with this latest model. Chiefly, Vall-E 2 achieves "human parity," which seems to be a fancy way of saying, "Our model's outputs sound like real humans." Be afraid.

Vall-E 2 apparently achieves two key enhancements over Vall-E. First, the new model doesn't have the "infinite loop" issue the original had when processing repeating tokens: It accounts for repeating tokens, and is thus able to decode a sample that contains them. Second, Vall-E 2 shortens the length of a given sequence by grouping codec codes, which Microsoft says both increases inference speed and sidesteps issues that arise from modeling long sequences.
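To make the "grouping codec codes" idea a little more concrete, here is a minimal sketch in Python. It is not Microsoft's code: The function name `group_codes`, the padding value, the group size, and the sample numbers are all made up for illustration. It only shows the general idea that bundling consecutive codes together leaves the model with fewer steps to work through.

```python
def group_codes(codes, group_size, pad_token=-1):
    """Bundle a flat list of codec codes into consecutive fixed-size groups.

    Hypothetical helper for illustration only; pad_token is an assumed
    filler value used when the list does not divide evenly.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    padded = list(codes)
    remainder = len(padded) % group_size
    if remainder:
        # Fill out the last group so every group has the same size.
        padded += [pad_token] * (group_size - remainder)
    return [
        tuple(padded[i:i + group_size])
        for i in range(0, len(padded), group_size)
    ]


# Made-up codec codes, including some repeats.
codes = [17, 17, 42, 8, 8, 8, 103, 5, 61, 61]
grouped = group_codes(codes, group_size=2)

# Ten individual codes become five grouped steps.
print(len(codes), len(grouped))
```

With a group size of two, the sequence the model has to step through is half as long, which is the kind of shortening the blog post describes.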

If that's all a bit technical, perhaps this won't be: Vall-E 2 improves upon Vall-E in "speech robustness, naturalness, and speaker similarity," and, according to Microsoft, is the first of its class to achieve human parity in these categories. In fact, the company says, "VALL-E 2 can generate accurate, natural speech in the exact voice of the original speaker, comparable to human performance."

It's not just theory

You don't just have to read about Vall-E 2 to believe how good it is: Microsoft offers examples of how Vall-E 2 can take a sample recording of a voice, and replicate it when prompted with new text. The company also provided examples of the model completing a sentence after being given segments of a sample recording, in three, five, and 10-second chunks. This demonstrates the model's ability to take a very short example of a voice, and replicate it with text that doesn't appear in the original sample recording.

There are still plenty of the quirks you'd expect to find with any text-to-speech model (incorrect pronunciations, stuttered speech, etc.) but there's no doubt that the Vall-E 2 examples are not only often realistic, but match the voice of the original sample quite closely. It especially does well when given a longer recording of a voice: If given three seconds of a recording, the output is still impressive, but when given a five or, especially, a 10-second recording, the output can be remarkably realistic.

If you click through the examples yourself, check out how well Vall-E 2 matches the 10-second recording when reciting "My life has changed a lot" under "VCTK Samples." I don't have any experience with training AI systems, but to my ear, the model nails the raspy voice of the speaker in the sample, especially after receiving the full 10-second clip. It's jarring to hear the original speaker reading a certain sentence, then hear the model speak a new sentence in a voice that essentially matches the speaker's.

Vall-E 2's risks

But if you're a bit freaked out by this whole thing, you aren't alone. Microsoft is aware its model could be dangerous if used maliciously: In an ethics statement at the bottom of the post, the company acknowledges that, while Vall-E 2 could be used for a variety of positive tasks, it could also be used to impersonate a specific person. Microsoft says the model is meant to be used with consenting users who understand their voice is being replicated, and that the model should have a protocol to check for consent before processing a request. That said, it doesn't seem like such a protocol actually exists right now, which is likely why Microsoft currently has "no plans to incorporate VALL-E 2 into a product or expand access to the public."

The examples here are based on voice samples from the LibriSpeech and VCTK datasets, not on samples Microsoft recorded itself. As an outside observer, I can't tell how this model would actually perform if given recordings of, say, President Biden, Elon Musk, or your boss. However, if we assume that Vall-E 2 can generate a realistic output when given a 10-second sample, imagine how realistic its output could be when fed hours of samples. Couple that with a solid AI video model, and you have the perfect storm for generating misinformation, just in time for election seasons across the globe.