This AI Clones Your Voice After Listening for 5 Seconds 🤐

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér. Today we are going to listen to some amazing improvements in the area of AI-based voice cloning. For instance, if someone wanted to clone my voice, there are hours and hours of my voice recordings on YouTube and elsewhere, so they could do it with previously existing techniques. But the question today is: if we had even more advanced methods, how big of a sound sample would we really need? A few hours? A few minutes? The answer is no. Not at all. Hold on to your papers, because this new technique only requires 5 seconds. Let's listen to a couple of examples.

Absolutely incredible. The timbre of the voice is very similar, and it is able to synthesize sounds and consonants that have to be inferred, because they were not heard in the original voice sample. This requires a certain kind of intelligence, and quite a bit of that.

So, while we are at it, how does this new system work? Well, it requires three components.

One, the speaker encoder is a neural network that was trained on thousands and thousands of speakers and is meant to squeeze all this learned data into a compressed representation. In other words, it tries to learn the essence of human speech from many, many speakers. To clarify, this system listens to thousands of people talking to learn the intricacies of human speech, but this training step needs to be done only once; after that, it is given just 5 seconds of speech from a speaker it has never heard before, and the synthesis later takes place using these 5 seconds as its only input.

Two, we have a synthesizer that takes text as an input, which is what we would like our test subject to say, and gives us a mel spectrogram, a concise representation of someone's voice and intonation. The implementation of this module is based on DeepMind's Tacotron 2 technique, and here you can see examples of these mel spectrograms built for a male and two female speakers. On the left, we have the spectrograms of the reference recordings, the voice samples if you will, and on the right, we specify a piece of text that we would like the learning algorithm to utter, and it produces these corresponding synthesized spectrograms.

But eventually we would like to listen to something, and for that, we need a waveform as an output. So, the third element is a neural vocoder that does exactly that, and this component is implemented with DeepMind's WaveNet technique. This is the architecture that led to these amazing examples.

So how do we measure exactly how amazing it is? When we have a solution, evaluating it is also anything but trivial. In principle, we are looking for a result that is close to the recordings we have of the target person, but says something completely different, and all this in a natural manner. This naturalness and similarity can be measured, but we're not nearly done yet, because the problem gets even more difficult. For instance, it matters how we fit the three puzzle pieces together, and what data we train on, of course, also matters a great deal. Here you see that if we train on one dataset and test the results against a different one, and then swap the two, the results in naturalness and similarity differ significantly. The paper contains a very detailed evaluation section that explains how to deal with these difficulties. The mean opinion score is also measured in this section, which is a number that describes how well a sound sample would pass as genuine human speech. And we haven't even talked about the speaker verification part, so make sure to have a look at the paper.

So, indeed, we can clone each other's voices using a sample of only 5 seconds. What a time to be alive!

This episode has been supported by Weights & Biases. Weights & Biases provides tools to track your experiments in your deep learning projects. It can save you a ton of time and money in these projects and is being used by OpenAI, Toyota Research, Stanford and Berkeley. They also wrote a guide on the fundamentals of neural networks, where they explain in simple terms how to train a neural network properly, what the most common errors are, and how to fix them. It is really great, you've got to have a look. So make sure to visit them through wandb.com/papers or just click the link in the video description, and you can get a free demo today. Our thanks to Weights & Biases for helping us make better videos for you.

Thanks for watching and for your generous support, and I'll see you next time!
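To make the three components and the data flowing between them concrete, here is a toy, numpy-only sketch of the pipeline's shapes. None of this is the paper's implementation: the "networks" here are random projections and noise, and all dimensions (80 mel bins, a 256-dimensional embedding, 200 waveform samples per frame) are illustrative stand-ins. What it shows is the contract between the stages: any length of audio becomes one fixed-size speaker embedding, text plus that embedding becomes a mel spectrogram, and the vocoder turns the spectrogram into a waveform.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MELS, EMB_DIM = 80, 256

# Fixed random projection standing in for the trained encoder weights.
W_ENC = rng.standard_normal((N_MELS, EMB_DIM)) / np.sqrt(N_MELS)

def speaker_encoder(mel):
    """Collapse a variable-length utterance (frames x 80 mel bins) into one
    fixed-size speaker embedding: any amount of speech, even 5 seconds,
    maps to a single unit-length vector."""
    h = np.tanh(mel @ W_ENC)          # per-frame features
    e = h.mean(axis=0)                # average-pool over time
    return e / np.linalg.norm(e)      # L2-normalize

def synthesizer(text, speaker_emb):
    """Toy stand-in for the Tacotron-2-style stage: text plus a speaker
    embedding in, mel spectrogram out. Here: one frame per character,
    biased by the embedding so the 'voice' follows the enrollment sample."""
    base = rng.standard_normal((len(text), N_MELS))
    return base + speaker_emb[:N_MELS]

def vocoder(mel, hop=200):
    """Toy stand-in for the WaveNet vocoder: mel frames in, waveform out
    (here just noise shaped by each frame's energy, hop samples per frame)."""
    energy = np.abs(mel).mean(axis=1, keepdims=True)
    return (energy * rng.standard_normal((mel.shape[0], hop))).ravel()

# A made-up 5-second enrollment sample: 100 mel frames per second.
enrollment = rng.standard_normal((500, N_MELS))
emb = speaker_encoder(enrollment)                 # shape (256,)
mel = synthesizer("hold on to your papers", emb)  # shape (22, 80)
wav = vocoder(mel)                                # shape (4400,)
print(emb.shape, mel.shape, wav.shape)
```

Even in this sketch, the real system's key property survives: the enrollment audio influences synthesis only through that single embedding vector, which is why a 5-second sample can be enough.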

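For reference, the mean opinion score used in the evaluation is simply the average of listeners' 1-to-5 ratings of a sample, usually reported with a confidence interval. A minimal sketch with made-up ratings:

```python
import statistics as st

ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]  # hypothetical 1-5 listener scores
mos = st.mean(ratings)
# 95% confidence interval under a normal approximation of the rating mean.
ci95 = 1.96 * st.stdev(ratings) / len(ratings) ** 0.5
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```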
100 thoughts on “This AI Clones Your Voice After Listening for 5 Seconds 🤐”

  1. Deep fakes should be utterly banned and their use severely prosecuted; they have tremendous value in the entertainment industry but limited value everywhere else. This is creating much more harm than good at the moment; there is not a single case in which deep fakes have been used for anything good outside the entertainment industry. This should be treated on the same level as weapons of mass destruction.

  2. The AI also copied the artifacts from the microphone that was being used to record, which could become a serious problem if the intention is to use the voice in multiple settings/locations/environments.

  3. Sooo, if I make funny voices, will it copy them or auto-fix them to my real voice?
    THIS can be the key tech I need to make my own movies!

  4. Ironic and frightening how 'the Age of Information' didn't turn out to be 'all information freely available to everyone', but instead became 'NO information whatsoever is reliable anymore'. This is truly amazing and I understand how you get excited about this as a developer. It is also one more giant step towards the complete devaluation of information. We come to a point pretty soon where NO information is reliable anymore. Fake news, deep fake videos, voice cloning, junk science… it will be impossible to tell fake from what's true and real. We will have to rely on AI to tell the difference. Haha. And who will own the AI? Mass manipulation to an unprecedented extent is just around the corner. With no way to fight it. Hold on to your paper books, people. Especially the ones from the past century. Read them, save them. Buy an old encyclopedia and keep it as a treasure. There will come a time when 'old, analog information' will be essential to keep us free.

  5. I wonder if it can also handle really high-pitched or really deep voices at the far ends of the spectrum, or if those may be harder for the AI to learn and use.

  6. Imagine all the data social media has. Imagine machine learning AI with deepfakes and voice cloning. We can clone humans in machines. Technology needs some rules. If not, tech companies will be the new governments.

  7. Check out the project itself — there are a lot more audio samples when you look at the individual papers the researchers have produced. They record failures in red as well, where sometimes the AI will sound like it's having a stroke, or it quickly becomes demonic, but recognisably human.

  8. If this doesn't scare you with what horrors people are going to do with these tools, it should. Imagine someone hacking into a news broadcast and using a fake Trump speech declaring war, or anything that could change the world in a negative way, and we'd have everyone believe it.

  9. We already need masks to defeat facial recognition, now we need voice changers too? All a scammer has to do is record your voicemail greeting to train the AI. Then they call your relatives. It is frighteningly easy to find out all kinds of information about someone using only their name, city and state. You probably say your name on your voicemail and your location can be extrapolated from the area code of your phone number.

  10. A lot of bad things will probably come from it, but I'd love to see how it could be used to do things such as reading lines.

  11. I know of this happening to one of my mentors (even if the software/A.I. isn't exactly this same one but this video proves that this can happen to anyone). His voice got copied & falsified in giving misleading speeches & he's accused of teaching the wrong thing in my country. Your enemies can ruin you if they get ahold of clips of your voice, so never trust anyone with your sensitive information ever.

  12. I recently watched Project MC2 with my sister and the main role has a tablet with a voice assistant like Siri or Alexa, but she is called A.D.I.S.N and she can perfectly imitate voices of others. That would be so cool. I know some of you may not like that show, but please don't show hate. 🙂

  13. This looks so fake to me; why haven't they made it so we can try it ourselves on their website? And if they have, can someone give me a direct link?

    Also, Jeffrey Epstein didn't kill himself

  14. How is the near future gonna look when we can't trust anything anymore…
    I mean, only the normies believe the media these days, and the AI-simulated cases haven't even begun…

  15. This can be used for fake news or for instant game/film/series dub translation. The second application could save tons of money.

  16. What’s the purpose of doing this? This technology is dangerous. Especially when we have so many dumbass people in the world. This tech has the ability to drive our realities further apart.

  17. Well that's cool.

    Suggestion: After synthesizing, have the human also read the text without having heard the AI version yet, and present both for us to compare.

  18. 1:00 It's very clear here that the synthesized sample doesn't come close enough to make me even think that it's the same voice. It sounds very bad, actually.

  19. I'm Russian, sorry for my English. Did you try singing? Can it reproduce that? 😀 How computationally hard is it? Can you teach it to use emotions? And the last one: can you make an AI that takes your voice and remakes it into someone else's in real time?
