Overview

This page explains how to create characters and clone voices, and offers usage tips.

To clone a voice instantly, create a character and upload or record a short audio sample for it. For professional voice cloning, provide 1 to 60 minutes of audio samples; training completes within 3 to 60 minutes.

You can then assign these characters to different passages of text in speech synthesis, and the AI will read each passage in the assigned character's voice.

Currently, you can create a character through the "Add Character" button on the Character Management page, or use the "Quick Create New Character..." button in the bottom left corner of the Speech Generation page.

Instant Cloning

Instant cloning lets you clone a voice almost immediately from a very short sample. Note that instant cloning does not create or train a new model from the sample you provide. Instead, the AI infers and imitates the voice based on the large body of data it has already learned. Our model has been trained on a large amount of ordinary speech, so it should work well for most natural speech.

However, our model is still imperfect. If the voice in your sample is unusual and our AI has not learned similar voices, the generated speech may be poor or may not reproduce the voice well. For details on each of our models, including their shortcomings and limitations, please refer to Model Introduction.

Sample quality matters more than length. Noisy samples may produce poor results, so please provide high-quality audio whenever possible. Currently, a sample must be longer than 2 seconds and the file must not exceed 10 MB. You can use the voice separation, audio noise reduction, voice beautification, and loudness standardization functions of the CapCut PC version to obtain a high-quality voice sample from any audio.
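
As a rough illustration of the length and size limits above, the following sketch checks a local WAV file before you upload it. It is not part of Vocu; the function name `check_instant_clone_sample` is hypothetical, the size limit is read as 10 × 1024 × 1024 bytes (an assumption about the exact unit), and only uncompressed WAV files are handled because the sketch relies on Python's standard `wave` module.

```python
import os
import wave

MIN_DURATION_SECONDS = 2.0               # from the text: longer than 2 seconds
MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024   # assumption: the limit means 10 MiB


def check_instant_clone_sample(path):
    """Return a list of problems; an empty list means the sample passes."""
    problems = []

    # File size check against the assumed byte limit.
    size = os.path.getsize(path)
    if size > MAX_FILE_SIZE_BYTES:
        problems.append(f"file is {size} bytes, above the {MAX_FILE_SIZE_BYTES}-byte limit")

    # Duration = number of frames / frames per second (WAV only).
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if duration <= MIN_DURATION_SECONDS:
        problems.append(f"sample lasts {duration:.2f} s, must be longer than {MIN_DURATION_SECONDS} s")

    return problems
```

Calling `check_instant_clone_sample("my_voice.wav")` returns an empty list when both limits are met. The check says nothing about noise or recording quality, which the text above identifies as the more important factor.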

Professional Cloning

With professional voice cloning, you provide at least one minute (and up to 60 minutes) of voice samples. Within 3 to 60 minutes, our AI trains in depth on the samples, learning details such as tone, pronunciation, rhythm, and prosody. The result is top-level cloning synthesis that is indistinguishable from the original voice, while retaining all the advanced features of the Vocu large speech model, such as language understanding and emotional expressiveness.
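
The sample-length range for professional cloning can be expressed as a simple check. This is a minimal sketch, not Vocu's own validation: the function name `professional_sample_status` is hypothetical, and it assumes that the durations of several clips are added together and that both ends of the 1 to 60 minute range are inclusive.

```python
PRO_MIN_SECONDS = 60         # from the text: one minute or longer
PRO_MAX_SECONDS = 60 * 60    # from the text: up to 60 minutes


def professional_sample_status(clip_seconds):
    """Classify the combined length of the clips (durations in seconds)."""
    total = sum(clip_seconds)  # assumption: clips count toward one total
    if total < PRO_MIN_SECONDS:
        return "too_short"
    if total > PRO_MAX_SECONDS:
        return "too_long"
    return "ok"
```

For example, two clips of 30 and 20 seconds total 50 seconds and are classified as `"too_short"`, while a single 5-minute clip is `"ok"`.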
