Important Notes
Precautions and best practices for voice cloning
Instant Cloning/Style Guide Sample Precautions
As noted in the Overview, if the voice sample you provide is highly distinctive and our AI has not learned similar voices before, the generated results may be poor or may not replicate the voice well.
For best results, we recommend 10-20 seconds of clear speech with no reverberation, echo, or background noise. For file quality, we recommend audio with a source bitrate of 128 kbps or higher so that it carries as much of the original information as possible.
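The recommendations above can be checked before uploading. The following is a minimal sketch, not part of any official SDK: the function names `wav_duration_seconds` and `check_instant_sample` are hypothetical, and the duration helper only reads WAV files through Python's standard `wave` module. The bitrate is passed in by the caller, since how you obtain it depends on the file format and tooling you use.

```python
import wave
from typing import List

def wav_duration_seconds(wav_file) -> float:
    """Return the duration in seconds of a WAV file (path or file-like object)."""
    with wave.open(wav_file, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def check_instant_sample(duration_s: float, bitrate_kbps: float) -> List[str]:
    """Compare a sample against the recommended 10-20 s and 128 kbps guidance.

    Returns a list of warnings; an empty list means both recommendations are met.
    These are recommendations from the text above, not hard limits.
    """
    warnings = []
    if not 10 <= duration_s <= 20:
        warnings.append(f"duration {duration_s:.1f}s is outside the recommended 10-20s")
    if bitrate_kbps < 128:
        warnings.append(f"bitrate {bitrate_kbps:g} kbps is below the recommended 128 kbps")
    return warnings
```

This check covers only duration and bitrate. Reverberation, echo, and background noise still need to be judged by listening to the sample.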
Professional Cloning Precautions
Before starting professional cloning, you need to prepare one or more audio sample files.
The audio sample files must meet the following requirements:
The total duration of all audio sample files combined must be at least 1 minute and at most 60 minutes; within this range, a longer total duration gives a better cloning result.
Each audio file must be in wav, mp3, mp4 (converting to an audio format first is recommended), flac, m4a, or ogg format.
Provide audio of the highest quality you can, and make sure it contains recognizable sentences (for supported languages, please refer to Model Introduction). Avoid heavy noise, multiple speakers, and other interference in the audio.
Once the audio sample files are ready, you can select them manually, drag them into the upload box, or package them into an unencrypted ZIP archive. The system will organize the sample files automatically. The total size of the uploaded files cannot exceed 256 MB.
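The requirements above (allowed formats, 1 to 60 minutes in total, 256 MB in total) can be checked locally before uploading. This is a rough sketch under stated assumptions: `check_sample_set` is a hypothetical helper, the durations and sizes are supplied by the caller, and 1 MB is assumed to mean 1024 × 1024 bytes, which the text above does not specify.

```python
from pathlib import Path
from typing import List, Tuple

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".mp4", ".flac", ".m4a", ".ogg"}
MIN_TOTAL_SECONDS = 60              # at least 1 minute in total
MAX_TOTAL_SECONDS = 60 * 60         # at most 60 minutes in total
MAX_TOTAL_BYTES = 256 * 1024 * 1024 # assumption: 1 MB = 1024 * 1024 bytes

def check_sample_set(samples: List[Tuple[str, float, int]]) -> List[str]:
    """Validate (filename, duration_seconds, size_bytes) entries as a set.

    Returns a list of problems; an empty list means the set passes these checks.
    """
    problems = []
    for name, _, _ in samples:
        if Path(name).suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported format")

    total_seconds = sum(duration for _, duration, _ in samples)
    total_bytes = sum(size for _, _, size in samples)

    if total_seconds < MIN_TOTAL_SECONDS:
        problems.append(f"total duration {total_seconds:.0f}s is under 1 minute")
    elif total_seconds > MAX_TOTAL_SECONDS:
        problems.append(f"total duration {total_seconds:.0f}s is over 60 minutes")
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"total size {total_bytes} bytes exceeds 256 MB")
    return problems
```

For example, `check_sample_set([("take1.wav", 90.0, 10_000_000)])` returns an empty list, while a single 30-second `.aiff` file is reported both for its format and for the short total duration.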
Comprehensive Precautions
Our AI voice model tries to imitate everything it hears in the audio: the speaker's tone, speed, accent, breathing, intensity, background noise, vocal noise, hesitation pauses, and so on. Anything present in the sample audio may therefore be imitated by the AI and appear in the final synthesis.
In other words, if you speak in a slow, flat voice, the result will usually sound the same; if you speak in an excited, fast manner, the AI will try to imitate that too.
Most importantly, we recommend keeping the voice performance as consistent as possible in every respect throughout the sample. If the first 2 seconds of the sample are excited and fast, the rest should stay as close to that as possible in tone, speed, volume, and other aspects. If your performance fluctuates too much within a single sample, it may confuse the AI and make each generation less predictable.
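One aspect of consistency, volume, can be estimated numerically. The sketch below is only an illustration and not how the cloning model evaluates a sample: `window_rms` and `loudness_spread_db` are hypothetical helpers that take decoded samples as plain floats, split them into fixed windows, and report how far apart the loudest and quietest windows are in decibels. Natural pauses between sentences will also increase the spread, so treat a large value as a prompt to listen again rather than as a verdict.

```python
import math
from typing import List, Sequence

def window_rms(samples: Sequence[float], window: int) -> List[float]:
    """RMS level of each full, non-overlapping window of `window` samples."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        levels.append(math.sqrt(sum(x * x for x in chunk) / window))
    return levels

def loudness_spread_db(levels: Sequence[float], floor: float = 1e-9) -> float:
    """Difference in dB between the loudest and the quietest window.

    `floor` avoids taking the logarithm of zero for completely silent windows.
    """
    if not levels:
        raise ValueError("need at least one window")
    loudest = max(max(levels), floor)
    quietest = max(min(levels), floor)
    return 20 * math.log10(loudest / quietest)
```

A sample whose second half is ten times quieter in amplitude than its first half gives a spread of about 20 dB, while a steady sample gives a spread close to 0 dB.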
In general:
The voice performance itself, the accent, and the recording quality greatly affect the final cloning result.
For instant cloning, the length of the audio matters less, but we recommend at least five seconds so that the sample contains enough information.
Keep the voice performance and recording quality as consistent as possible throughout the audio sample, and avoid large changes within the same segment.
The AI may also replicate the volume of the audio, so we recommend adjusting it to a reasonable, balanced level that is neither too loud nor too soft.
V2 series models (V2.9) support only Chinese and English. When using a V2 series model, make sure the input text contains no characters from other languages, such as Japanese or Korean; otherwise generation may fail or produce other issues.
Starting with the V3 series, we have added Cantonese, Japanese, Korean, French, German, Spanish, and Portuguese in addition to Chinese and English, along with more than 30 accent variants of these languages in total. Make sure the text content you use matches the languages supported by your model version.
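For the V2 language restriction above, a simple pre-check can flag Japanese kana and Korean Hangul before text is sent for generation. This is a minimal sketch with a hypothetical function name, `find_kana_or_hangul`, based on standard Unicode block ranges. It cannot detect Japanese kanji, because kanji share the same Unicode Han characters as Chinese, and it does not cover other unsupported scripts.

```python
from typing import List

# Unicode blocks checked here (inclusive code point ranges).
KANA_AND_HANGUL_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3130, 0x318F),  # Hangul Compatibility Jamo
    (0xAC00, 0xD7A3),  # Hangul Syllables
]

def find_kana_or_hangul(text: str) -> List[str]:
    """Return every Japanese kana or Korean Hangul character found in `text`."""
    return [
        ch for ch in text
        if any(low <= ord(ch) <= high for low, high in KANA_AND_HANGUL_RANGES)
    ]
```

An empty result means no kana or Hangul was found; it does not guarantee that a V2 model will accept the text.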