Important Notes

Precautions and best practices for voice cloning.

Instant Cloning/Style Guide Sample Precautions

As stated in the Overview, if the voice sample you provide is highly distinctive and our AI has not learned similar voices before, the generated result may be poor or may not replicate the voice well.

Sample quality matters more than length. Noisy samples may produce poor results, so provide the highest-quality sample speech you can. Currently, the sample speech must be longer than 2 seconds and the file size must not exceed 20 MB. You can also extract a high-quality vocal sample from any audio by using the vocal separation, noise reduction, vocal enhancement, and loudness normalization features of audio editing software.

For best results, we recommend 10-20 seconds of clear speech audio that is free of reverberation, echo, and background noise. For file quality, we recommend audio with a source bitrate of 128 kbps or higher so that it carries as much of the original information as possible.
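The limits above can be checked locally before uploading. The following is a minimal sketch, not part of the product: the function name check_instant_sample and the returned dictionary are hypothetical, it assumes 1 MB means 1024 * 1024 bytes, and it reads duration with Python's standard wave module, so it only works on uncompressed WAV files.

```python
import os
import wave

MIN_SECONDS = 2.0                  # sample must be longer than this
MAX_BYTES = 20 * 1024 * 1024       # 20 MB, assuming 1 MB = 1024 * 1024 bytes
RECOMMENDED_SECONDS = (10.0, 20.0) # recommended range, not a hard limit


def wav_duration_seconds(path):
    """Duration of an uncompressed WAV file, in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_instant_sample(path):
    """Hypothetical pre-upload check; returns errors and warnings as text."""
    result = {"errors": [], "warnings": []}

    if os.path.getsize(path) > MAX_BYTES:
        result["errors"].append("file is larger than 20 MB")

    duration = wav_duration_seconds(path)
    low, high = RECOMMENDED_SECONDS
    if duration <= MIN_SECONDS:
        result["errors"].append("sample must be longer than 2 seconds")
    elif not (low <= duration <= high):
        result["warnings"].append(
            f"sample is {duration:.1f} s; 10-20 s is recommended"
        )
    return result
```

A sample that returns no errors meets the stated hard limits; warnings only flag a length outside the recommended range. Compressed formats would need a separate decoder to measure duration, which this sketch does not cover.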

Professional Cloning Precautions

Before starting professional cloning, prepare one or more audio sample files.

Audio sample files need to meet the following requirements:

  • The total duration of all audio sample files combined must be at least 1 minute and at most 60 minutes; within this range, a longer total duration gives a better cloning result.

  • Each audio file must be in wav, mp3, mp4 (converting to an audio format is recommended), flac, m4a, or ogg format.

  • Provide the highest-quality audio you can, and make sure it contains recognizable sentences (for supported languages, refer to Model Introduction). Avoid heavy noise, multiple speakers, and other interference in the audio.

Once the audio sample files are ready, you can select them manually, drag them into the upload box, or package them into an unencrypted Zip archive. The system organizes the sample files automatically. The total size of the uploaded files cannot exceed 256 MB.
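The requirements for a professional cloning sample set can be expressed as a small pre-upload check. This is a sketch under stated assumptions: the Sample record and check_professional_set function are hypothetical names, durations are supplied by the caller (measured with whatever tool you already use), 1 MB is taken as 1024 * 1024 bytes, and the 256 MB limit is assumed to apply to the sum of the individual file sizes.

```python
from dataclasses import dataclass
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".mp4", ".flac", ".m4a", ".ogg"}
MIN_TOTAL_SECONDS = 60               # at least 1 minute in total
MAX_TOTAL_SECONDS = 60 * 60          # at most 60 minutes in total
MAX_TOTAL_BYTES = 256 * 1024 * 1024  # 256 MB, assuming 1 MB = 1024 * 1024 bytes


@dataclass
class Sample:
    """Hypothetical record describing one sample file."""
    name: str
    size_bytes: int
    duration_seconds: float


def check_professional_set(samples):
    """Return a list of problems; an empty list means the set passes."""
    errors = []

    # Format check is per file, based on the file extension only.
    for sample in samples:
        if Path(sample.name).suffix.lower() not in ALLOWED_EXTENSIONS:
            errors.append(f"{sample.name}: unsupported format")

    # Duration and size limits apply to the whole set.
    total_seconds = sum(s.duration_seconds for s in samples)
    total_bytes = sum(s.size_bytes for s in samples)
    if total_seconds < MIN_TOTAL_SECONDS:
        errors.append("total duration is under 1 minute")
    if total_seconds > MAX_TOTAL_SECONDS:
        errors.append("total duration is over 60 minutes")
    if total_bytes > MAX_TOTAL_BYTES:
        errors.append("total size is over 256 MB")
    return errors
```

For example, check_professional_set([Sample("take1.wav", 5_000_000, 45.0), Sample("take2.flac", 3_000_000, 40.0)]) returns an empty list, because the two files together run 85 seconds and both use supported formats.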

Comprehensive Precautions

Our AI voice model tries to imitate everything it hears in the audio, such as the speaker's tone, speed, accent, breathing pattern, intensity, background noise, vocal noise, hesitation pauses, and everything else. This means that anything present in the sample audio may be imitated by the AI and appear in the final synthesis.

In other words, if you speak in a slow, flat voice, the result will usually sound the same; if you speak in an excited, fast manner, the AI will try to imitate that as well.

Most importantly, keep the voice performance as consistent as possible in every respect throughout the sample. If the first 2 seconds of the sample are excited and fast, the rest of the sample should maintain a similar performance, including tone, speed, and volume. If your performance fluctuates too much within one sample, it may confuse the AI and make each generation less predictable.

In general:

  • The voice performance itself, the accent, and the recording quality greatly affect the final cloning result.

  • For instant cloning, audio length matters less than quality, but we recommend at least 5 seconds so that the sample contains enough information.

  • Keep the voice performance and recording quality as consistent as possible throughout the audio sample, and avoid large changes within the same segment.

  • The AI may also replicate the volume of the audio, so we recommend adjusting the sample to a reasonable, balanced volume that is neither too loud nor too soft.
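One way to inspect volume and consistency before uploading is to measure the level of each second of the sample. The sketch below is an illustration only: the function names are hypothetical, it handles 16-bit PCM WAV files using only the Python standard library, and it reports RMS level in dBFS (decibels relative to full scale). This guide does not state numeric loudness targets, so the code reports values without judging them.

```python
import array
import math
import sys
import wave


def window_levels_dbfs(path, window_seconds=1.0):
    """RMS level of each window in dBFS, for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Window length in individual sample values (all channels interleaved).
    window = max(1, int(rate * window_seconds)) * channels
    levels = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(v * v for v in chunk) / len(chunk))
        # Full scale for 16-bit audio is 32768; pure silence maps to -inf.
        levels.append(20 * math.log10(rms / 32768) if rms > 0 else float("-inf"))
    return levels


def level_spread_db(levels):
    """Difference between the loudest and quietest non-silent window."""
    audible = [level for level in levels if level != float("-inf")]
    return max(audible) - min(audible) if audible else 0.0
```

A small spread between windows suggests a steady recording level, while a large spread points to the kind of fluctuation described above. Note that the final window may be shorter than the others, which can skew its level slightly.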
