Important Notes

Important notes and best practices for voice cloning.

Instant Cloning/Style Guide Sample Considerations

As mentioned in the Overview, if the voice in your sample is unusual and our AI has not learned similar voices before, the generated results may be poor or may fail to reproduce the voice well.

Sample quality matters more than length. Noisy samples may produce poor results, so provide high-quality sample audio whenever possible. Currently, sample audio must be longer than 2 seconds and the file size must not exceed 10M. You can use the voice separation, audio noise reduction, voice beautification, and loudness standardization functions of the CapCut PC version to obtain high-quality voice samples from any audio.

For best results, we recommend 5-8 seconds of clear voice audio with no reverb, echo, or background noise. For file quality, we recommend audio with a source bit rate of 128 kbps or above, so that as much of the original information as possible is preserved.
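
The limits above can be checked before uploading. The following is a minimal sketch, not part of the product: the function name check_instant_sample is hypothetical, "10M" is read as 10 MiB (an assumption), and only WAV files are handled because the Python standard library cannot decode the other formats. The 128 kbps recommendation is not checked here.

```python
import os
import wave

# Limits stated in this guide. "10M" is read as 10 MiB, which is an assumption.
MIN_DURATION_S = 2.0
MAX_SIZE_BYTES = 10 * 1024 * 1024
RECOMMENDED_RANGE_S = (5.0, 8.0)


def check_instant_sample(path):
    """Check a WAV sample; return (errors, warnings) as two lists of strings."""
    errors, warnings = [], []

    size = os.path.getsize(path)
    if size > MAX_SIZE_BYTES:
        errors.append(f"file is {size} bytes, above the 10M limit")

    # Duration is the number of frames divided by the frame (sample) rate.
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()

    if duration <= MIN_DURATION_S:
        errors.append(f"duration {duration:.2f}s is not greater than 2 seconds")
    elif not RECOMMENDED_RANGE_S[0] <= duration <= RECOMMENDED_RANGE_S[1]:
        warnings.append(f"duration {duration:.2f}s is outside the recommended 5-8 seconds")

    return errors, warnings
```

An entry in errors means the sample breaks a stated requirement. An entry in warnings only means it falls outside the recommended range.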

Professional Cloning Considerations

Before starting Professional Cloning, prepare one or more audio sample files.

Audio sample files need to meet the following requirements:

  • The total duration of all audio sample files combined should be at least 1 minute and at most 60 minutes; within this range, the longer the total duration, the better the cloning effect.

  • Each audio file must be in one of these formats: wav, mp3, mp4, flac, m4a, or ogg.

  • Provide high-quality audio whenever possible. The audio should contain recognizable Chinese or English sentences and, for best results, no reverb, echo, or background noise.
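
The format and total-duration requirements above can be expressed as a small check. This is a sketch under stated assumptions: check_professional_samples is a hypothetical helper, it expects the caller to supply each file's duration (measured with whatever audio tool you already use), and it treats file extensions as case-insensitive, which this guide does not specify.

```python
from pathlib import Path

# Formats and duration bounds listed in this guide.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".mp4", ".flac", ".m4a", ".ogg"}
MIN_TOTAL_S = 60        # 1 minute
MAX_TOTAL_S = 60 * 60   # 60 minutes


def check_professional_samples(samples):
    """samples: iterable of (filename, duration_in_seconds) pairs.

    Returns a list of problems; an empty list means these checks passed.
    """
    problems = []
    total = 0.0
    for name, duration in samples:
        # Case-insensitive extension match is an assumption of this sketch.
        if Path(name).suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported format")
        total += duration

    if total < MIN_TOTAL_S:
        problems.append(f"total duration {total:.0f}s is below 1 minute")
    elif total > MAX_TOTAL_S:
        problems.append(f"total duration {total:.0f}s is above 60 minutes")
    return problems
```

For example, check_professional_samples([("a.wav", 40), ("b.mp3", 30)]) returns an empty list, because both formats are allowed and the 70-second total is within range.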

Once the audio sample files are ready, pack them into an unencrypted Zip archive. The archive must not exceed 256MB.
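
Packing can be scripted with Python's standard zipfile module, which cannot create encrypted archives, so its output is always unencrypted. The sketch below is illustrative only: pack_samples is a hypothetical name, "256MB" is read as 256 MiB, and the flat layout (every file at the top level of the archive) is an assumption, since this guide does not describe a required layout.

```python
import os
import zipfile

MAX_ZIP_BYTES = 256 * 1024 * 1024  # "256MB" read as 256 MiB (assumption)


def pack_samples(sample_paths, zip_path):
    """Write the samples into an unencrypted Zip file and return its size in bytes."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sample_paths:
            # Store each file under its base name so the archive is flat.
            archive.write(path, arcname=os.path.basename(path))

    size = os.path.getsize(zip_path)
    if size > MAX_ZIP_BYTES:
        raise ValueError(f"{zip_path} is {size} bytes, above the 256MB limit")
    return size
```

Because files are stored by base name, two samples with the same file name in different folders would collide, so rename them before packing.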

General Considerations

Our AI voice model tries to imitate everything it hears in the audio: the speaker's tone, speed, accent, breathing patterns, intensity, background noise, vocal noise, hesitant pauses, and so on. Anything present in the sample audio may therefore be imitated by the AI and show up in the final synthesis.

In other words, if you speak in a slow, flat voice, the result will usually sound the same; if you speak in an excited, fast manner, the AI will try to imitate that too.

Most importantly, we recommend keeping the voice performance as consistent as possible throughout the entire sample. If the first 2 seconds are excited and fast, the rest of the sample should match them as closely as possible in tone, speed, volume, and other aspects. If the performance fluctuates too much within one sample, it may confuse the AI and make the results less predictable from one generation to the next.
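
Volume is one aspect of consistency that is easy to measure. The sketch below is an illustration, not how the model evaluates samples: window_levels_dbfs is a hypothetical helper that splits a 16-bit PCM WAV file into one-second windows and reports the RMS level of each window in dBFS. It covers volume only; tone and speed are not measured.

```python
import array
import math
import sys
import wave


def window_levels_dbfs(path, window_s=1.0):
    """Return the RMS level in dBFS of each complete window of a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Number of interleaved sample values per window (all channels together).
    step = int(rate * window_s) * channels
    levels = []
    for start in range(0, len(samples) - step + 1, step):
        window = samples[start:start + step]
        rms = math.sqrt(sum(s * s for s in window) / len(window))
        # 32768 is full scale for 16-bit audio; pure silence is reported as -inf.
        levels.append(20 * math.log10(rms / 32768) if rms > 0 else float("-inf"))
    return levels
```

The difference between max(levels) and min(levels) gives a rough idea of how much the volume drifts across the sample. This guide does not define a threshold, so any cutoff you apply is your own choice.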

In summary:

  • The voice performance, the accent, and the recording quality all greatly affect the final cloning result.

  • For instant cloning, audio length matters less than quality, but we recommend at least five seconds so that the sample contains enough information.

  • Keep the voice performance and recording quality consistent throughout the entire audio sample, and avoid large changes within the same segment.

  • The AI may also replicate the volume of the audio, so we recommend a balanced volume that is neither too loud nor too soft.
