Overview
By reading our guides and technical documentation, you can gain a deeper understanding of VOCU's various features.
This guide will help you get started with Vocu.ai. It walks you step by step through account registration, cloning your first voice, and generating your first speech. It also explains how to give the AI better audio and text prompts to improve overall generation quality, and introduces our current shortcomings and limitations.
Start with Voice Management, where you can create characters, add audio samples for voice cloning, and set a name and description for each one. Once you have added characters, go to the Speech Generation page, where you can use their voices to generate your first speech.
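The two-step workflow above (create a character with cloning samples, then generate speech with that voice) can be sketched as a pair of request builders. Note this is an illustrative sketch only: the endpoint paths, field names, and helper functions below are assumptions for clarity, not Vocu.ai's actual API.

```python
# Hypothetical sketch of the workflow: create a character, then generate
# speech with it. All paths and field names here are assumed, not real.

def build_character_request(name, description, sample_files):
    """Step 1: a character with a name, a description, and audio samples."""
    return {
        "endpoint": "/voices",  # assumed path
        "name": name,
        "description": description,
        "samples": list(sample_files),
    }

def build_speech_request(character_id, text):
    """Step 2: generate speech using the character's cloned voice."""
    return {
        "endpoint": "/speech",  # assumed path
        "voice_id": character_id,
        "text": text,
    }

if __name__ == "__main__":
    char = build_character_request(
        "Narrator", "Calm audiobook voice", ["sample1.wav", "sample2.wav"]
    )
    speech = build_speech_request("voice-123", "Hello, this is my first speech.")
    print(char["name"], "->", speech["text"])
```

In the real product, both steps are performed in the web interface; the sketch only mirrors the order of operations the guide describes.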
How AI Models Work
Our VOCU Speech Large Model has been pre-trained on massive amounts of Chinese audio data covering various types of content, the majority being audiobooks and everyday conversational audio. If the cloning samples and target text you provide are of these types, you will typically get better results when generating speech. The model does its best to mimic the tone, speed, emotion, pauses, loudness, acoustic environment, breathing sounds, accent, and vocal style of the cloning samples, understands the context of the target text as fully as possible, and combines the two to produce the best-matching speech.
Shortcomings and Limitations
The current version of the speech model (V2.9) already has human-like speech generation capabilities, but it's still not perfect. You may encounter the following issues during use:
Occasional unstable results: You may occasionally encounter lower-quality generation results. You can improve overall stability by setting the generation style to stable, but this reduces the chance of richer, more expressive output. You can also generate the same text multiple times and keep the best result.
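The "generate the same text multiple times" tip can be sketched as a best-of-N loop: run several generations and keep the one a quality measure likes best. The `generate` and `score` callables below are stand-ins for whatever generation call and quality judgment (automatic or your own ear) you actually use; they are assumptions, not Vocu.ai's API.

```python
# Best-of-N retry sketch: generate the same text several times and keep
# the highest-scoring candidate. `generate` and `score` are placeholders.

def best_of_n(generate, score, text, n=3):
    """Generate `text` n times and return the highest-scoring result."""
    candidates = [generate(text) for _ in range(n)]
    return max(candidates, key=score)

if __name__ == "__main__":
    # Stub generator cycling through fake "takes"; longer == better here.
    outputs = iter(["take-a", "take-bb", "take-c"])
    fake_generate = lambda text: next(outputs)
    fake_score = len  # pretend a longer output scores higher
    print(best_of_n(fake_generate, fake_score, "Hello", n=3))  # take-bb
```

In practice the scoring step is usually you listening to the candidates, but the loop structure is the same.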
English stability or quality may be lower than Chinese: The current version of the model supports bilingual cloning and synthesis in Chinese and English, but English support is still experimental, so cloning and synthesis of English content may perform slightly worse than Chinese content.
Reduced quality with highly exaggerated, sharp, or unusual cloning samples: With such samples you may see lower audio quality, similarity, or stability. You can often improve results by generating single sentences multiple times and then using your most satisfactory output as a new cloning sample.