Model Introduction
Here you can learn about our models and their respective advantages and disadvantages.
Our VOCU Speech Large Model is pre-trained on a massive amount of Chinese audio covering many types of content, most of it audiobooks and everyday conversation. If your cloning audio samples and target text are of these types, generation results will typically be better. The model does its best to mimic the tone, speed, emotion, pauses, loudness, acoustic environment, breathing sounds, accent, and vocalization style of the cloning audio samples, to understand the context of the target text as far as it can, and to combine the two into the best-matching speech.
VOCU Speech Large Model V2.9
Released on March 1, 2025. Currently the latest version; characters from previous V2.X versions have been automatically upgraded to this compatible version.
The latest version of the Vocu speech large model. It introduces many features and improvements from the V3 model under development and achieves globally state-of-the-art (SOTA) performance in Chinese speech generation. Updates compared to previous versions include:
Significantly improved the audio quality of the non-Flash model (i.e., high-quality mode) and resolved the long-standing issue of occasional electrical-noise (current) artifacts in generated results.
Significantly improved character similarity and character stability in the non-Flash model (i.e., high-quality mode); the model can now largely reproduce the acoustic environment of the character's original audio samples (such as sense of space, reverb, volume, and recording texture).
Added a world-first character voice mixing capability: you can specify audio samples from several different characters and mix them in any proportions to create a new character voice. (In internal testing; will be opened gradually.)
Added a world-first character style mixing capability. When creating a new emotional style, you can specify style samples and character samples separately and fuse them into a new emotional expression for the character; for example, you can fuse the style of a crosstalk performer with a little girl character to create how that little girl character sounds when performing crosstalk. (In internal testing; will be opened gradually.)
Added a zero-barrier intelligent character dubbing/cover capability: an existing character can re-dub a completed generation result, or dub or cover any speech or song audio you provide, while retaining many of the character's style characteristics, giving you more freedom in audio creation. (In internal testing; will be opened gradually.)
Currently, generating with characters on this model version costs 1 point per character of text.
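As a rough illustration of the per-character pricing above, the following Python sketch estimates the cost of a piece of target text. The function name `estimate_points` is hypothetical, and the counting rule is an assumption: this page does not say whether punctuation and whitespace are billed, so the sketch simply counts every character in the string.

```python
def estimate_points(text: str, points_per_char: int = 1) -> int:
    """Estimate generation cost at a flat per-character rate.

    Assumption: every character of the target text, including punctuation
    and whitespace, is counted. The actual billing rule may differ.
    """
    if points_per_char < 0:
        raise ValueError("points_per_char must be non-negative")
    return len(text) * points_per_char


sample = "你好，世界！"
print(estimate_points(sample))  # 6 characters at 1 point each -> 6
```

Treat the result as an upper-level estimate for budgeting rather than an exact bill.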
VOCU Speech Large Model V2.5
Released on November 26, 2024. Deprecated; corresponding characters have been automatically upgraded to the latest version of the V2.X series.
The second official version of our V2.X series speech large model. It introduced new hyperparameters and training strategies that further improved the naturalness, prosody, and emotional expression of generated results compared to previous versions and, to some extent, improved character similarity, stability on long content, and English generation performance.
VOCU Speech Large Model V2.1
Released on August 16, 2024. Deprecated; corresponding characters have been automatically upgraded to the latest version of the V2.X series.
The first official version of our V2.X series speech large model. Compared to previous versions, it significantly improves natural emotional expression, generation quality, stability, and instant cloning similarity, delivers faster generation and higher audio quality, improves English generation, and adds or improves the following capabilities:
Dialect Support: Thanks to Vocu's strong understanding of the human voice, we now provide initial support for some dialect accents, including Mandarin-type dialect accents such as Henan, Northeast, and Chongqing dialects, and a few non-Mandarin pronunciations.
Low-latency Playback: The V2.1 Flash model can start playing generation results in as little as 1 second, with no limit on text length, meeting a range of low-latency, real-time needs. On the web, select "low-latency mode".
More Refined Instant Cloning Capability: In instant cloning mode, V2.1's understanding of longer samples has improved fourfold compared to V1.0, and it can imitate the various expressions contained in longer samples in greater depth.
Better Long Context Understanding: V2.1's understanding of longer generation text has improved threefold compared to V1.0; it can take in more text at once and generate more fitting and coherent speech.
WebSocket Millisecond Generation: We have added a WebSocket generation channel for developers that supports streaming generation requests and streaming result returns, with generation latency as low as 500 ms, enough for highly real-time use cases.
Faster Professional Cloning Speed: The time required for professional cloning has been greatly reduced. For a 30-minute sample, a cloning task can complete within 3-5 minutes.
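The low-latency playback and streaming items above rest on the same idea: the client starts playing audio as soon as the first chunk arrives instead of waiting for the whole result. The Python sketch below illustrates only that client-side pattern. It does not use the real Vocu WebSocket channel, whose endpoint and message format are not described on this page; `simulated_stream` and `stream_and_play` are hypothetical names, and the chunk source is a local stand-in with an artificial delay.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def simulated_stream(n_chunks: int, delay_s: float) -> Iterator[bytes]:
    """Stand-in for a streaming generation channel (not the real API)."""
    for i in range(n_chunks):
        time.sleep(delay_s)  # pretend the server is still generating
        yield f"chunk-{i}".encode()


def stream_and_play(
    chunks: Iterable[bytes], play: Callable[[bytes], None]
) -> Tuple[float, float]:
    """Hand each chunk to `play` as soon as it arrives.

    Returns (seconds until the first chunk was played,
             seconds until the stream ended).
    """
    start = time.monotonic()
    first_chunk_at = None
    for chunk in chunks:
        play(chunk)
        if first_chunk_at is None:
            first_chunk_at = time.monotonic() - start
    total = time.monotonic() - start
    if first_chunk_at is None:
        raise ValueError("stream produced no audio chunks")
    return first_chunk_at, total


played: List[bytes] = []  # a real client would feed an audio device instead
first, total = stream_and_play(simulated_stream(5, 0.01), played.append)
print(f"first chunk after {first:.3f}s, full result after {total:.3f}s")
```

The time to the first chunk, not the total generation time, is what the user perceives as latency, which is why a streaming channel can feel fast even for long text.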
VOCU Speech Large Model V2.0Beta-3
Released on July 8, 2024. Deprecated; corresponding characters have been automatically upgraded to the latest version of the V2.X series.
The third test version of our V2.X series speech large model, with the following improvements over the second test version:
The electrical (current) noise problem has been greatly reduced; for most timbre samples, no obvious electrical noise should now be perceptible.
Stability has greatly improved; single-pass generation of long and complex content should now be much more stable.
Emotional prosody has greatly improved; emotional performance should now be noticeably better for samples that are not overly flat. For the best experience, use text containing modal particles and colloquial expressions.
English performance has greatly improved and has basically reached a usable state.
Flash model streaming generation latency has been reduced by more than 50%; when resources are sufficient, playable generation results are consistently available within 500 ms to 1 second.
With new technical strategies, concurrent capacity has greatly improved, which should ease the congestion caused by the recent increase in usage.
For details of the changes since V1.0, please refer to the introduction to the official V2.1 version.
VOCU Speech Large Model V2.0Beta-2-Flash
Released on June 25, 2024. Deprecated together with V2.0Beta-2; all subsequent versions ship with a corresponding Flash model.
The first low-latency Flash branch of the Vocu speech large model, derived from and mutually compatible with V2.0Beta-2. It brings low-latency streaming playback during generation, at the cost of lower audio quality than the main model. When resources are sufficient, content of any length yields playable generation results within 1-2 seconds of task submission through this Flash model.
VOCU Speech Large Model V2.0Beta-2
Released on June 18, 2024. Deprecated; corresponding characters have been automatically upgraded to the latest version of the V2.X series.
The second test version of our V2.X series speech large model, with further improvements in stability, usability, audio quality, and other aspects compared to the first test version. For details of the changes since V1.0, please refer to the introduction to the official V2.1 version.
VOCU Speech Large Model V2.0Beta-1
Released on June 10, 2024. Deprecated; corresponding characters have been automatically upgraded to the latest version of the V2.X series.
The first test version of our V2.X series speech large model. Its generation quality, stability, instant cloning similarity, generation speed, and audio quality were significantly improved compared to V1.0, but it still had many problems at the time. For details of the changes since V1.0, please refer to the introduction to the official V2.1 version.
VOCU Speech Large Model V1.0
Released on January 11, 2024. Now discontinued; new V1.0 characters can no longer be created. Existing V1.0 instant cloning characters can continue to generate or be upgraded to V2.X with one click.
Our first officially released model version. It can understand text context to a certain extent, generate human voice audio from text with almost the same expressiveness, emotion, prosody, and timbre as a real person, and supports instant voice cloning from extremely short samples. This version also brings experimental support for English speech synthesis and cloning, although its stability and expressiveness may currently be worse than for Chinese.
Currently, generation with this model costs 1 point per character of text.
VOCU Speech Large Model V0.9Beta
Released in November 2023. Deprecated; corresponding characters have been automatically upgraded to the latest version of the V1.X series.
Our first publicly released experimental speech large model, and also the world's first generative speech large model with natural, localized Chinese expression. It can generate speech with near-human speech rate, intonation, and tone, imitates emotional changes better, bringing AI closer to human speech, and supports instant voice cloning. It currently supports Chinese only.
This version of the speech model (V0.9) is still in the early testing stage and has many known issues.