VChangeCodec: An ultra Low-Complexity Neural Speech Codec with Built-in Voice Changer for Customized Real-time Communication
anonymous authors
Abstract
We present VChangeCodec, an ultra low-complexity neural speech codec with built-in timbre adaptation, designed for customized real-time communication (RTC). Cascaded pipelines that combine a neural speech codec (NSC) with a separate voice conversion (VC) system suffer from high latency. VChangeCodec instead integrates VC into the NSC itself and can seamlessly switch between original voice mode and voice changer mode. Specifically, we build the NSC as a fully causal convolutional network with scalar quantization, which generates compact tokens. In addition, we introduce a lightweight causal projection network that adapts the tokens using target speaker embeddings, so that they decode into the converted voice. This enables seamless timbre conversion without an additional VC system. Experiments show that, compared to the state-of-the-art speech codec, VChangeCodec reduces the parameter count by 96.3% while maintaining competitive speech quality. More importantly, while the cascaded pipeline incurs more than 100 ms of latency, VChangeCodec achieves integrated timbre customization with only 40 ms, making it well suited to RTC scenarios.
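To make the two building blocks named in the abstract concrete, the following is a minimal sketch of a causal 1-D convolution and a uniform scalar quantizer. It only illustrates the general techniques. The function names, kernel values, number of levels, and value range are assumptions chosen for the example and are not taken from VChangeCodec.

```python
from typing import List, Sequence


def causal_conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Causal 1-D convolution: output[t] uses only x[t], x[t-1], ..., never future samples."""
    k = len(kernel)
    # Left-pad with zeros so the first output already has a full (zero-filled) history.
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
        for t in range(len(x))
    ]


def scalar_quantize(values: Sequence[float], levels: int = 16,
                    lo: float = -1.0, hi: float = 1.0) -> List[int]:
    """Quantize each value independently to an integer token in [0, levels - 1]."""
    step = (hi - lo) / (levels - 1)
    tokens = []
    for v in values:
        clipped = min(max(v, lo), hi)  # keep the value inside the quantizer range
        tokens.append(round((clipped - lo) / step))
    return tokens


def scalar_dequantize(tokens: Sequence[int], levels: int = 16,
                      lo: float = -1.0, hi: float = 1.0) -> List[float]:
    """Map integer tokens back to their reconstruction values."""
    step = (hi - lo) / (levels - 1)
    return [lo + t * step for t in tokens]
```

Because each output sample depends only on current and past inputs, a causal layer of this kind adds no look-ahead delay, which is the property that matters for real-time use. Scalar quantization treats every latent dimension independently, so no codebook search is required.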
Demo for customized real-time communication
We demonstrate real-time voice calls on mobile devices, where users can communicate in original voice mode and seamlessly switch to voice changer mode (a predefined timbre) for customized voice transmission.
On the sender side, users can switch to voice changer mode at any time, according to their preferences, to generate timbre-adapted tokens. These tokens are fed to the decoder network on the receiver side, which reconstructs the altered speech without any compatibility changes to the decoder. VC is a built-in part of the voice communication module, and its internal configuration is managed by operators, so users cannot arbitrarily modify settings, which minimizes privacy risks.
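The sender-side switch described above can be sketched as follows. This is an illustrative outline only: `Mode`, `sender_step`, and the toy `toy_encode`, `toy_project`, and `toy_decode` functions are hypothetical stand-ins for the real encoder, projection network, and decoder, which are not specified here.

```python
from enum import Enum
from typing import Callable, List

Tokens = List[int]


class Mode(Enum):
    ORIGINAL = "original"
    VOICE_CHANGER = "voice_changer"


def sender_step(
    frame: List[float],
    mode: Mode,
    encode: Callable[[List[float]], Tokens],
    project: Callable[[Tokens], Tokens],
) -> Tokens:
    """Encode one frame; in voice changer mode, adapt the tokens before sending."""
    tokens = encode(frame)
    if mode is Mode.VOICE_CHANGER:
        # The adapted tokens keep the same format, so the receiver needs no changes.
        tokens = project(tokens)
    return tokens


# Toy stand-ins (assumptions for illustration, NOT the real networks).
def toy_encode(frame: List[float]) -> Tokens:
    return [int(round(abs(s) * 7)) for s in frame]


def toy_project(tokens: Tokens) -> Tokens:
    return [(t + 1) % 8 for t in tokens]


def toy_decode(tokens: Tokens) -> List[float]:
    return [t / 7 for t in tokens]
```

In this sketch the receiver always calls the same `toy_decode`, whichever mode the sender chose. That mirrors the point that switching modes requires no compatibility changes on the decoding side.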
Demo of original voice mode
“Ref” denotes the reference speech. We provide samples compressed by various codecs, covering both signal-processing-based and neural methods. In these comparisons, “VChangeCodec@9.5kbps” refers to our original voice mode.
Demo of female speaker
English
Ref | Opus@6kbps | Lyra2@6kbps | EVS@7.2kbps | Opus@8kbps | VChangeCodec@9.5kbps | Lyra2@9.2kbps | EVS@9.6kbps | Opus@10kbps | Opus@16kbps |
---|---|---|---|---|---|---|---|---|---|
Demo of male speaker
English
Ref | Opus@6kbps | Lyra2@6kbps | EVS@7.2kbps | Opus@8kbps | VChangeCodec@9.5kbps | Lyra2@9.2kbps | EVS@9.6kbps | Opus@10kbps | Opus@16kbps |
---|---|---|---|---|---|---|---|---|---|
Demo of female speaker
Mandarin
Ref | VChangeCodec@9.5kbps | DAC@8kbps | Encodec@12kbps |
---|---|---|---|
Demo of male speaker
Mandarin
Ref | VChangeCodec@9.5kbps | DAC@8kbps | Encodec@12kbps |
---|---|---|---|
Demo of voice changer mode
Demo of target timbre 1
“Target label” denotes the actual target speech signal, which is generated by the RVC model. “Ref” is the reference timbre of the target speaker. “Source” is the input speech signal. “Original” refers to our original voice mode, while “VChangeCodec” refers to our voice changer mode.
English
Target label | Ref | Source | Original | VChangeCodec | FACodec | DDDM-VC | QuickVC | DiffVC | VQMIVC |
---|---|---|---|---|---|---|---|---|---|
Mandarin
Target label | Ref | Source | Original | VChangeCodec | FACodec | DDDM-VC | QuickVC | DiffVC | VQMIVC |
---|---|---|---|---|---|---|---|---|---|
Demo of target timbre 2
“Target label” denotes the actual target speech signal, which is generated by the RVC model. “Ref” is the reference timbre of the target speaker. “Source” is the input speech signal. “Original” refers to our original voice mode, while “VChangeCodec” refers to our voice changer mode.
English
Target label | Ref | Source | Original | VChangeCodec | FACodec | DDDM-VC | QuickVC | DiffVC | VQMIVC |
---|---|---|---|---|---|---|---|---|---|
Mandarin
Target label | Ref | Source | Original | VChangeCodec | FACodec | DDDM-VC | QuickVC | DiffVC | VQMIVC |
---|---|---|---|---|---|---|---|---|---|