VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication

anonymous authors

Abstract

Neural speech codecs (NSCs) enable high-quality real-time communication (RTC) at low bit rates, making them efficient for bandwidth-constrained environments. However, customizing or modifying the timbre of transmitted voices still relies on separate voice conversion (VC) systems, leaving a gap: no fully integrated system jointly optimizes efficient transmission and streaming VC without additional latency. In this paper, we propose VChangeCodec, a high-efficiency codec that integrates the voice changer model directly into the speech codec. This design switches seamlessly between the original voice mode and a customized voice changer mode in real time. Specifically, leveraging the target speaker’s embedding, we incorporate a lightweight causal projection network within the encoding module of VChangeCodec to adapt timbre at the token level. These adapted tokens are quantized and transmitted to the decoding module, which generates the converted speech of the target speaker. The integrated framework achieves an ultra-low latency of just 40 ms and requires fewer than 1 million parameters, making it well suited to RTC scenarios such as online conferencing. Our comprehensive evaluations, including subjective listening tests and objective performance assessments, demonstrate that VChangeCodec excels in timbre adaptation compared to state-of-the-art (SOTA) VC models. We are confident that VChangeCodec provides an efficient and flexible framework for RTC systems, tailored to specific operator requirements.
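
To make the idea of a causal, speaker-conditioned projection concrete, here is a toy sketch in plain Python. It is not the network used in VChangeCodec. The function name `causal_projection`, the temporal weights `taps`, and the additive use of the speaker embedding are all assumptions made for illustration. The sketch shows only two properties described in the abstract: each output frame depends on the current and past frames alone (so no look-ahead latency is added), and omitting the target speaker embedding leaves the features untouched (original voice mode).

```python
from typing import List, Optional

Frame = List[float]


def causal_projection(
    frames: List[Frame],
    speaker_emb: Optional[Frame],
    taps: List[float],
    gain: float = 1.0,
) -> List[Frame]:
    """Toy causal projection (hypothetical, for illustration only).

    taps[0] weights the current frame, taps[k] the frame k steps in the past.
    Frames before the start of the stream are treated as zeros.
    """
    # Original voice mode: no target speaker, so features pass through unchanged.
    if speaker_emb is None:
        return [list(f) for f in frames]

    dim = len(speaker_emb)
    adapted: List[Frame] = []
    for t in range(len(frames)):
        # Start from the (scaled) target speaker embedding.
        acc = [gain * s for s in speaker_emb]
        # Add a weighted sum of the current and past frames only.
        for k, weight in enumerate(taps):
            if t - k < 0:
                break  # nothing earlier than the first frame
            past = frames[t - k]
            for d in range(dim):
                acc[d] += weight * past[d]
        adapted.append(acc)
    return adapted


frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
target = [0.5, -0.5]
adapted = causal_projection(frames, target, taps=[1.0, 0.5])
passthrough = causal_projection(frames, None, taps=[1.0, 0.5])
```

Because the loop never reads `frames[t + 1]` or later, changing a future frame cannot alter an earlier output, which is the property a streaming encoder needs in order to avoid extra buffering.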

Demo of original voice mode

“Ref” denotes the reference speech. We provide samples compressed by various codecs, including both signal-processing-based and neural codecs. In these comparisons, “VChangeCodec@9.5kbps” refers to our original voice mode.

Demo of female

English

Ref Opus@6kbps Lyra2@6kbps EVS@7.2kbps Opus@8kbps VChangeCodec@9.5kbps Lyra2@9.2kbps EVS@9.6kbps Opus@10kbps Opus@16kbps

Demo of male

English

Ref Opus@6kbps Lyra2@6kbps EVS@7.2kbps Opus@8kbps VChangeCodec@9.5kbps Lyra2@9.2kbps EVS@9.6kbps Opus@10kbps Opus@16kbps

Demo of female

Mandarin

Ref VChangeCodec@9.5kbps DAC@8kbps Encodec@12kbps SpeechTokenizer

Demo of male

Mandarin

Ref VChangeCodec@9.5kbps DAC@8kbps Encodec@12kbps SpeechTokenizer

Demo of voice changer mode

Demo of target timbre 1

“Target label” denotes the actual target speech signal. “Ref” is the reference timbre of the target speaker. “Source” is the input speech signal. “Original” refers to our original voice mode, while “VChangeCodec” refers to our voice changer mode.

Please slide the mouse to select different files for listening.

English

Target label Ref Source Original VChangeCodec FACodec DDDM-VC QuickVC DiffVC VQMIVC

Mandarin

Target label Ref Source Original VChangeCodec FACodec DDDM-VC QuickVC DiffVC VQMIVC

Demo of target timbre 2

“Target label” denotes the actual target speech signal. “Ref” is the reference timbre of the target speaker. “Source” is the input speech signal. “Original” refers to our original voice mode, while “VChangeCodec” refers to our voice changer mode.

Please slide the mouse to select different files for listening.

English

Target label Ref Source Original VChangeCodec FACodec DDDM-VC QuickVC DiffVC VQMIVC

Mandarin

Target label Ref Source Original VChangeCodec FACodec DDDM-VC QuickVC DiffVC VQMIVC