Patent Pending
Converting voice identity in real time while preserving linguistic clarity and emotional nuance is a long-standing hurdle in speech synthesis. Researchers at UC Berkeley have developed a system for real-time voice style conversion that transforms a source speaker's speech to match the timbre, accent, and emotion of a target speaker. The technology uses a content extraction network built from conformer blocks and a low-dimensional quantization method, with fewer than 100 total levels, to preserve linguistic fidelity. By extracting continuous representations before quantization, the system maintains higher speech quality than traditional discrete methods. A diffusion-based generation network then produces a mel-spectrogram conditioned on these content features and a target style embedding, which a vocoder converts to audio. Chunked-causal attention enables streaming operation with near-instantaneous transformation.
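The pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the actual system: the function names, feature dimensions, and random stand-in for the trained extractor are all assumptions, and the diffusion decoder and vocoder stages are only indicated in comments. What it does show concretely is the described bottleneck: a continuous representation is extracted first and only then snapped to a small number of quantization levels.

```python
import numpy as np

def extract_content(audio_frames):
    """Stand-in for the conformer-based content extraction network.

    A trained model would map audio frames to continuous per-frame
    content features; here a seeded random projection is used purely
    as a placeholder, squashed to [-1, 1].
    """
    rng = np.random.default_rng(0)  # placeholder for a trained model
    return np.tanh(rng.standard_normal((len(audio_frames), 8)))

def quantize(z, levels=64):
    """Low-dimensional quantization: snap each feature to one of
    `levels` evenly spaced values in [-1, 1]. Quantizing AFTER
    extracting a continuous representation is what the summary
    credits for preserving speech quality."""
    idx = np.round((np.clip(z, -1, 1) + 1) / 2 * (levels - 1))
    return idx / (levels - 1) * 2 - 1

def convert(audio_frames, style_embedding, levels=64):
    """End-to-end skeleton: content -> quantize -> generation."""
    z = extract_content(audio_frames)
    q = quantize(z, levels=levels)
    # A diffusion model would generate a mel-spectrogram conditioned
    # on (q, style_embedding); a vocoder would then render audio.
    return q

frames = [np.zeros(160)] * 50   # 50 dummy 10 ms frames at 16 kHz
style = np.zeros(128)           # dummy target-style embedding
q = convert(frames, style)
print(len(np.unique(q)) <= 64)  # the bottleneck holds: True
```

The style embedding would be extracted once from a short sample of the target speaker and held fixed during streaming, so only the content path runs per frame.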
- Entertainment and Gaming: Allowing players or actors to adopt the voices of specific characters in real time with full emotional expression.
- Localization and Dubbing: Converting the voice of a foreign-language speaker to a target voice while preserving the original acting performance and accent.
- Call Center Personalization: Harmonizing the vocal timbre of agents to a specific brand identity while maintaining their natural speech patterns.
- Assistive Communication: Helping individuals with speech impairments or vocal cord damage communicate in their original voice or a chosen identity.
- Privacy and Anonymization: Protecting the identity of speakers in sensitive contexts by transforming their voice to a consistent, non-identifiable target.
- Real-Time Streaming: Integrated chunked-causal attention allows for low-latency processing, making it suitable for live conversations and interactive media.
- High Linguistic Fidelity: The use of continuous representations before the quantization bottleneck ensures that words and syllables remain clear and accurate.
- Nuanced Style Transfer: Captures and replicates subtle characteristics such as specific accents and emotional states, moving beyond simple pitch shifting.
- Efficient Modeling: The low total number of quantization levels across dimensions allows for a highly compressed yet expressive content representation.
- Flexible Identity Control: Can adapt to a wide range of target speakers by simply extracting a style embedding from a short audio sample.
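The streaming property above comes from the shape of the attention mask rather than from any exotic machinery. The following sketch, under the assumption of a standard chunked-causal layout (the chunk size and frame count are illustrative), builds such a mask: frames attend bidirectionally within their own chunk but only causally across chunks, so output for a chunk can be emitted as soon as that chunk arrives.

```python
import numpy as np

def chunked_causal_mask(n_frames, chunk=4):
    """Boolean attention mask for chunked-causal attention.

    Frame i may attend to frame j iff j's chunk index is not later
    than i's. Within a chunk attention is bidirectional; across
    chunks it is causal, which bounds algorithmic latency to at
    most one chunk of audio.
    """
    chunks = np.arange(n_frames) // chunk
    # Broadcasting compares every (query, key) chunk-index pair.
    return chunks[None, :] <= chunks[:, None]

mask = chunked_causal_mask(8, chunk=4)
# Frame 0 sees its whole chunk (frames 0-3) but none of frames 4-7;
# frame 7 sees everything, since all earlier chunks are available.
```

Smaller chunks lower latency but give each frame less right-hand context, so chunk size is the natural knob trading responsiveness against conversion quality.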