End-to-End Streaming Model for Low-Latency Speech Anonymization
Waris Quamer1, Ricardo Gutierrez-Osuna1
1Department of Computer Science and Engineering, Texas A&M University, USA
Abstract
Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content.
Current machine-learning-based approaches require substantial computational resources, which hinders their use in real-time streaming applications.
To address this limitation, we propose a streaming model that achieves speaker anonymization with low latency.
The system is trained end-to-end as an autoencoder using a lightweight content encoder that extracts HuBERT-like information,
a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information.
These three disentangled representations are fed to a decoder that re-synthesizes the speech signal.
We present evaluation results from two implementations of our system: a full model that achieves a latency of 230 ms,
and a lite version (0.1x the size) that further reduces latency to 66 ms while maintaining state-of-the-art performance in naturalness, intelligibility,
and privacy preservation.
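The data flow described above (three separate representations combined and passed to a decoder, one audio chunk at a time) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names, the toy frame size, and the hand-written features that stand in for the learned encoders (a frame mean for content, a zero-crossing count as a pitch proxy, mean squared amplitude as energy). It also assumes that the speaker vector at inference is supplied by the caller, so that a different vector can replace the source speaker's. The decoder is omitted; the sketch stops at the frame-aligned input a decoder would consume.

```python
from typing import Iterator, List

Vector = List[float]

FRAME = 4  # samples per frame (toy value, not taken from the paper)


def split_frames(chunk: Vector) -> List[Vector]:
    """Cut a chunk into non-overlapping frames, dropping any ragged tail."""
    usable = len(chunk) - len(chunk) % FRAME
    return [chunk[i:i + FRAME] for i in range(0, usable, FRAME)]


def content_encoder(chunk: Vector) -> List[Vector]:
    """Hypothetical stand-in for the content encoder: frame mean per frame."""
    return [[sum(f) / FRAME] for f in split_frames(chunk)]


def variance_encoder(chunk: Vector) -> List[Vector]:
    """Hypothetical stand-in for the variance encoder: [pitch proxy, energy]."""
    feats = []
    for f in split_frames(chunk):
        zero_crossings = sum(1 for a, b in zip(f, f[1:]) if a * b < 0)
        energy = sum(x * x for x in f) / FRAME
        feats.append([float(zero_crossings), energy])
    return feats


def decoder_input(content: List[Vector], speaker: Vector,
                  variance: List[Vector]) -> List[Vector]:
    """Broadcast the speaker vector to every frame and concatenate all three."""
    if len(content) != len(variance):
        raise ValueError("content and variance streams must be frame-aligned")
    return [c + speaker + v for c, v in zip(content, variance)]


def stream(signal: Vector, speaker: Vector,
           chunk_size: int) -> Iterator[List[Vector]]:
    """Process the signal chunk by chunk, yielding decoder inputs per chunk."""
    for start in range(0, len(signal), chunk_size):
        chunk = signal[start:start + chunk_size]
        yield decoder_input(content_encoder(chunk), speaker,
                            variance_encoder(chunk))


if __name__ == "__main__":
    signal = [1.0, -1.0, 1.0, -1.0, 2.0, 2.0, 2.0, 2.0]
    target_speaker = [0.5, 0.5]  # hypothetical replacement speaker vector
    for frames in stream(signal, target_speaker, chunk_size=4):
        print(frames)
```

In this sketch each chunk is handled independently, so output for a chunk is available as soon as that chunk has been read. A real streaming model would typically also carry state or context across chunks, which this sketch deliberately leaves out.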
Block Diagram
Block diagram of the proposed system: (a) training flow; (b) inference flow.