End-to-end Streaming model for Low-Latency Speech Anonymization

Waris Quamer1, Ricardo Gutierrez-Osuna1

1Department of Computer Science and Engineering, Texas A&M University, USA

Abstract

Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine-learning-based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained end-to-end in an autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system: a full model that achieves a latency of 230ms, and a lite version (0.1x in size) that further reduces latency to 66ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
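As a rough sketch of the architecture described in the abstract, the PyTorch module below wires a lightweight content encoder, a pretrained speaker embedding, and a pitch/energy variance encoder into a decoder that reconstructs acoustic features. All layer choices, dimensions, and names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative PyTorch sketch of the three-branch autoencoder outlined in the
# abstract; layer types and sizes are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class StreamingAnonymizer(nn.Module):
    def __init__(self, n_mels=80, content_dim=256, speaker_dim=192, hidden_dim=512):
        super().__init__()
        # Lightweight content encoder: extracts HuBERT-like linguistic features.
        # (A streaming implementation would use causal convolutions.)
        self.content_encoder = nn.Sequential(
            nn.Conv1d(n_mels, content_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(content_dim, content_dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Variance encoder: injects frame-level pitch (F0) and energy.
        self.variance_encoder = nn.Linear(2, content_dim)
        # Decoder: fuses the three disentangled streams and re-synthesizes
        # acoustic features (a neural vocoder would produce the waveform).
        self.decoder = nn.GRU(content_dim + speaker_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, n_mels)

    def forward(self, mels, f0, energy, speaker_emb):
        # mels: (B, T, n_mels); f0, energy: (B, T); speaker_emb: (B, speaker_dim)
        # speaker_emb comes from a pretrained speaker encoder (kept external here).
        content = self.content_encoder(mels.transpose(1, 2)).transpose(1, 2)
        variance = self.variance_encoder(torch.stack([f0, energy], dim=-1))
        spk = speaker_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        fused = torch.cat([content + variance, spk], dim=-1)
        out, _ = self.decoder(fused)
        return self.proj(out)  # reconstructed acoustic features, (B, T, n_mels)
```

During training the model presumably reconstructs the input (autoencoder objective); at inference the speaker embedding can be swapped to alter the perceived voice identity.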

Block Diagram

[Figure: Block diagram of the proposed system. (a) Training flow; (b) Inference flow.]
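As one way to picture the low-latency inference flow, the hypothetical loop below processes audio chunk by chunk while carrying recurrent state across chunks; the chunk size and the `model` interface are assumptions, not details taken from the paper.

```python
# Hypothetical chunk-by-chunk streaming loop; CHUNK_MS and the model/speaker
# interfaces are illustrative assumptions, not the paper's implementation.
SAMPLE_RATE = 16000
CHUNK_MS = 20                                   # assumed processing hop
CHUNK = SAMPLE_RATE * CHUNK_MS // 1000          # samples per chunk

def stream_anonymize(chunks, model, speaker_emb):
    """Yield anonymized audio one chunk at a time.

    chunks: iterable of 1-D float arrays of length CHUNK.
    model:  callable (chunk, speaker_emb, state) -> (out_chunk, state).
    """
    state = None
    for chunk in chunks:
        out, state = model(chunk, speaker_emb, state)  # carry state across chunks
        yield out
```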

Notes

Audio Samples

Each sentence below is accompanied by three audio clips on the demo page: the input speech and the anonymized outputs from the Base and Lite models.

Speaker BDL
  Author of the danger trail, Philip Steels, etc.
  Not at this particular case, Tom, apologized Whittemore.
  For the twentieth time that evening the two men shook hands.

Speaker CLB
  Lord, but I'm glad to see you again, Phil.
  Will we ever forget it.
  God bless 'em, I hope I will go on seeing them forever.

Speaker RMS
  And you always want to see it in the superlative degree.
  Gad, your letter came just in time.
  He turned sharply and faced Gregson across the table.

Speaker SLT
  I'm playing a single hand in what looks like a losing game.
  If I ever needed a fighter in my life I need one now.
  Gregson shoved back his chair and rose to his feet.