Abstract
Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine-learning-based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system: a full model that achieves a latency of 230ms, and a lite version (0.1x in size) that further reduces latency to 66ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
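The data flow described in the abstract (content, speaker, and variance representations fed chunk by chunk to a decoder) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name here is hypothetical, the "encoders" are toy arithmetic stand-ins for the neural modules, the variance stand-in covers energy only (no pitch), and the speaker embedding is assumed to be a fixed vector supplied by the caller.

```python
import math
from typing import Iterable, Iterator, List

Chunk = List[float]  # one chunk of waveform samples (hypothetical representation)


def rms(samples: Chunk) -> float:
    """Root-mean-square energy of a chunk (0.0 for an empty chunk)."""
    return math.sqrt(sum(x * x for x in samples) / len(samples)) if samples else 0.0


def content_encoder(chunk: Chunk) -> Chunk:
    """Toy stand-in for the content encoder: peak-normalise the chunk so the
    result keeps the waveform shape but not its loudness."""
    peak = max((abs(x) for x in chunk), default=0.0)
    return [x / peak for x in chunk] if peak > 0 else list(chunk)


def variance_encoder(chunk: Chunk) -> float:
    """Toy stand-in for the variance encoder: energy only, pitch omitted."""
    return rms(chunk)


def decoder(content: Chunk, speaker: List[float], energy: float) -> Chunk:
    """Toy decoder: rescale the content to the requested energy, then apply a
    gain derived from the speaker embedding (assumption: mean of the vector)."""
    content_rms = rms(content)
    scale = energy / content_rms if content_rms > 0 else 0.0
    gain = sum(speaker) / len(speaker)
    return [x * scale * gain for x in content]


def anonymize_stream(chunks: Iterable[Chunk], speaker: List[float]) -> Iterator[Chunk]:
    """Process audio one chunk at a time, yielding each output chunk as soon as
    it is ready instead of waiting for the whole utterance."""
    for chunk in chunks:
        yield decoder(content_encoder(chunk), speaker, variance_encoder(chunk))
```

In this toy, passing a speaker vector whose mean is 1.0 reproduces the input chunk, mirroring the autoencoder-style reconstruction objective, while any other vector changes the output. Because `anonymize_stream` is a generator, output latency depends on the chunk length rather than the utterance length, which is the property a streaming design relies on.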
Block Diagram
Notes
- Dataset (CMU-ARCTIC corpus): http://www.festvox.org/cmu_arctic/
Audio Samples
- Input speech: original unmodified speech recordings
- Base: input speech anonymized through the base version (the full model described in the abstract)
- Lite: input speech anonymized through the lite version
Speaker | Text | Input speech | Base | Lite |
---|---|---|---|---|
BDL | Author of the danger trail Philip Steels and etc. | | | |
BDL | Not at this particular case Tom apologized Whittemore. | | | |
BDL | For the twentieth time that evening the two men shook hands. | | | |
CLB | Lord but I'm glad to see you again Phil. | | | |
CLB | Will we ever forget it. | | | |
CLB | God bless 'em I hope I will go on seeing them forever. | | | |
RMS | And you always want to see it in the superlative degree. | | | |
RMS | Gad your letter came just in time. | | | |
RMS | He turned sharply and faced Gregson across the table. | | | |
SLT | I'm playing a single hand in what looks like a losing game. | | | |
SLT | If I ever needed a fighter in my life I need one now. | | | |
SLT | Gregson shoved back his chair and rose to his feet. | | | |