TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization
Introduces a streamable synthesizer that aligns speaker identity with content at the frame level, achieving under 80 ms latency on GPU while improving naturalness, speaker transfer, and anonymization.
