RAFT-Edit Rectified-Flow Audio Demo

Age manipulation

Age pseudo-label response from the external speech age predictor.

Each cell gives value · cos.

Cohort | Source clip                          | Reconstruction | -2.0         | -1.0         | 0.0          | +1.0         | +2.0
20s M  | LibriTTS · 7640_111784_000011_000002 | 0.427 · 0.86   | 0.279 · 0.64 | 0.246 · 0.65 | 0.427 · 0.86 | 0.599 · 0.58 | 0.737 · 0.49
20s F  | VoxCeleb1 · 00011                    | 0.408 · 0.88   | 0.235 · 0.61 | 0.240 · 0.46 | 0.408 · 0.88 | 0.589 · 0.48 | 0.757 · 0.51
40s M  | NaturalVoices · MSP-PODCAST_1632_104 | 0.416 · 0.92   | 0.157 · 0.63 | 0.252 · 0.68 | 0.416 · 0.92 | 0.710 · 0.56 | 0.772 · 0.62
40s F  | GLOBE · 00105921_000004.v2.vad       | 0.520 · 0.89   | 0.203 · 0.58 | 0.287 · 0.59 | 0.520 · 0.89 | 0.708 · 0.62 | 0.809 · 0.58
60s M  | LibriTTS · 6804_79287_000013_000013  | 0.588 · 0.90   | 0.198 · 0.56 | 0.407 · 0.64 | 0.588 · 0.90 | 0.739 · 0.50 | 0.763 · 0.52
60s F  | NaturalVoices · MSP-PODCAST_5606_43  | 0.488 · 0.89   | 0.295 · 0.55 | 0.394 · 0.66 | 0.488 · 0.89 | 0.719 · 0.71 | 0.762 · 0.70
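
One way to read the age table is as a response curve: does the predictor value rise as the edit strength goes from -2.0 to +2.0? The sketch below checks this for two rows copied from the table. The helper name `is_monotonic_increasing` is hypothetical and not part of the demo.

```python
from typing import Sequence


def is_monotonic_increasing(values: Sequence[float]) -> bool:
    """Return True if every value is >= the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Predictor values at edit strengths -2.0, -1.0, 0.0, +1.0, +2.0,
# copied from the 20s M and 40s M rows of the age table.
rows = {
    "20s M": [0.279, 0.246, 0.427, 0.599, 0.737],
    "40s M": [0.157, 0.252, 0.416, 0.710, 0.772],
}

for cohort, values in rows.items():
    span = values[-1] - values[0]  # total change from -2.0 to +2.0
    print(cohort, is_monotonic_increasing(values), round(span, 3))
```

A strict monotonicity test is deliberately harsh here. The 20s M row dips slightly between -2.0 and -1.0 even though the overall trend is upward, so a rank correlation may be a gentler summary.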

Perceived gender presentation manipulation

Model-predicted male-presentation probability. This is a pseudo-label, not demographic ground truth.

Each cell gives value · cos.

Cohort | Source clip                          | Reconstruction | -2.0         | -1.0         | 0.0          | +1.0         | +2.0
20s M  | GLOBE · 00266973_000004.v2.vad       | 0.995 · 0.83   | 0.001 · 0.73 | 0.099 · 0.79 | 0.995 · 0.83 | 0.998 · 0.64 | 0.997 · 0.71
20s F  | VoxCeleb1 · 00002                    | 0.007 · 0.89   | 0.003 · 0.72 | 0.006 · 0.84 | 0.007 · 0.89 | 0.814 · 0.60 | 0.996 · 0.59
40s M  | GLOBE · 00431096_000005.v2.vad       | 0.752 · 0.86   | 0.001 · 0.69 | 0.195 · 0.80 | 0.752 · 0.86 | 0.995 · 0.61 | 0.994 · 0.72
40s F  | NaturalVoices · MSP-PODCAST_0478_1   | 0.525 · 0.82   | 0.002 · 0.67 | 0.001 · 0.52 | 0.525 · 0.82 | 0.975 · 0.62 | 0.954 · 0.66
60s M  | LibriTTS · 3955_181692_000011_000004 | 0.789 · 0.89   | 0.001 · 0.75 | 0.280 · 0.62 | 0.789 · 0.89 | 0.997 · 0.60 | 0.989 · 0.67
60s F  | LibriTTS · 1752_16632_000036_000011  | 0.002 · 0.92   | 0.001 · 0.73 | 0.001 · 0.88 | 0.002 · 0.92 | 0.510 · 0.65 | 0.991 · 0.69
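
Because the values in this table are probabilities, a simple summary is the smallest edit strength at which the predicted male-presentation probability crosses a decision threshold. The sketch below assumes a threshold of 0.5, which the page does not state. `first_crossing` is a hypothetical helper.

```python
from typing import Optional, Sequence

STRENGTHS = [-2.0, -1.0, 0.0, 1.0, 2.0]


def first_crossing(
    probs: Sequence[float],
    strengths: Sequence[float] = STRENGTHS,
    threshold: float = 0.5,  # assumed decision threshold, not from the page
) -> Optional[float]:
    """Return the first strength whose probability is >= threshold, else None."""
    for strength, prob in zip(strengths, probs):
        if prob >= threshold:
            return strength
    return None


# Rows copied from the table: 20s F and 60s F sources.
row_20s_f = [0.003, 0.006, 0.007, 0.814, 0.996]
row_60s_f = [0.001, 0.001, 0.002, 0.510, 0.991]

print(first_crossing(row_20s_f))
print(first_crossing(row_60s_f))
```

The 60s F row only just crosses 0.5 at +1.0 (0.510), so the result is sensitive to where the threshold is placed.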

Habitual pitch manipulation

Median log-F0, measured acoustically from the generated waveform.

Each cell gives value · cos.

Cohort | Source clip                          | Reconstruction | -2.0         | -1.0         | 0.0          | +1.0         | +2.0
20s M  | NaturalVoices · MSP-PODCAST_0658_440 | 4.957 · 0.87   | 4.572 · 0.72 | 4.590 · 0.81 | 4.957 · 0.87 | 5.320 · 0.66 | 5.524 · 0.63
20s F  | GLOBE · 00022382_000004.v2.vad       | 5.253 · 0.89   | 4.551 · 0.77 | 4.893 · 0.76 | 5.253 · 0.89 | 5.435 · 0.77 | 5.460 · 0.76
40s M  | LibriTTS · 90_130566_000006_000001   | 4.944 · 0.90   | 4.569 · 0.78 | 4.616 · 0.72 | 4.944 · 0.90 | 5.312 · 0.66 | 5.498 · 0.71
40s F  | LibriTTS · 7481_101276_000071_000000 | 5.042 · 0.91   | 4.525 · 0.76 | 4.540 · 0.86 | 5.042 · 0.91 | 5.400 · 0.52 | 5.498 · 0.60
60s M  | VoxCeleb1 · 00038                    | 4.894 · 0.92   | 4.465 · 0.75 | 4.632 · 0.66 | 4.894 · 0.92 | 5.296 · 0.65 | 5.414 · 0.66
60s F  | LibriTTS · 8778_246974_000024_000009 | 4.924 · 0.89   | 4.522 · 0.72 | 4.594 · 0.81 | 4.924 · 0.89 | 5.258 · 0.58 | 5.488 · 0.66
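
The log-F0 values are easier to interpret in hertz or semitones. The sketch below assumes the values are natural logarithms of F0 in Hz. The page does not state the base or unit, but the magnitudes (roughly 4.5 to 5.5) are consistent with typical speaking pitch under that assumption. Both function names are hypothetical.

```python
import math


def logf0_to_hz(log_f0: float) -> float:
    """Convert a log-F0 value to Hz, assuming natural log of Hz."""
    return math.exp(log_f0)


def semitone_shift(log_f0_from: float, log_f0_to: float) -> float:
    """Pitch shift in semitones between two natural-log F0 values."""
    return 12.0 * (log_f0_to - log_f0_from) / math.log(2.0)


# 20s M row of the pitch table: reconstruction, -2.0 edit, +2.0 edit.
recon, low, high = 4.957, 4.572, 5.524

print(f"reconstruction: {logf0_to_hz(recon):.1f} Hz")
print(f"-2.0 edit:      {logf0_to_hz(low):.1f} Hz")
print(f"+2.0 edit:      {logf0_to_hz(high):.1f} Hz")
print(f"range -2.0 to +2.0: {semitone_shift(low, high):.1f} semitones")
```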

Voice-quality / HNR manipulation

HNR proxy measured from the generated waveform. This metric is noisier than the others and can hit its floor on difficult utterances.

Each cell gives value · cos.

Cohort | Source clip                          | Reconstruction  | -2.0            | -1.0            | 0.0             | +1.0            | +2.0
20s M  | LibriTTS · 27_124992_000059_000001   | -200.000 · 0.90 | -200.000 · 0.67 | -200.000 · 0.69 | -200.000 · 0.90 | 1.514 · 0.66    | 4.417 · 0.66
20s F  | VoxCeleb1 · 00002                    | -200.000 · 0.89 | -200.000 · 0.68 | -200.000 · 0.74 | -200.000 · 0.89 | 5.761 · 0.73    | -0.053 · 0.84
40s M  | GLOBE · 00057675_000001.v2.vad       | -0.526 · 0.86   | -200.000 · 0.65 | -200.000 · 0.68 | -0.526 · 0.86   | 2.843 · 0.66    | 3.828 · 0.64
40s F  | GLOBE · 00382646_000013.v2.vad       | 5.672 · 0.90    | -200.000 · 0.62 | -200.000 · 0.68 | 5.672 · 0.90    | 13.362 · 0.77   | 7.864 · 0.85
60s M  | GLOBE · 00219094_000014.v2.vad       | -200.000 · 0.85 | -200.000 · 0.65 | -200.000 · 0.70 | -200.000 · 0.85 | 4.511 · 0.68    | 6.187 · 0.65
60s F  | LibriTTS · 4234_187735_000005_000000 | 0.106 · 0.92    | -200.000 · 0.64 | -200.000 · 0.62 | 0.106 · 0.92    | 5.212 · 0.74    | 2.503 · 0.83
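
Since the metric can floor, the repeated -200.000 entries are probably best read as "no usable measurement" rather than as real HNR values. The sketch below assumes -200.000 is that floor sentinel, which the page implies but does not state, and masks it before averaging. `mask_floor` and `mean_valid` are hypothetical helpers.

```python
import math
from typing import List, Optional, Sequence

HNR_FLOOR = -200.0  # assumed floor sentinel, inferred from the table


def mask_floor(
    values: Sequence[float], floor: float = HNR_FLOOR
) -> List[Optional[float]]:
    """Replace floored entries with None so they are not averaged."""
    return [None if math.isclose(v, floor) else v for v in values]


def mean_valid(values: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the non-None entries, or None if nothing is left."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None


# Columns copied from the HNR table, one entry per cohort row.
plus_one = [1.514, 5.761, 2.843, 13.362, 4.511, 5.212]
minus_two = [-200.0, -200.0, -200.0, -200.0, -200.0, -200.0]

print(mean_valid(mask_floor(plus_one)))
print(mean_valid(mask_floor(minus_two)))
```

Averaging without the mask would let a single floored clip dominate the column mean, which is why the sentinel is dropped rather than clipped.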