PromptDub: Controllable Expressive Speech Synthesis
using Multimodal Foundation Models

Waris Quamer1,2, Fanjie Kong1, Abhinav Jain1, Abhishek Yanamandra1,
Tuan Dinh1, Zhu Liu1, Vimal Bhat1

1Prime Video, Amazon Science, USA

2Department of Computer Science and Engineering, Texas A&M University, USA

Abstract

Expressive text-to-speech (TTS) for automated dubbing often misses time-varying emotion and lacks multimodal grounding, which hurts cinematic dubbing quality. We propose PromptDub, the first multimodal controllable dubbing pipeline that integrates vision, audio, and language cues into transparent prompts for expressive TTS. Vision- and audio-language models summarize scene, performance, and delivery; an LLM fuses them into brief, editable directions. We then train a TTS model that conditions on these fine-grained prompts and the script to synthesize speech aligned to emotion and timing. We also introduce an emotion-trajectory metric using Dynamic Time Warping (DTW) to assess temporal coherence. Our evaluation shows that PromptDub improves global emotion similarity to 0.90 (vs. 0.87), reduces DTW distance by up to 54%, and is preferred by listeners in ABX tests at a rate of 82% for emotion/prosody matching over state-of-the-art baselines.
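As a rough illustration of the emotion-trajectory idea, the sketch below computes a classic DTW distance between two sequences of emotion feature vectors. It is not the authors' implementation: the per-window emotion features, the Euclidean local cost, and the helper name `dtw_distance` are all assumptions made for this example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Local cost between two emotion feature vectors (assumed Euclidean)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(ref: Sequence[Vector], syn: Sequence[Vector]) -> float:
    """Classic O(n*m) DTW cost between a reference and a synthesized trajectory.

    Each trajectory is a list of per-window emotion vectors (hypothetical
    features; any fixed-size embedding would work the same way).
    """
    n, m = len(ref), len(syn)
    if n == 0 or m == 0:
        raise ValueError("both trajectories must be non-empty")
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning ref[:i] with syn[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(ref[i - 1], syn[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # ref advances, syn frame repeated
                cost[i][j - 1],      # syn advances, ref frame repeated
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

A lower value means the synthesized emotion contour follows the original more closely, even when the two clips differ slightly in duration.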

Block Diagram
PromptDub overview. Multimodal inputs are processed by vision- and audio-language models, which extract relevant scene, emotion, and speaker details. The LLM-summarized description, along with the target text, is consumed by the expressive TTS engine to synthesize the target audio.

Descriptions generated through our automated dubbing director pipeline.

Video 1

Audio Description

1. Speaker Voice Characteristics: The speaker is a male, likely in his mid-twenties, with a voice that is slightly strained and carries a hint of weariness, possibly due to stress or fatigue.
2. Primary Emotional State: The primary emotion conveyed by the speaker is a sense of frustration and irritation, which is palpable through their labored breathing and tense vocal delivery.
3. Emotional Layers and Nuance: Underlying this primary emotion, there is also a noticeable undertone of sadness and disappointment. This is evident from the speaker's slow pace and low pitch, which suggest a more profound emotional state than mere annoyance.
4. Emotional Progression: Throughout the audio, the speaker's frustration appears to escalate, culminating in a moment of heightened emotion where their voice cracks, indicating a breaking point in their emotional state.
5. Supporting Vocal Indicators: Vocal indicators such as the roughness and strain in the speaker's voice, coupled with their labored breathing and fluctuating pitch, support the analysis of a complex emotional landscape.
6. Acoustic Environment: The acoustic environment seems to contribute to the overall emotional tone, with a reverberant quality suggesting a large, open space, possibly indicative of a public or formal setting. The lack of background noise allows for a clear focus on the speaker's voice and emotional expression.
7. Overall Emotional Description: In summary, the speaker's voice conveys a deep-seated combination of frustration, sadness, and weariness, reflecting a complex emotional landscape that is both intense and nuanced.

Video Description

1. Scene Setting: The scene is set in a formal office environment with wooden paneling and large windows covered by blinds, creating an atmosphere of seriousness and professionalism.
2. Speaker Demographics: The speaker appears to be a middle-aged man wearing glasses, dressed in a brown suit, white shirt, and striped tie, suggesting a professional or academic background.
3. Actions: The man in the suit is seated at a desk, speaking animatedly while gesturing with his right hand and occasionally taking sips from a glass of water.
4. Character Interactions: The man in the suit is engaged in a conversation with another person who is partially visible on the left side of the frame, indicating a discussion or interview setting.
5. Emotional Expression: The man's facial expressions suggest he is conveying important information or making a point, as indicated by his open mouth and expressive gestures.
6. Visual Composition: The camera is positioned at eye level, focusing on the man in the suit, with a clear view of his upper body and the documents on the desk, emphasizing the importance of the conversation.
7. Environmental Acoustics: The space likely has a quiet, intimate acoustic property, suitable for a professional or academic setting where focused discussions take place.

Description Summary

A middle-aged man speaks in a formal office setting. His core emotional state is frustration with undertones of sadness and disappointment, escalating to a breaking point. Deliver with a strained, labored voice, fluctuating pitch, and rough volume, set in a quiet, intimate space with moderate reverb and minimal background noise.

Video 2

Audio Description

1. Speaker Voice Characteristics: The speaker is a female, likely in her mid-twenties, with a voice that is clear and slightly nasal.
2. Primary Emotional State: The primary emotion expressed by the speaker is one of frustration and irritation.
3. Emotional Layers and Nuance: Underlying this primary emotion, there is also a sense of sadness and disappointment. These emotions are evident in the speaker's tone and the way she hesitates before speaking, indicating a complex emotional landscape.
4. Emotional Progression: As the speech progresses, there is a noticeable shift in the speaker's tone. Initially, there is a sense of anger and frustration, which softens into a more resigned and somber mood towards the end of the clip.
5. Supporting Vocal Indicators: Vocal indicators such as pauses, sighs, and changes in pitch and volume suggest a range of emotions, from anger and frustration to sadness and disappointment.
6. Acoustic Environment: The recording was done in a relatively quiet room with minimal background noise. However, the reverberation indicates that it might have been recorded in a medium-sized room or possibly an enclosed space.
7. Overall Emotional Description: The overall emotional state conveyed by the speaker is one of emotional turmoil, characterized by moments of anger and frustration followed by a more subdued, resigned mood. The subtlety of the vocal indicators suggests a depth of emotion beyond surface-level frustration.

Video Description

1. Scene Setting: The scene is set in a quiet suburban street with a residential backdrop, featuring well-maintained lawns and trees, suggesting a calm neighborhood environment during daytime.
2. Speaker Demographics: The speaker is an adult female police officer, likely in her middle-aged years, dressed in a standard police uniform.
3. Actions: The officer is standing next to an open police car trunk, engaged in a conversation with another officer, possibly retrieving or storing equipment.
4. Character Interactions: The officer appears to be interacting professionally with another officer, both of whom are part of the same police force, indicated by their uniforms.
5. Emotional Expression: The officer's facial expression suggests attentiveness and professionalism, with a slight smile indicating a friendly but official interaction.
6. Visual Composition: The camera angle is slightly low, focusing on the officer's upper body and face, with the police car and another officer partially visible, creating a sense of depth and context.
7. Environmental Acoustics: The setting likely has ambient noise typical of a suburban area, such as distant traffic or residential sounds, contributing to a peaceful atmosphere.

Description Summary

Female officer (mid-30s), interacting with colleague in quiet suburban street. Core emotional state: frustration and sadness, with undertones of anger and disappointment. Emotional progression: anger gives way to resignation. Vocal qualities: clear, slightly nasal tone, with pauses and pitch changes. Simulate reverb and distant background noise.

Video 3

Audio Description

1. Speaker Voice Characteristics: The speaker is a female, likely in his twenties, with a voice that is smooth and slightly nasal.
2. Primary Emotional State: The primary emotion conveyed by the speaker is one of intense annoyance or fury.
3. Emotional Layers and Nuance: Underlying this anger is a sense of frustration and helplessness, as if the speaker feels trapped or overwhelmed by the situation.
4. Emotional Progression: Initially, the emotion is clearly communicated through a raised pitch and faster speaking rate. As the speech continues, there is a noticeable softening of the voice and a decrease in pace, indicating a controlled effort to manage the anger.
5. Supporting Vocal Indicators: Vocal indicators such as the harsh tone, quickened breathing, and occasional vocal strain suggest the speaker is struggling to maintain composure.
6. Acoustic Environment: The recording was done in a relatively small, enclosed space with hard surfaces, leading to a reverberant effect. There are no background sounds or ambient noise, allowing for a clear focus on the speaker's emotion.
7. Overall Emotional Description: The speaker exhibits a complex emotional landscape characterized by anger and frustration, coupled with a controlled effort to prevent further escalation. The smooth but nasal tone suggests a struggle to maintain composure under stress.

Video Description

1. Scene Setting: The scene takes place in a dimly lit, elegant dining room with ornate wallpaper and framed paintings, creating an intimate and somewhat somber atmosphere.
2. Speaker Demographics: The characters appear to be young adults, with one dressed in a black dress and the other in a white blouse, suggesting a contrast in social status or roles within the setting.
3. Actions: The character in the black dress is carefully pouring tea from a teapot into a cup held by the seated individual in the white blouse.
4. Character Interactions: The standing character is attentively assisting the seated character, indicating a supportive or caring relationship, possibly between a servant and their employer.
5. Emotional Expression: Both characters maintain neutral expressions, with the seated character looking down at the cup and the standing character focused on the task, hinting at a serious or contemplative mood.
6. Visual Composition: The camera is positioned at a low angle, capturing the characters from below, which emphasizes the formality and grandeur of the room while focusing on the interaction at the table.
7. Environmental Acoustics: The space likely has muted acoustics due to its enclosed nature, with any background sounds being subtle and possibly ambient, contributing to the quiet tension of the scene.

Description Summary

A young female (20s) serves tea in a dimly lit, elegant dining room. She conveys intense annoyance and frustration, with underlying helplessness, initially raised and then controlled, speaking at a moderate pace with a smooth but nasal tone, set against a reverberant background with no ambient noise.

Video 4

Audio Description

1. Speaker Voice Characteristics: The speaker is a male likely in his mid-forties with a smooth and slightly raspy voice.
2. Primary Emotional State: The primary emotion conveyed by the speaker is a mixture of frustration and anger, which is palpable through their tone and inflection.
3. Emotional Layers and Nuance: Underlying these primary emotions, there is also a noticeable undertone of sadness and disappointment. This can be inferred from the speaker's slow pace and low volume, indicating a more subdued emotional state.
4. Emotional Progression: Initially, the speaker exhibits a clear and forceful tone, suggesting anger and frustration. As the speech progresses, however, there is a noticeable softening of the voice and a slight drop in pitch, indicating a more somber and resigned mood.
5. Supporting Vocal Indicators: Vocal indicators such as the softening of the voice, the slowing down of speech, and the lower pitch suggest a shift from anger and frustration towards sadness and disappointment.
6. Acoustic Environment: The acoustic environment appears to be a medium-sized room with minimal background noise, allowing for a clear and distinct analysis of the speaker's emotions. The reverberation indicates a possibly open or sparsely furnished space.
7. Overall Emotional Description: In summary, the speaker's voice reflects a complex emotional landscape dominated by anger and frustration but also carrying undertones of sadness and disappointment. The progression of these emotions throughout the speech provides a nuanced understanding of the speaker's feelings.

Video Description

1. Scene Setting: A dimly lit bar with blue neon lights and a bustling background, creating an intimate yet lively atmosphere.
2. Speaker Demographics: The characters appear to be middle-aged men, dressed in dark clothing.
3. Actions: One man gestures with his hand while speaking, and the other listens attentively.
4. Character Interactions: They are engaged in a serious conversation, with the first man taking the lead in the discussion.
5. Emotional Expression: Both men display focused expressions, indicating the importance of their conversation.
6. Visual Composition: The camera is positioned at a low angle, capturing the characters' upper bodies and the surrounding environment.
7. Environmental Acoustics: The space likely has a moderate acoustic quality, with ambient noise from the bar's activities.

Description Summary

Male speaker, middle-aged, delivers a frustrated and angry tone with undertones of sadness and disappointment in a dimly lit bar. Emotions shift from forceful to somber and resigned. Speak with a smooth, raspy voice, moderate pace, and low volume, simulating a medium-sized room with minimal background noise and slight reverb.

Video 5

Audio Description

1. Speaker Voice Characteristics: The speaker is a male, likely in his mid-fifties, with a voice that is smooth and slightly nasal.
2. Primary Emotional State: The primary emotion conveyed by the speaker is a sense of irritation or annoyance.
3. Emotional Layers and Nuance: Underlying this primary emotion, there is also a hint of frustration and a subtle undertone of desperation. As the speech progresses, these emotions become more pronounced, indicating a rising agitation.
4. Emotional Progression: Initially, the emotion starts with a mild sense of irritation but gradually intensifies into a more animated and frustrated state. This progression suggests that the speaker may be reacting to a situation that has escalated beyond their control.
5. Supporting Vocal Indicators: Vocal indicators such as a faster speaking rate, louder volume, and a slightly tense tone support the analysis of increasing frustration and agitation.
6. Acoustic Environment: The acoustic environment seems to be a quiet and controlled space, as there are no noticeable reverberations or background noises. This could suggest a formal or professional setting where the speaker is trying to convey their message clearly.
7. Overall Emotional Description: In summary, the speaker exhibits a clear sense of irritation and frustration, which progressively intensifies into a state of agitation. The controlled environment and lack of background noise amplify this emotional journey.

Video Description

1. Scene Setting: A dimly lit office with a group of people in suits gathered around a central figure, creating an atmosphere of tension and seriousness.
2. Speaker Demographics: Middle-aged men, likely professionals or executives.
3. Actions: The central figure appears to be addressing the group, while others are standing attentively, listening.
4. Character Interactions: The characters are engaged in a serious discussion, with some individuals leaning in for closer attention, indicating a focused and possibly confrontational interaction.
5. Emotional Expression: The facial expressions suggest concern and intensity, with some individuals displaying signs of stress or contemplation.
6. Visual Composition: The camera is positioned at a low angle, emphasizing the stature and presence of the central figure, while the rest of the room is framed to show the group's engagement.
7. Environmental Acoustics: The space likely has an echoey quality due to its size and the lack of visible sound-absorbing materials, suggesting that any dialogue would carry throughout the room.

Description Summary

Male, mid-fifties, addresses a tense office gathering with a mix of concern and intensity. Irritation and frustration intensify into agitation, conveyed through a faster pace, louder volume, and tense tone, with desperation underlying. Simulate a quiet, controlled space with minimal reverb.

Audio Samples

Original
Baseline (EmoVoice)
PromptDub (Ours)

A. Language Model Prompts

A.1 Vision LM Prompt

Prompt used for structured, observation-grounded video scene analysis.

[
        {
            "role": "system",
            "content": "You are an expert video scene analyst who provides objective, structured descriptions of video content. Your analysis focuses solely on observable details within the frames. You describe scenes using a specific template format, carefully avoiding assumptions or hallucinations about information not directly visible. When analyzing character interactions and emotions, you may make reasonable inferences based only on clear visual cues present in the footage."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": {
                        "video_path": video_path,
                        "fps": 1,
                        "max_frames": 180
                    }
                },
                {
                    "type": "text",
                    "text": "Please describe the scene in the following structured format:\nScene Setting: Describe the physical location, lighting conditions, and overall atmosphere in one sentence.\nSpeaker Demographics: Describe the approximate age range (child, teenager, young adult, middle-aged, elderly) and gender presentation of any speaking characters.\nActions: Describe the primary movements and actions of the characters in one sentence, focusing on what they are physically doing.\nCharacter Interactions: Describe how the characters relate to and engage with each other in one sentence, including eye contact, spatial relationships, and non-verbal communication.\nEmotional Expression: Describe the facial expressions, body language, and emotional cues displayed by the characters in one sentence.\nVisual Composition: Describe the camera angle, framing, and any notable cinematographic techniques used in one sentence.\nEnvironmental Acoustics: Describe the likely acoustic properties of the space and any visible sources of background sound.\n\n### ANSWER TEMPLATE ###\n1. Scene Setting: '';\n2. Speaker Demographics: '';\n3. Actions: '';\n4. Character Interactions: '';\n5. Emotional Expression: '';\n6. Visual Composition: '';\n7. Environmental Acoustics: ''.\n\nNote: Focus only on observable details and avoid hallucinations."
                }
            ]
        }
    ]
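Because the prompt asks for a fixed seven-field answer template, the model's reply can be checked mechanically before it is passed downstream. The following is a minimal sketch of such a check, assuming the reply follows the numbered `N. Field Name: text` layout; `parse_scene_description` and `missing_fields` are hypothetical helper names, not part of the published pipeline.

```python
import re

# Field names copied from the answer template in the prompt above.
EXPECTED_FIELDS = [
    "Scene Setting",
    "Speaker Demographics",
    "Actions",
    "Character Interactions",
    "Emotional Expression",
    "Visual Composition",
    "Environmental Acoustics",
]

# Matches lines such as "3. Actions: The man is seated at a desk."
_FIELD_LINE = re.compile(r"^\s*\d+\.\s*([^:]+):\s*(.*?)\s*$")


def parse_scene_description(reply: str) -> dict[str, str]:
    """Map each numbered field in the reply to its description text."""
    fields: dict[str, str] = {}
    for line in reply.splitlines():
        match = _FIELD_LINE.match(line)
        if match:
            name = match.group(1).strip()
            # Drop the template's trailing ';' and surrounding quotes, if any.
            value = match.group(2).rstrip(";").strip().strip("'\"")
            fields[name] = value
    return fields


def missing_fields(fields: dict[str, str]) -> list[str]:
    """Return template fields that are absent or empty in the parsed reply."""
    return [name for name in EXPECTED_FIELDS if not fields.get(name)]
```

A reply with missing fields could, for example, be retried or flagged for manual review; which policy to use is left open here.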

A.2 Audio LM Prompt

Prompt used for layered emotional and acoustic analysis from speech.

[
        {
            "role": "system",
            "content": "You are an expert in audio-based emotional analysis with exceptional ability to detect nuanced emotional states from speech patterns, vocal tones, and audio cues. Your task is to provide detailed, layered emotional descriptions from the audio input. Do not hallucinate information."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_path
                },
                {
                    "type": "text",
                    "text": "When analyzing the provided audio, please:\n1. Identify the primary emotional state expressed\n2. Detect underlying or secondary emotions present\n3. Note the intensity and progression of emotions\n4. Analyze vocal qualities (pitch, pacing, volume, strain)\n5. Use rich emotional vocabulary\n6. Identify speaker age range and gender\n7. Analyze the acoustic environment\n\nStructure the response as:\n## Speaker Voice Characteristics\n## Primary Emotional State\n## Emotional Layers and Nuance\n## Emotional Progression\n## Supporting Vocal Indicators\n## Acoustic Environment\n## Overall Emotional Description"
                }
            ]
        }
    ]
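The audio prompt requests a reply organized under `##` section headings. A small parser like the one below could split such a reply into named sections; it is only a sketch that assumes the model actually follows the requested heading structure, and `split_sections` is a hypothetical helper name.

```python
def split_sections(reply: str) -> dict[str, str]:
    """Split a reply into {heading: body} using lines that start with '## '."""
    collected: dict[str, list[str]] = {}
    current = None
    for line in reply.splitlines():
        if line.startswith("## "):
            # A new section begins; later lines belong to it.
            current = line[3:].strip()
            collected[current] = []
        elif current is not None:
            collected[current].append(line)
        # Lines before the first heading are ignored.
    return {name: "\n".join(body).strip() for name, body in collected.items()}
```

Text that appears before the first heading is discarded in this sketch, which may or may not be the desired behavior for a given model.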

A.3 LLM Prompt (Dubbing Director)

Prompt used to fuse multimodal analyses into concise, TTS-ready performance instructions.

[
        {
            "role": "assistant",
            "content": [
                {
                    "text": "You are an expert dubbing director creating concise TTS-ready performance instructions."
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "text": "Synthesize the provided video and audio analysis into extremely brief performance guidance (maximum 45 words).\n\n## INPUT:\n1. VIDEO ANALYSIS\n{video_desc}\n\n2. AUDIO ANALYSIS\n{audio_desc}\n\n## OUTPUT REQUIREMENTS:\n- Speaker demographics\n- Brief scene context\n- Core emotional state and nuance\n- Emotional progression\n- Vocal qualities (pace, pitch, volume)\n- Environmental acoustics\n\nUse only TTS-compatible notation and avoid complex formatting."
                }
            ]
        }
    ]
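The `{video_desc}` and `{audio_desc}` placeholders suggest the prompt is filled in per clip. The sketch below shows one way this could be done with `str.format`, together with a simple check of the 45-word limit on the model's output. The helper names `build_director_messages` and `within_word_limit` are assumptions for illustration, and whitespace-based word counting is only one possible reading of the limit.

```python
SYSTEM_TEXT = (
    "You are an expert dubbing director creating concise TTS-ready "
    "performance instructions."
)

# User prompt text taken from the listing above, with its two placeholders.
USER_TEMPLATE = (
    "Synthesize the provided video and audio analysis into extremely brief "
    "performance guidance (maximum 45 words).\n\n"
    "## INPUT:\n1. VIDEO ANALYSIS\n{video_desc}\n\n"
    "2. AUDIO ANALYSIS\n{audio_desc}\n\n"
    "## OUTPUT REQUIREMENTS:\n- Speaker demographics\n- Brief scene context\n"
    "- Core emotional state and nuance\n- Emotional progression\n"
    "- Vocal qualities (pace, pitch, volume)\n- Environmental acoustics\n\n"
    "Use only TTS-compatible notation and avoid complex formatting."
)


def build_director_messages(video_desc: str, audio_desc: str) -> list[dict]:
    """Fill the template and return messages in the same shape as the listing."""
    user_text = USER_TEMPLATE.format(
        video_desc=video_desc.strip(), audio_desc=audio_desc.strip()
    )
    return [
        {"role": "assistant", "content": [{"text": SYSTEM_TEXT}]},
        {"role": "user", "content": [{"text": user_text}]},
    ]


def within_word_limit(summary: str, max_words: int = 45) -> bool:
    """Check a generated summary against the prompt's word budget."""
    return len(summary.split()) <= max_words
```

A summary that fails `within_word_limit` could be regenerated or trimmed by an editor, which fits the stated goal of brief, editable directions.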

Disclaimer: The comparisons presented on this website are for research purposes only and serve as an instantiation of the associated paper.

Dataset Citations: