Multitalk: Generate talking head video from image and audio

Demo summary
The user demonstrates uploading a reference image and an audio clip to Multi-Talk, configuring background removal and text prompts to generate a video of a woman speaking in a park.
Step-by-step
- Select a character injection mode, such as 'inject only people or objects'
- Upload a reference image of your character
- Toggle 'remove background' if you want to place the character in a new setting
- Upload the audio clip for the character to speak
- Enter a text prompt describing the final scene and background
- Select the aspect ratio and resolution
- Set the number of frames from the audio duration: seconds multiplied by 25 (25 frames per second), so a 10-second clip needs 250 frames
- Adjust inference steps and guidance scales if needed, then click Generate
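The frame calculation in the steps above can be scripted. The sketch below reads a WAV file's duration with Python's standard-library `wave` module and applies the seconds-multiplied-by-25 rule. It assumes the audio is an uncompressed WAV file (MP3 or other formats would need a different reader). Rounding up so the video never ends before the audio is also an assumption; the tutorial only says to multiply by 25. The function names are hypothetical and are not part of Multi-Talk or Wan2GP.

```python
import io
import math
import wave

FPS = 25  # frames per second, from the "seconds multiplied by 25" rule


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frames_for_duration(seconds: float, fps: int = FPS) -> int:
    """Number of video frames needed to cover `seconds` of audio.

    Rounds up (an assumption) so the video is never shorter than the audio.
    Rounding to 6 decimals first absorbs floating-point noise such as
    50.000000000001, which would otherwise add a spurious extra frame.
    """
    return math.ceil(round(seconds * fps, 6))


if __name__ == "__main__":
    # Build a 2-second silent mono WAV in memory to demonstrate the calculation.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)      # 16-bit samples
        out.setframerate(16000)  # 16 kHz sample rate
        out.writeframes(b"\x00\x00" * 16000 * 2)
    buf.seek(0)
    seconds = wav_duration_seconds(buf)
    print(seconds, frames_for_duration(seconds))  # prints: 2.0 50
```

In practice you would call `frames_for_duration(wav_duration_seconds("speech.wav"))` on your own clip and enter the result as the frame count.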
Options
- Generate video using only a text description without a reference image
- Keep the original background from the reference image
- Activate TeaCache or MagCache to speed up generation by up to 2.5x
- Use temporal or spatial upsamplers for higher quality output
Watch out for
- The first run requires downloading large files, including a 15GB text-video fusion model
- Lowering inference steps to speed up generation will sacrifice video quality
- TeaCache and MagCache speed up generation by skipping steps, which may reduce quality
Tips
- Upload a reference image rather than using text-only for more control over the character
- If the lip-sync is inaccurate, increase the audio guidance scale
- Leave the advanced settings at their defaults; they usually work fine without adjustment
- Use a GPU with sufficient VRAM (e.g., 16GB) to avoid needing speed-optimization caches
Highlights
“usually everything just works fine without you needing to change any of these settings”
All demos from “Make AI videos with talking + pose + reference control. MultiTalk & VACE tutorial”
- 5:27 (length 1:27) Overview of the Wan2GP interface for Multi-Talk: The creator walks through the Wan2GP Gradio interface, explaining how to select the Multi-Talk model and the specific 'VACE Multi-Talk FusioniX' version for better performance on low VRAM. (Multitalk · AI Animation Generator)
- 8:22 (length 4:37) Generate talking head video from image and audio (current): The user demonstrates uploading a reference image and an audio clip to Multi-Talk, configuring background removal and text prompts to generate a video of a woman speaking in a park. (Multitalk · AI Avatar Video Generator)
- 13:57 (length 1:32) Simulate angry expressions with Multi-Talk: The demo shows how to use an angry reference image and matching audio to generate a highly expressive video that captures the pitch and intensity of the speaker's anger. (Multitalk · AI Lip Sync Generator)
- 15:29 (length 1:10) Animate sad emotions and crying: The creator demonstrates Multi-Talk's ability to handle complex emotions by animating a sad character who pauses and breathes in sync with a crying audio track. (Multitalk · AI Lip Sync Generator)
- 17:44 (length 1:28) Lip-syncing anime characters: A demonstration of applying Japanese audio to an anime still image, showing how the tool handles non-human characters and different languages. (Multitalk · AI Lip Sync Generator)
- 19:39 (length 3:03) Animate multiple speakers in a podcast scene: The video shows how to configure Multi-Talk for two speakers by uploading an image of two people and two sequential audio clips, assigning voices based on their position in the frame. (Multitalk · AI Avatar Video Generator)
- 22:17 (length 3:21) Parallel multi-speaker animation: The user demonstrates a more advanced multi-speaker setup where two audio tracks are played in parallel to animate a conversation between two people in a single reference image. (Multitalk · AI Avatar Video Generator)
- 26:32 (length 2:34) Transfer human motion with VACE and Multi-Talk: The demo shows how to use a control video of a person dancing to drive the body movements of a reference image while simultaneously applying a Spanish lip-sync track. (Multitalk · Video to Video)
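The two multi-speaker demos lay the audio out differently in time, which changes how many frames the video needs. The sketch below applies the same seconds-multiplied-by-25 rule to both layouts, assuming that sequential clips play back to back (their durations add up) and that parallel clips overlap (the longest clip sets the length). The `frames_needed` function and its "sequential"/"parallel" mode strings are hypothetical illustrations, not Wan2GP settings.

```python
import math

FPS = 25  # frames per second, from the "seconds multiplied by 25" rule


def frames_needed(durations_s, mode: str, fps: int = FPS) -> int:
    """Frames needed to cover every speaker's audio (hypothetical helper).

    mode="sequential": clips play one after another, so durations add up.
    mode="parallel":   clips play at the same time, so the longest clip wins.
    """
    if mode == "sequential":
        total = sum(durations_s)
    elif mode == "parallel":
        total = max(durations_s, default=0.0)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Round before ceil to absorb floating-point noise, then round up so
    # the video is never shorter than the audio (an assumption).
    return math.ceil(round(total * fps, 6))
```

For example, clips of 4 and 6 seconds need 250 frames when played sequentially but only 150 when played in parallel.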
AI Avatar Video Generator
- 0:50 (length 0:47) Multi-person conversational video generation: MultiTalk is shown animating a group image where two separate people interact and respond to each other using different audio tracks. (NadimExplainsAI)