XCAT 3.0: Parallel multi-speaker animation

XCAT 3.0 · AI Search · Jul 2025

Demo summary

The user demonstrates a more advanced multi-speaker setup where two audio tracks are played in parallel to animate a conversation between two people in a single reference image.

Step-by-step

Upload a reference image containing two people
Select the 'In parallel' option for audio processing
Prepare two separate audio files of identical duration by splitting a conversation into individual tracks with silence where the other person is speaking
Upload both audio clips to the interface
Set the video length in frames to match the audio duration (seconds multiplied by 25 frames per second)
Enable 'TeaCache' for step skipping and select the 2x speed-up
Set step skipping to start at 10% of the generation
Click Generate
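
The frame-count rule from the steps above (seconds multiplied by 25 frames per second) can be sketched in Python. The function name and the choice to round to the nearest whole frame are assumptions made for illustration; the demo only states the multiplication.

```python
def frames_for_audio(duration_seconds: float, fps: int = 25) -> int:
    """Return the video length in frames for an audio clip of the given duration."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    # Rounding to the nearest whole frame is an assumption;
    # the demo only gives "seconds multiplied by 25".
    return round(duration_seconds * fps)


print(frames_for_audio(10))    # a 10-second clip at 25 fps
print(frames_for_audio(12.4))  # a clip that is not a whole number of seconds
```

A 10-second pair of clips therefore needs a video length of 250 frames.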

Options

Select 'Two audio sources played in a row' for sequential speaking or singing
Keep or remove the background from the reference image

Watch out for

Parallel audio clips must have the exact same duration
The system assumes the person on the left speaks first, so the first audio clip must belong to the person on the left of the reference image
The individual tracks with timed silence must be prepared in an external audio editor before uploading
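
Because the two parallel clips must have exactly the same duration, it can help to check them before uploading. The sketch below uses Python's standard wave module and assumes uncompressed 16-bit (or wider) PCM WAV files; the helper names wav_frames and pad_to_length are hypothetical and are not part of the tool shown in the demo.

```python
import wave


def wav_frames(path: str) -> tuple[int, int]:
    """Return (sample_count, sample_rate) for a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes(), wav.getframerate()


def pad_to_length(src: str, dst: str, target_frames: int) -> None:
    """Copy src to dst, appending silence until it holds target_frames samples."""
    with wave.open(src, "rb") as wav_in:
        params = wav_in.getparams()
        audio = wav_in.readframes(params.nframes)
    missing = target_frames - params.nframes
    if missing < 0:
        raise ValueError("source is already longer than the target")
    # Zero bytes are silence for 16-bit and wider PCM.
    # (8-bit WAV is unsigned, so its silence value would be 0x80 instead.)
    silence = b"\x00" * (missing * params.sampwidth * params.nchannels)
    with wave.open(dst, "wb") as wav_out:
        wav_out.setparams(params)
        wav_out.writeframes(audio + silence)
```

If wav_frames reports different sample counts for the two clips (at the same sample rate), padding the shorter one to the longer one's count makes the durations identical.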

Tips

Use an external audio editor to mute the opposite voice on each track to ensure clean separation
Minimal prompting is needed when using a strong reference image
XCAT 3.0 can capture realistic nuances like stutters, clicks, and pauses from the audio
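
The demo does the track separation by hand in an audio editor. As a rough scripted equivalent, the sketch below splits one mono recording into one track per speaker, writing silence wherever the other person talks. The function name, the speaker labels, and the turn timestamps are all hypothetical; in practice the turn boundaries would have to be found by listening to the recording.

```python
def split_by_speaker(samples, sample_rate, turns):
    """Split one mono recording into per-speaker tracks of identical length.

    samples: list of integer PCM samples for the whole conversation.
    turns: list of (start_seconds, end_seconds, speaker) tuples.
    """
    tracks = {}
    for start, end, speaker in turns:
        # Every track starts as full-length silence, so all tracks
        # end up with exactly the same duration as the original.
        track = tracks.setdefault(speaker, [0] * len(samples))
        first = max(0, round(start * sample_rate))
        last = min(len(samples), round(end * sample_rate))
        # Copy only this speaker's turn; everything else stays silent.
        track[first:last] = samples[first:last]
    return tracks


# Hypothetical two-second conversation at a toy sample rate of 4 Hz.
conversation = [5, 6, 7, 8, 9, 10, 11, 12]
turns = [(0.0, 1.0, "left"), (1.0, 2.0, "right")]
tracks = split_by_speaker(conversation, 4, turns)
print(tracks["left"])
print(tracks["right"])
```

Both output tracks keep the length of the original recording, which satisfies the requirement that parallel clips have the same duration.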

Highlights

“everything looks very fluid and natural and realistic”

All demos from “Make AI videos with talking + pose + reference control. MultiTalk & VACE tutorial”

5:27 · 1:27 · Overview of the Wan2GP interface for Multi-Talk
The creator walks through the Wan2GP Gradio interface, explaining how to select the Multi-Talk model and the specific 'VACE Multi-Talk Fusion X' version for better performance on low VRAM.
XCAT 3.0 · AI Animation Generator

8:22 · 4:37 · Generate talking head video from image and audio
The user demonstrates uploading a reference image and an audio clip to Multi-Talk, configuring background removal and text prompts to generate a video of a woman speaking in a park.
XCAT 3.0 · AI Avatar Video Generator

13:57 · 1:32 · Simulate angry expressions with Multi-Talk
The demo shows how to use an angry reference image and matching audio to generate a highly expressive video that captures the pitch and intensity of the speaker's anger.
XCAT 3.0 · AI Lip Sync Generator

15:29 · 1:10 · Animate sad emotions and crying
The creator demonstrates Multi-Talk's ability to handle complex emotions by animating a sad character who pauses and breathes in sync with a crying audio track.
XCAT 3.0 · AI Lip Sync Generator

17:44 · 1:28 · Lip-syncing anime characters
A demonstration of applying Japanese audio to an anime still image, showing how the tool handles non-human characters and different languages.
XCAT 3.0 · AI Lip Sync Generator

19:39 · 3:03 · Animate multiple speakers in a podcast scene
The video shows how to configure Multi-Talk for two speakers by uploading an image of two people and two sequential audio clips, assigning voices based on their position in the frame.
XCAT 3.0 · AI Avatar Video Generator

22:17 · 3:21 · Parallel multi-speaker animation (current)
The user demonstrates a more advanced multi-speaker setup where two audio tracks are played in parallel to animate a conversation between two people in a single reference image.
XCAT 3.0 · AI Avatar Video Generator

26:32 · 2:34 · Transfer human motion with VACE and Multi-Talk
The demo shows how to use a control video of a person dancing to drive the body movements of a reference image while simultaneously applying a Spanish lip-sync track.
XCAT 3.0 · Video to Video

AI Avatar Video Generator

8:22 · 4:37 · Generate talking head video from image and audio
The user demonstrates uploading a reference image and an audio clip to Multi-Talk, configuring background removal and text prompts to generate a video of a woman speaking in a park.
AI Search

19:39 · 3:03 · Animate multiple speakers in a podcast scene
The video shows how to configure Multi-Talk for two speakers by uploading an image of two people and two sequential audio clips, assigning voices based on their position in the frame.
AI Search

22:17 · 3:21 · Parallel multi-speaker animation (current)
The user demonstrates a more advanced multi-speaker setup where two audio tracks are played in parallel to animate a conversation between two people in a single reference image.
AI Search

0:50 · 0:47 · Multi-person conversational video generation
MultiTalk is shown animating a group image where two separate people interact and respond to each other using different audio tracks.
NadimExplainsAI