ComfyUI: Generate temporal face swap with WAN Video

Demo summary

The user combines the reference identity, pose data, and masks into the WAN Video sampler to generate a temporally consistent video sequence.

Combine the reference identity, pose information, original background, and mask into the WAN Video sampler
Run the sampler to generate the new sequence
Decode the latent result back into image frames
Reassemble the frames into a final video
Import the original audio to sync with the new footage
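
The last two steps (reassembling frames and bringing back the original audio) happen inside ComfyUI nodes in the demo. As a rough illustration of what the audio step amounts to outside the node graph, the sketch below builds an ffmpeg command that keeps the video stream from the generated clip and takes the audio stream from the original footage. The function name and file names are hypothetical, and using ffmpeg here is an assumption, not something shown in the demo.

```python
def build_mux_command(new_video: str, original_video: str, output: str) -> list[str]:
    """Build an ffmpeg argument list that pairs the generated video stream
    with the audio stream of the original footage (hypothetical helper)."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", new_video,         # input 0: face-swapped clip (video source)
        "-i", original_video,    # input 1: original footage (audio source)
        "-map", "0:v:0",         # first video stream of input 0
        "-map", "1:a:0",         # first audio stream of input 1
        "-c:v", "copy",          # keep the generated video as encoded
        "-c:a", "aac",           # re-encode audio to AAC for MP4 output
        "-shortest",             # stop at the shorter of the two streams
        output,
    ]
```

Passing the returned list to `subprocess.run(cmd, check=True)` would run the mux, provided ffmpeg is installed. Audio stays in sync only if the generated clip keeps the frame rate and start point of the source footage.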

0:29 (0:24) Load source footage and models in ComfyUI. The user demonstrates importing source video footage and loading the necessary model nodes including WAN Video, VAE, and Clip Vision within the ComfyUI interface. [ComfyUI · AI Animation Generator]
0:53 (0:48) Generate and refine head masks with Florence 2 and SAM 2. The workflow shows using Florence 2 for object detection to target the head and SAM 2 for precise segmentation, including adjusting the 'grow mask expand' value to improve blending. [ComfyUI · AI Inpainting]
1:41 (0:46) Prepare driving data and auto-prompts with Qwen2-VL. The demo shows running pose detection on source footage and using Qwen2-VL (referred to as Gwen VL) to generate a semantic text description from a reference image for the face swap. [ComfyUI · AI Face Swap Generator]
2:27 (0:35) Generate temporal face swap with WAN Video. The user combines the reference identity, pose data, and masks into the WAN Video sampler to generate a temporally consistent video sequence. [ComfyUI · AI Face Swap Video]
3:28 (0:27) Simplified face swap using ComfyUI App Mode. A demonstration of the simplified 'App Mode' interface where users can upload footage and a reference image to perform a face swap without interacting with the node graph. [ComfyUI · AI Face Swap Generator]
Watch “Learn How To: Face Swap” →
