Setting up Stable Audio 3.0 in ComfyUI

Demo summary
The creator demonstrates how to access the Stable Audio 3.0 workflow via ComfyUI templates, download the necessary checkpoint and text encoder models, and navigate the prompt nodes for music and sound effect generation.
Step-by-step
- Update ComfyUI to the latest version
- Open the ComfyUI templates and select the Stable Audio 3.0 Medium workflow
- Click the download buttons on the right side of the interface to acquire the checkpoint and two text encoders
- Move the checkpoint file into the checkpoints folder (models/checkpoints) and the two text encoders into the text encoders folder (models/text_encoders)
- Configure the duration and seed in the prompt node
- Select the appropriate category for your generation, such as music, instrument, or sound effect
- Run the workflow to generate and save the MP3
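The file-placement step above can also be scripted. The following is a minimal sketch, not part of the demo: it assumes ComfyUI's default folder layout (models/checkpoints and models/text_encoders under the install root), and the function name place_models and every file name passed to it are hypothetical placeholders for whatever the template's download buttons actually fetch.

```python
import shutil
from pathlib import Path


def place_models(download_dir, comfy_root, checkpoint_name, encoder_names):
    """Move one checkpoint and its text encoders into ComfyUI's model folders.

    Returns the destination paths of the files that were actually moved.
    """
    download_dir = Path(download_dir)
    models_dir = Path(comfy_root) / "models"

    # Pair each file with its destination folder (assumed default layout).
    plan = [(checkpoint_name, models_dir / "checkpoints")]
    plan += [(name, models_dir / "text_encoders") for name in encoder_names]

    moved = []
    for name, dest_dir in plan:
        src = download_dir / name
        if not src.is_file():
            # Report files that have not been downloaded yet and carry on.
            print(f"missing: {src}")
            continue
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / name
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Calling place_models with your downloads folder, your ComfyUI install path, the checkpoint file name, and a list of the two text encoder file names moves whatever is present and prints a line for anything still missing.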
Options
- Swap the standard VAE decode node for a tiled audio decode node if decoding fails, for instance when a long clip exhausts VRAM
Watch out for
- You must update ComfyUI to the latest version to access the templates
- The workflow requires three specific models: one checkpoint and two text encoders
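Before running the workflow, a quick check that all three files are in place can save a failed run. This is a small sketch under assumptions: the models/checkpoints and models/text_encoders paths reflect ComfyUI's default layout, and missing_models plus the file names you pass in are hypothetical, since the demo does not name the files.

```python
from pathlib import Path


def missing_models(comfy_root, checkpoint_name, encoder_names):
    """Return the expected model paths that do not exist yet."""
    models_dir = Path(comfy_root) / "models"

    # One checkpoint plus the text encoders (two for this workflow).
    expected = [models_dir / "checkpoints" / checkpoint_name]
    expected += [models_dir / "text_encoders" / name for name in encoder_names]

    return [path for path in expected if not path.is_file()]
```

An empty list means the checkpoint and both text encoders were found; otherwise the returned paths show which downloads still need to be moved.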
Tips
- Use the 'music' category if you are trying to generate a full ensemble
- Use the 'instrument' category if you only want to generate a single instrument