Audio Generation
Audio Generation is a dedicated text-to-speech workspace. Type a line, pick a voice, tune the sliders, generate. Use it for voiceover, narration previews, character dialogue, or any audio that is not already covered by scene-by-scene generation in Scene Mode.
How to reach it
From the app top bar, choose Create → Audio Gen. If you are inside a storyboard, you can also use the Audio Gen chip in the storyboard subnav.
The workspace
Three regions:
- Composer: the top panel. Text input, voice settings, and a Generate button.
- Voice browser: the right sidebar. A searchable list of every voice across providers, with favorites, presets, and filters.
- History grid: the rest of the page. Every clip you've generated, newest first, with inline playback.
Voice providers
Three providers, ranked by quality:
| Provider | Tier | Strength |
|---|---|---|
| ElevenLabs | Professional | Highest quality. Full expressive control. |
| Kokoro (via Replicate) | Premium | High quality. Simpler controls. |
| Google Cloud TTS | Free | 60+ neural voices across 20+ languages. |
Picking a voice
The voice browser lists every available voice. Each row shows the voice name, provider, gender, language, accent, and style. Filter by provider or language. Click a voice to load it into the composer.
A star button on each row adds a voice to Favorites. Favorites live at the top of the browser for quick access.
Voice settings
ElevenLabs exposes the full expressive knob set:
- Speed: 0.5x to 2x. 1x is natural.
- Stability: 0-100. Higher is more consistent. Lower gives more expressive variation (with drift risk).
- Similarity boost: 0-100. How strictly to hew to the base voice's identity. Higher is safer.
- Style: 0-100. Emotional expressiveness. Zero is documentary-neutral. Higher is dramatic.
Google and Kokoro expose a smaller set (primarily speed). Sliders are hidden when they do not apply to the current voice.
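The slider ranges above can be expressed as a small validation step. The following is a minimal sketch, not the product's implementation: the names `RANGES`, `SLIDERS_BY_PROVIDER`, and `visible_settings` are hypothetical, the provider keys are invented for illustration, and Google and Kokoro are reduced to speed only for simplicity (the text says "primarily speed").

```python
from __future__ import annotations

# Slider ranges as documented above. Key names are illustrative assumptions.
RANGES = {
    "speed": (0.5, 2.0),
    "stability": (0, 100),
    "similarity_boost": (0, 100),
    "style": (0, 100),
}

# Assumed mapping of provider to applicable sliders (hypothetical keys).
SLIDERS_BY_PROVIDER = {
    "elevenlabs": ("speed", "stability", "similarity_boost", "style"),
    "kokoro": ("speed",),
    "google": ("speed",),
}


def visible_settings(provider: str, settings: dict[str, float]) -> dict[str, float]:
    """Keep only the sliders that apply to `provider`, checking each range."""
    result: dict[str, float] = {}
    for name in SLIDERS_BY_PROVIDER[provider]:
        if name not in settings:
            continue
        low, high = RANGES[name]
        value = settings[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
        result[name] = value
    return result
```

For example, `visible_settings("kokoro", {"speed": 1.2, "style": 80})` keeps only the speed value, mirroring how sliders that do not apply to a voice are hidden.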
Voice presets
Save a voice together with its slider settings as a named preset, then reuse it across projects. Presets are scoped to you and are not visible to other users.
Presets cover recurring needs: a specific documentary narrator for every essay, a warm memoir voice for intimate stories, an energetic explainer voice for tutorials. Saved presets appear at the top of the voice browser alongside favorites.
Generating a clip
- Type the text into the composer. Up to 5,000 characters per clip.
- Pick a voice.
- Tune sliders if needed.
- Click Generate. The request is synchronous and typically completes in 1-3 seconds.
The new clip lands at the top of the history grid with a waveform preview and inline playback.
Clip history
Each tile in the history grid shows:
- The first 60 characters of the text.
- Duration.
- Voice name.
- Waveform preview (click to play).
The tile's menu lets you regenerate (with the original text and settings preloaded), download, add to a storyboard, or delete.
Adding a clip to a storyboard
Choosing Add to Storyboard from a clip's menu opens a picker listing your storyboards and their scenes. Choose a target. The clip is copied into that storyboard's Library as an audio asset with origin = uploaded, ready to drag onto the timeline.
Cost
Audio generation costs a flat 20 credits per 1,000 characters, rounded up, with a minimum of 1 credit. The rate is the same for every provider. The Generate button shows the estimated cost before you confirm.
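To budget a script before opening the workspace, the rate can be approximated in a few lines. This is a sketch under stated assumptions: it reads "rounded up" as rounding the prorated amount up to the next whole credit (which is what makes a 1-credit minimum possible), and it counts characters with Python's `len`. The function name `estimate_credits` is hypothetical, and the figure shown on the Generate button remains the authoritative one.

```python
CREDITS_PER_1000_CHARS = 20  # flat rate stated above
MIN_CREDITS = 1              # minimum charge stated above
MAX_CHARS = 5000             # per-clip character limit


def estimate_credits(text: str) -> int:
    """Estimate the credit cost of one clip (assumed rounding: up to a whole credit)."""
    chars = len(text)
    if chars == 0:
        raise ValueError("text is empty")
    if chars > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters; split it into clips")
    # Ceiling division using integers only: ceil(chars * 20 / 1000).
    credits = -(-chars * CREDITS_PER_1000_CHARS // 1000)
    return max(MIN_CREDITS, credits)
```

Under these assumptions, a 1,000-character script is estimated at 20 credits, a 1,001-character script at 21, and a full 5,000-character clip at 100.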
Use Audio Gen to pre-roll narration before committing to a full storyboard. Generate your script as a single clip, play it back, and time the delivery. If the pacing is off, edit the script and regenerate. This gives you cheap iteration before the expensive image and video work.
Limits
- 5,000 characters per clip. Split long narration into multiple clips.
- No voice cloning. You cannot upload a reference voice to match.
- Language detection is automatic but imperfect. Write in the language you want the voice to speak.
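Splitting long narration by hand is tedious, so a script can do it before you paste each piece into the composer. The sketch below is one possible approach, not a feature of the product: `split_script` is a hypothetical helper that packs whole sentences into chunks of at most 5,000 characters and hard-wraps any single sentence that is longer than the limit. The sentence detection is a simple heuristic (a `.`, `!`, or `?` followed by whitespace).

```python
from __future__ import annotations

import re

MAX_CHARS = 5000  # per-clip limit stated above


def split_script(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    if limit < 1:
        raise ValueError("limit must be positive")
    # Heuristic: a sentence ends at ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be generated as its own clip. Splitting at sentence boundaries keeps the seams between clips at natural pauses, which makes them easier to line up later on the timeline.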
Where to go next
- Narrators and voices for the Scene Mode version of this flow.
- Audio and mixing for placing clips on the timeline.