Fawna Audio
Audio is the Veo-family tier whose headline feature is native synced audio. Dialogue, music, and ambient sound are all generated in sync with the video from the same prompt. It is the closest Fawna gets to a full scene in one generation.
- Tier label: Audio
- Engine: Google Veo 3.1 Fast with audio mode
- Price: From 20 credits per second
- Aspect ratios: 16:9, 9:16
- Resolutions: 720p, 1080p
- Durations: 4, 6, 8 seconds
- Audio: Always on, natively synced
- Character refs: First-frame image only (I2V mode)
- Style ref: Not supported
- Keyframes: First and last frame
- Negative prompt: Not supported
- Magic Prompt: Off
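As a rough illustration of how the specs above combine, the sketch below checks a set of clip settings against the listed aspect ratios, resolutions, and durations, then multiplies the duration by the 20-credit floor. The function and constant names are hypothetical and not part of any Fawna API. Because the price is quoted as "from" 20 credits per second, the result is a lower bound only; actual cost may be higher, for example at 1080p.

```python
# Values copied from the spec list above. All names here are illustrative only.
ASPECT_RATIOS = {"16:9", "9:16"}
RESOLUTIONS = {"720p", "1080p"}
DURATIONS_S = {4, 6, 8}
BASE_CREDITS_PER_SECOND = 20  # "From 20 credits per second": a floor, not a quote


def minimum_credits(aspect_ratio: str, resolution: str, duration_s: int) -> int:
    """Validate the settings and return the lowest possible credit cost."""
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if duration_s not in DURATIONS_S:
        raise ValueError(f"unsupported duration: {duration_s} seconds")
    return BASE_CREDITS_PER_SECOND * duration_s


if __name__ == "__main__":
    for seconds in sorted(DURATIONS_S):
        cost = minimum_credits("16:9", "720p", seconds)
        print(f"{seconds} s clip: at least {cost} credits")
```

At the base rate, a 4-second clip costs at least 80 credits, a 6-second clip at least 120, and an 8-second clip at least 160.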
When to pick Audio
- A character speaks a line and you want it synced to mouth motion.
- You want ambient atmosphere baked into the clip rather than added in post.
- You want a short music cue that lands with the on-screen action.
- You need a one-shot scene with sound for quick drafts or social content.
Directing the audio
Audio uses inline tags to separate what is said from what is heard. Three tags do most of the work: Dialogue, SFX, and Ambient.
Example
Dialogue with ambient and SFX
Medium close-up: a bearded fisherman in his fifties stands on a dock at sunrise, looks at the camera, and says quietly, "We're leaving with the tide." Handheld, slight sway. 50mm lens, soft overcast light, muted blue-grey palette. SFX: wooden dock creak, rope slap. Ambient: gentle surf, distant gull cries.
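If you assemble prompts programmatically, the layout of the example above (scene description with quoted dialogue first, then an SFX: list, then an Ambient: list) can be sketched as a small helper. This is a minimal sketch, assuming that ordering and punctuation are what you want; `build_audio_prompt` is a hypothetical name and not a Fawna function.

```python
from typing import Sequence


def build_audio_prompt(
    scene: str,
    sfx: Sequence[str] = (),
    ambient: Sequence[str] = (),
) -> str:
    """Join a scene description with optional SFX and Ambient tag lists.

    The scene text should already contain any quoted dialogue.
    """
    parts = [scene.strip()]
    if sfx:
        parts.append("SFX: " + ", ".join(sfx) + ".")
    if ambient:
        parts.append("Ambient: " + ", ".join(ambient) + ".")
    return " ".join(parts)


if __name__ == "__main__":
    prompt = build_audio_prompt(
        'A bearded fisherman looks at the camera and says quietly, '
        '"We\'re leaving with the tide." Handheld, slight sway.',
        sfx=["wooden dock creak", "rope slap"],
        ambient=["gentle surf", "distant gull cries"],
    )
    print(prompt)
```

Keeping the sound lists separate from the scene text makes it easy to swap the ambient bed or drop the SFX without touching the dialogue.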
Voice direction
You can shape the voice with brief modifiers: young woman, elderly man, soft voice, whisper, shouting, sarcastic, accented. The model will do its best to match. It is not precise voice cloning and does not allow you to match a specific person's voice.
Example
Shaped voice
A young woman with a quiet, determined voice says, "Not today." Medium shot.
Strengths
- True lip-sync. The mouth motion matches the audio.
- Naturalistic ambient beds. Ocean, forest, city, rain all sound placed and convincing.
- Same Veo photoreal quality as the Film family.
- Short durations keep audio coherent (longer clips risk audio drift).
Where it struggles
- Long monologues. Keep lines under ~10 words each. Multi-sentence dialogue gets compressed or cropped.
- Music composition is serviceable but not production-grade. Use a real music track for hero pieces.
- Voice cloning is not supported. Do not expect a specific actor's voice.
- Three or more speakers in a single clip confuse the audio assignment. One or two speakers max.
Tips
- Short lines, clear delivery. "We're leaving with the tide." reads better than a paragraph.
- Layer tags. Dialogue plus SFX plus Ambient in one prompt gives the richest result.
- Keep duration to 4 or 6 seconds for dialogue-heavy shots. Longer clips drift.
- Use quotation marks and explicit attribution to separate speakers: "She says, 'Ready?' He replies, 'Always.'" is clearer than a run-on.
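The length and duration guidance on this page can be turned into a quick pre-flight check. The sketch below is one hypothetical way to do it: it treats any text inside double quotes as a spoken line, warns when a line runs past roughly 10 words, and warns when a clip containing dialogue is longer than 6 seconds. It does not try to count speakers, since that cannot be inferred reliably from the prompt text alone. The names and thresholds are assumptions drawn from the guidance above, not part of Fawna.

```python
import re
from typing import List

MAX_WORDS_PER_LINE = 10   # approximate limit: "Keep lines under ~10 words each"
MAX_DIALOGUE_SECONDS = 6  # dialogue-heavy shots are advised to stay short

# Matches text inside straight or curly double quotes.
QUOTED = re.compile(r'"([^"]+)"|“([^”]+)”')


def check_prompt(prompt: str, duration_s: int) -> List[str]:
    """Return human-readable warnings; an empty list means no issues found."""
    warnings = []
    spoken_lines = [straight or curly for straight, curly in QUOTED.findall(prompt)]
    for line in spoken_lines:
        word_count = len(line.split())
        if word_count > MAX_WORDS_PER_LINE:
            warnings.append(
                f"Spoken line has {word_count} words and may be compressed: {line!r}"
            )
    if spoken_lines and duration_s > MAX_DIALOGUE_SECONDS:
        warnings.append(
            f"Clip has dialogue and runs {duration_s} s; audio may drift."
        )
    return warnings


if __name__ == "__main__":
    for warning in check_prompt('She says, "Not today." Medium shot.', 8):
        print(warning)
```

A check like this only catches the mechanical cases; whether a line reads clearly is still a judgment call.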
Where to go next
- Fawna Film family for the same photoreal engine with optional audio.
- Audio Generation for standalone voice generation.