Fawna Audio

Audio is the Veo-family tier built around native, synced audio. Dialogue, music, and ambient sound are all generated in sync with the video from the same prompt. It is the closest Fawna gets to a full scene in one generation.

Tier label: Audio
Engine: Veo 3.1 Fast (audio mode)
Price: From 20 credits per second
Aspect ratios: 16:9, 9:16
Resolutions: 720p, 1080p
Durations: 4, 6, 8 seconds
Audio: Always on, natively synced
Character refs: First-frame image only (I2V mode)
Style ref: Not supported
Keyframes: First and last frame
Negative prompt: Not supported
Magic Prompt: Off
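
The constraints above can be checked before a generation is submitted. The following is a minimal sketch, not part of any Fawna API: the function and constant names are hypothetical, and the cost figure treats "From 20 credits per second" as a floor, so the actual price may be higher.

```python
# Values copied from the spec list above; all names here are illustrative.
ASPECT_RATIOS = {"16:9", "9:16"}
RESOLUTIONS = {"720p", "1080p"}
DURATIONS_S = {4, 6, 8}
MIN_CREDITS_PER_SECOND = 20  # "From 20 credits per second": a floor, not a quote


def check_settings(aspect_ratio: str, resolution: str, duration_s: int) -> list[str]:
    """Return a list of problems; an empty list means the settings fit the spec."""
    problems = []
    if aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")
    if resolution not in RESOLUTIONS:
        problems.append(f"unsupported resolution: {resolution}")
    if duration_s not in DURATIONS_S:
        problems.append(f"unsupported duration: {duration_s}s")
    return problems


def minimum_credits(duration_s: int) -> int:
    """Lower bound on cost, assuming the base rate applies to the whole clip."""
    return MIN_CREDITS_PER_SECOND * duration_s


print(check_settings("16:9", "1080p", 8))  # []
print(minimum_credits(8))                  # 160
```

An 8-second clip therefore costs at least 160 credits under this assumption.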

When to pick Audio

A character speaks a line and you want it synced to mouth motion.
You want ambient atmosphere baked into the clip rather than added in post.
You want a short music cue that lands with the on-screen action.
You need a one-shot scene with sound for quick drafts or social content.

Directing the audio

Audio prompts separate what is said from what is heard. Dialogue goes in double quotes, and three inline tags (SFX:, Ambient:, Music:) cover everything else:

Dialogue

Put the line in double quotes. Mention who says it.

SFX:

Sound effects tied to on-screen actions.

Ambient:

Background atmosphere.

Music:

Score or musical cues.

Example: Dialogue with ambient and SFX

Medium close-up: a bearded fisherman in his fifties stands on a
dock at sunrise, looks at the camera, and says quietly, "We're
leaving with the tide." Handheld, slight sway. 50mm lens, soft
overcast light, muted blue-grey palette. SFX: wooden dock creak,
rope slap. Ambient: gentle surf, distant gull cries.
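
If you assemble prompts programmatically, the tag layout in the example above can be sketched as a small helper. This is an illustration only: `build_prompt` is a hypothetical name, the SFX-then-Ambient order simply mirrors the example, and placing Music: last is an assumption.

```python
def build_prompt(scene: str, sfx=(), ambient=(), music=()) -> str:
    """Append SFX:, Ambient:, and Music: tags to a scene description.

    The scene text carries the visuals and any quoted dialogue;
    empty cue lists are skipped so no bare tag is emitted.
    """
    parts = [scene.strip()]
    for tag, cues in (("SFX", sfx), ("Ambient", ambient), ("Music", music)):
        if cues:
            parts.append(f"{tag}: {', '.join(cues)}.")
    return " ".join(parts)


prompt = build_prompt(
    "Medium close-up: a bearded fisherman in his fifties stands on a dock "
    'at sunrise, looks at the camera, and says quietly, "We\'re leaving '
    'with the tide." Handheld, slight sway.',
    sfx=["wooden dock creak", "rope slap"],
    ambient=["gentle surf", "distant gull cries"],
)
print(prompt)
```

Keeping the dialogue inside the scene text, rather than in a separate field, preserves the "who says it" attribution that the Dialogue guidance asks for.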

Voice direction

You can shape the voice with brief modifiers: young woman, elderly man, soft voice, whisper, shouting, sarcastic, accented. The model will do its best to match, but this is not voice cloning, so you cannot reproduce a specific person's voice.

Example: Shaped voice

A young woman with a quiet, determined voice says, "Not today."
Medium shot.

Strengths

True lip-sync. The mouth motion matches the audio.
Naturalistic ambient beds. Ocean, forest, city, rain all sound placed and convincing.
Same Veo photoreal quality as the Film family.
Short durations keep audio coherent (longer clips risk audio drift).

Where it struggles

Long monologues. Keep lines under ~10 words each. Multi-sentence dialogue gets compressed or cropped.
Music composition is serviceable but not production-grade. Use a real music track for hero pieces.
Voice cloning is not supported. Do not expect a specific actor's voice.
Three or more speakers in a single clip confuse the audio assignment. One or two speakers max.

Tips

Short lines, clear delivery. "We're leaving with the tide." reads better than a paragraph.
Layer tags. Dialogue plus SFX plus Ambient in one prompt gives the richest result.
Keep duration to 4-6 seconds for dialogue-heavy shots. Longer clips drift.
Quote each line and attribute it to its speaker: "She says, 'Ready?' He replies, 'Always.'" is clearer than a run-on.
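
The dialogue limits and tips above can be turned into a quick pre-flight check. This is a rough sketch under stated assumptions: `lint_dialogue` is a hypothetical helper, it only recognizes straight double quotes, and the thresholds are the soft guidelines from this page (under ~10 words per line, 4-6 seconds for dialogue shots), not hard limits enforced by the model.

```python
import re

MAX_WORDS_PER_LINE = 10      # "under ~10 words": a soft limit
MAX_DIALOGUE_DURATION_S = 6  # tips suggest 4-6 seconds for dialogue-heavy shots


def lint_dialogue(prompt: str, duration_s: int) -> list[str]:
    """Return warnings for spoken lines that are likely to be cut off or drift."""
    warnings = []
    # Treat every straight-double-quoted span as one spoken line.
    lines = re.findall(r'"([^"]+)"', prompt)
    for line in lines:
        word_count = len(line.split())
        if word_count >= MAX_WORDS_PER_LINE:
            warnings.append(
                f"line has {word_count} words, aim for under {MAX_WORDS_PER_LINE}: {line!r}"
            )
    if lines and duration_s > MAX_DIALOGUE_DURATION_S:
        warnings.append(f"{duration_s}s clip with dialogue may drift, consider 4-6s")
    return warnings


print(lint_dialogue('A fisherman says quietly, "We\'re leaving with the tide."', 6))  # []
```

The check does not try to count speakers, since attribution is written in free text. Keeping a clip to one or two speakers remains a manual review step.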

Where to go next

Fawna Film family for the same photoreal engine with optional audio.
Audio Generation for standalone voice generation.
