Veo

Veo is Google DeepMind’s video generation model. Its defining feature is joint audio-visual generation — rather than generating video first and adding sound as a separate step, Veo processes both modalities together in a joint diffusion process. Generated audio therefore syncs naturally with on-screen actions, and dialogue matches lip movements to within 120 ms.

Veo uses a 3D latent diffusion transformer architecture that goes beyond 2D image generation by adding time as a third dimension. The model uses 3D convolutional layers to process spatiotemporal data across channels, time, height, and width simultaneously, enabling it to extract patterns not just across space but also across time.

The video data is compressed into spatio-temporal patches in the latent space, making generation efficient while maintaining high visual quality. This same architecture powers the audio generation when generate_audio is enabled.
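To make the patching step concrete, here is a minimal NumPy sketch of splitting a spatio-temporal latent into non-overlapping patches, one token per patch. The tensor layout and patch sizes are invented for illustration; Veo's actual latent dimensions are not public.

```python
import numpy as np

def patchify_3d(latent, pt, ph, pw):
    """Split a (T, H, W, C) latent into non-overlapping spatio-temporal patches.

    Each patch spans pt frames, ph rows, and pw columns, and is flattened
    into a single token vector of length pt * ph * pw * C.
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group the three patch axes together
    return x.reshape(-1, pt * ph * pw * C)  # (num_patches, token_dim)

# Illustrative shapes: a 16-frame, 64x64, 8-channel latent with 4x8x8 patches
tokens = patchify_3d(np.zeros((16, 64, 64, 8)), pt=4, ph=8, pw=8)
```

With these example shapes, the latent becomes 256 tokens of dimension 2,048, which the diffusion transformer then attends over jointly across space and time.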

Veo is available in two speed tiers:

Mode       Generation time   Best for
Standard   ~90 seconds       Maximum quality, commercial content
Fast       ~60 seconds       Rapid iteration, previews, high-volume workloads

Both modes support the same resolution, aspect ratio, and feature set. The quality difference is subtle — start with Fast for prototyping and switch to Standard for final output.

Joint audio-visual generation: Enable generate_audio: true to produce a synchronized audio track alongside the video. The model generates audio that matches on-screen actions — footsteps sync with walking, dialogue matches lip movements, ambient sounds match the environment. This eliminates the need for a separate audio generation or foley step.

Negative prompts: Veo supports negative_prompt to exclude specific content from generation. This is useful for avoiding common artifacts: "low quality, blurry, distorted faces, watermark".

Last-frame control (img2vid): For image-to-video generation, you can provide both a starting image and a target last frame. The model generates a smooth transition between the two, useful for morphing effects and controlled scene transitions.

Person generation policy: The person_generation parameter lets you explicitly allow or disallow the generation of people, giving you control over content policy compliance.

  • Video with sound — the only model on Prodia with integrated audio generation
  • Landscape and nature content — excels at sweeping shots, atmospheric scenes, and environmental video
  • Social media video — 16:9 and 9:16 aspect ratios cover YouTube, TikTok, and Instagram
  • Talking-head content — joint audio-visual diffusion produces natural lip sync

For longer videos (up to 15s), more aspect ratios, or video continuation, consider Wan 2.7. For fast generation at lower cost, Wan 2.2 Lightning generates in ~22s. For precise camera control, Kling offers programmatic camera movements.

Job type                        Description                      ETA
inference.veo.txt2vid.v2        Generate a video from text       ~90s
inference.veo.img2vid.v2        Generate a video from an image   ~90s
inference.veo.fast.txt2vid.v2   Fast text-to-video generation    ~60s
inference.veo.fast.img2vid.v2   Fast image-to-video generation   ~60s
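The four job types above follow a regular naming pattern, so selecting one programmatically is a one-liner. A small helper (the name `veo_job_type` is mine; the type strings come from the table):

```python
def veo_job_type(image_input: bool = False, fast: bool = False) -> str:
    """Build a Veo job type string from the documented naming pattern."""
    parts = ["inference", "veo"]
    if fast:
        parts.append("fast")                       # fast tier inserts a "fast" segment
    parts.append("img2vid" if image_input else "txt2vid")
    parts.append("v2")
    return ".".join(parts)

veo_job_type()                          # "inference.veo.txt2vid.v2"
veo_job_type(image_input=True, fast=True)  # "inference.veo.fast.img2vid.v2"
```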

Common to all:

  • prompt (required) — text description, up to 2,500 characters
  • negative_prompt — content to exclude, up to 2,500 characters
  • resolution — 720p (default) or 1080p
  • aspect_ratio — 16:9 (default) or 9:16
  • duration_seconds — 4, 6, or 8 (default)
  • generate_audio — set to true to generate a synchronized audio track (default: false)
  • person_generation — allow_adult (default) or dont_allow
  • seed — integer for reproducible results
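Since most of these parameters accept a small fixed set of values, it can be worth validating a config client-side before submitting a job. A minimal sketch that mirrors the documented ranges above (the helper name and error messages are mine, not part of the API):

```python
def validate_veo_config(config: dict) -> list[str]:
    """Return a list of problems with a Veo job config, per the documented limits."""
    errors = []
    prompt = config.get("prompt", "")
    if not prompt:
        errors.append("prompt is required")
    elif len(prompt) > 2500:
        errors.append("prompt exceeds 2,500 characters")
    if len(config.get("negative_prompt", "")) > 2500:
        errors.append("negative_prompt exceeds 2,500 characters")
    if config.get("resolution", "720p") not in ("720p", "1080p"):
        errors.append("resolution must be 720p or 1080p")
    if config.get("aspect_ratio", "16:9") not in ("16:9", "9:16"):
        errors.append("aspect_ratio must be 16:9 or 9:16")
    if config.get("duration_seconds", 8) not in (4, 6, 8):
        errors.append("duration_seconds must be 4, 6, or 8")
    if config.get("person_generation", "allow_adult") not in ("allow_adult", "dont_allow"):
        errors.append("person_generation must be allow_adult or dont_allow")
    return errors

validate_veo_config({"prompt": "a misty forest at dawn"})  # [] — valid
```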

Image-to-video only:

  • image — input image filename to use as the first frame
  • last_frame — optional target last-frame image for controlled transitions

Prompting tips:

  • Describe the soundscape: when using generate_audio, include audio cues in your prompt — “birds chirping in a forest”, “footsteps echoing in a hallway”, “crowd cheering in a stadium”
  • Cinematic language works well: terms like “tracking shot”, “slow motion”, “dolly zoom”, “aerial view” produce expected camera behaviors
  • Use negative prompts: adding "low quality, blurry, distorted, watermark" as a negative prompt consistently improves output
  • Match aspect ratio to platform: 16:9 for YouTube/landscape, 9:16 for TikTok/Reels/Shorts
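The tips above can be folded into a small config builder. This is an illustrative sketch (the helper name, platform names, and defaults are mine), not part of the API:

```python
def veo_social_config(prompt: str, platform: str, audio_cues: str = "") -> dict:
    """Assemble a Veo job config applying the prompting tips above."""
    vertical = platform.lower() in ("tiktok", "reels", "shorts")
    # Append soundscape cues to the prompt when audio is requested
    full_prompt = f"{prompt}, {audio_cues}" if audio_cues else prompt
    return {
        "prompt": full_prompt,
        "aspect_ratio": "9:16" if vertical else "16:9",
        "negative_prompt": "low quality, blurry, distorted, watermark",
        "generate_audio": bool(audio_cues),
    }

veo_social_config("a busy night market", "tiktok", "sizzling food, crowd chatter")
```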

Text-to-video with audio:

{
  "type": "inference.veo.fast.txt2vid.v2",
  "config": {
    "prompt": "A sweeping mountain landscape at sunrise, mist rolling through valleys, birds flying overhead, cinematic HDR",
    "resolution": "1080p",
    "aspect_ratio": "16:9",
    "duration_seconds": 8,
    "generate_audio": true,
    "negative_prompt": "low quality, blurry, watermark"
  }
}

Image-to-video with last-frame control:

{
  "type": "inference.veo.img2vid.v2",
  "config": {
    "prompt": "Smooth transition from day to night, lights gradually turning on",
    "image": "daytime-city.jpg",
    "last_frame": "nighttime-city.jpg",
    "resolution": "1080p",
    "aspect_ratio": "16:9",
    "duration_seconds": 6
  }
}