Veo

Veo is Google DeepMind’s video generation model. Its defining feature is joint audio-visual generation — rather than generating video first and adding sound as a separate step, Veo processes both modalities together in a joint diffusion process. Generated audio therefore syncs naturally with on-screen actions, and dialogue matches lip movements to within 120 ms.

Veo uses a 3D latent diffusion transformer architecture that goes beyond 2D image generation by adding time as a third dimension. The model uses 3D convolutional layers to process spatiotemporal data across channels, time, height, and width simultaneously, enabling it to extract patterns not just across space but also across time.

The video data is compressed into spatio-temporal patches in the latent space, making generation efficient while maintaining high visual quality. This same architecture powers the audio generation when generate_audio is enabled.
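To make the patching step concrete, here is a minimal NumPy sketch of splitting a spatio-temporal latent into non-overlapping patches, one token per patch. The tensor layout and patch sizes are invented for illustration; Veo's actual latent dimensions are not public.

```python
import numpy as np

def patchify_3d(latent, pt, ph, pw):
    """Split a (T, H, W, C) latent into non-overlapping spatio-temporal patches.

    Each patch spans pt frames, ph rows, and pw columns, and is flattened
    into a single token vector of length pt * ph * pw * C.
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group the three patch axes together
    return x.reshape(-1, pt * ph * pw * C)  # (num_patches, token_dim)

# Illustrative shapes: a 16-frame, 64x64, 8-channel latent with 4x8x8 patches
tokens = patchify_3d(np.zeros((16, 64, 64, 8)), pt=4, ph=8, pw=8)
```

With these example shapes, the latent becomes 256 tokens of dimension 2,048, which the diffusion transformer then attends over jointly across space and time.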

Veo is available in two speed tiers:

Mode       Generation time   Best for
Standard   ~90 seconds       Maximum quality, commercial content
Fast       ~60 seconds       Rapid iteration, previews, high-volume workloads

Both modes support the same resolution, aspect ratio, and feature set. The quality difference is subtle — start with Fast for prototyping and switch to Standard for final output.

Joint audio-visual generation: Enable generate_audio: true to produce a synchronized audio track alongside the video. The model generates audio that matches on-screen actions — footsteps sync with walking, dialogue matches lip movements, ambient sounds match the environment. This eliminates the need for a separate audio generation or foley step.

Negative prompts: Veo supports negative_prompt to exclude specific content from generation. This is useful for avoiding common artifacts: "low quality, blurry, distorted faces, watermark".

Last-frame control (img2vid): For image-to-video generation, you can provide both a starting image and a target last frame. The model generates a smooth transition between the two, useful for morphing effects and controlled scene transitions.

Person generation policy: The person_generation parameter lets you explicitly allow or disallow the generation of people, giving you control over content policy compliance.

  • Video with sound — the only model on Prodia with integrated audio generation
  • Landscape and nature content — excels at sweeping shots, atmospheric scenes, and environmental video
  • Social media video — 16:9 and 9:16 aspect ratios cover YouTube, TikTok, and Instagram
  • Talking-head content — joint audio-visual diffusion produces natural lip sync

For longer videos (up to 15s), more aspect ratios, or video continuation, consider Wan 2.7. For fast generation at lower cost, Wan 2.2 Lightning generates in ~22s. For precise camera control, Kling offers programmatic camera movements.

Job type                        Description                      ETA
inference.veo.txt2vid.v2        Generate a video from text       ~90s
inference.veo.img2vid.v2        Generate a video from an image   ~90s
inference.veo.fast.txt2vid.v2   Fast text-to-video generation    ~60s
inference.veo.fast.img2vid.v2   Fast image-to-video generation   ~60s
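The four job types above follow a regular naming pattern, so selecting one programmatically is a one-liner. A small helper (the name `veo_job_type` is mine; the type strings come from the table):

```python
def veo_job_type(image_input: bool = False, fast: bool = False) -> str:
    """Build a Veo job type string from the documented naming pattern."""
    parts = ["inference", "veo"]
    if fast:
        parts.append("fast")                       # fast tier inserts a "fast" segment
    parts.append("img2vid" if image_input else "txt2vid")
    parts.append("v2")
    return ".".join(parts)

veo_job_type()                          # "inference.veo.txt2vid.v2"
veo_job_type(image_input=True, fast=True)  # "inference.veo.fast.img2vid.v2"
```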

Common to all:

  • prompt (required) — text description, up to 2,500 characters
  • negative_prompt — content to exclude, up to 2,500 characters
  • resolution — 720p (default) or 1080p
  • aspect_ratio — 16:9 (default) or 9:16
  • duration_seconds — 4, 6, or 8 (default)
  • generate_audio — set to true to generate a synchronized audio track (default: false)
  • person_generation — allow_adult (default) or dont_allow
  • seed — integer for reproducible results
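Since most of these parameters accept a small fixed set of values, it can be worth validating a config client-side before submitting a job. A minimal sketch that mirrors the documented ranges above (the helper name and error messages are mine, not part of the API):

```python
def validate_veo_config(config: dict) -> list[str]:
    """Return a list of problems with a Veo job config, per the documented limits."""
    errors = []
    prompt = config.get("prompt", "")
    if not prompt:
        errors.append("prompt is required")
    elif len(prompt) > 2500:
        errors.append("prompt exceeds 2,500 characters")
    if len(config.get("negative_prompt", "")) > 2500:
        errors.append("negative_prompt exceeds 2,500 characters")
    if config.get("resolution", "720p") not in ("720p", "1080p"):
        errors.append("resolution must be 720p or 1080p")
    if config.get("aspect_ratio", "16:9") not in ("16:9", "9:16"):
        errors.append("aspect_ratio must be 16:9 or 9:16")
    if config.get("duration_seconds", 8) not in (4, 6, 8):
        errors.append("duration_seconds must be 4, 6, or 8")
    if config.get("person_generation", "allow_adult") not in ("allow_adult", "dont_allow"):
        errors.append("person_generation must be allow_adult or dont_allow")
    return errors

validate_veo_config({"prompt": "a misty forest at dawn"})  # [] — valid
```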

Image-to-video only:

  • image — input image filename to use as the first frame
  • last_frame — optional target last-frame image for controlled transitions

Prompting tips:

  • Describe the soundscape: when using generate_audio, include audio cues in your prompt — “birds chirping in a forest”, “footsteps echoing in a hallway”, “crowd cheering in a stadium”
  • Cinematic language works well: terms like “tracking shot”, “slow motion”, “dolly zoom”, “aerial view” produce expected camera behaviors
  • Use negative prompts: adding "low quality, blurry, distorted, watermark" as a negative prompt consistently improves output
  • Match aspect ratio to platform: 16:9 for YouTube/landscape, 9:16 for TikTok/Reels/Shorts
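The tips above can be folded into a small config builder. This is an illustrative sketch (the helper name, platform names, and defaults are mine), not part of the API:

```python
def veo_social_config(prompt: str, platform: str, audio_cues: str = "") -> dict:
    """Assemble a Veo job config applying the prompting tips above."""
    vertical = platform.lower() in ("tiktok", "reels", "shorts")
    # Append soundscape cues to the prompt when audio is requested
    full_prompt = f"{prompt}, {audio_cues}" if audio_cues else prompt
    return {
        "prompt": full_prompt,
        "aspect_ratio": "9:16" if vertical else "16:9",
        "negative_prompt": "low quality, blurry, distorted, watermark",
        "generate_audio": bool(audio_cues),
    }

veo_social_config("a busy night market", "tiktok", "sizzling food, crowd chatter")
```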

Text-to-video with audio:

{
  "type": "inference.veo.fast.txt2vid.v2",
  "config": {
    "prompt": "A sweeping mountain landscape at sunrise, mist rolling through valleys, birds flying overhead, cinematic HDR",
    "resolution": "1080p",
    "aspect_ratio": "16:9",
    "duration_seconds": 8,
    "generate_audio": true,
    "negative_prompt": "low quality, blurry, watermark"
  }
}

Image-to-video with last-frame control:

{
  "type": "inference.veo.img2vid.v2",
  "config": {
    "prompt": "Smooth transition from day to night, lights gradually turning on",
    "image": "daytime-city.jpg",
    "last_frame": "nighttime-city.jpg",
    "resolution": "1080p",
    "aspect_ratio": "16:9",
    "duration_seconds": 6
  }
}