Sora 2

Sora 2 is OpenAI’s video generation model. Compared to first-generation video models, Sora 2 produces noticeably stronger physical realism: cause and effect play out plausibly, missed shots ricochet rather than teleport, and characters behave consistently across cuts. The Pro variant generates a synchronized audio track alongside the video.

Variants

Sora 2 ships in two variants on Prodia:

Variant	Resolution	Audio	Best for
Sora 2	Fixed (720p-class)	No	Rapid iteration, lower cost
Sora 2 Pro	`720p` or `1080p`	Yes (synchronized)	Final output, content with dialogue or sound design

Both variants accept text-to-video and image-to-video inputs and produce 4-, 8-, or 12-second clips.

What sets Sora 2 apart

Physical plausibility: Sora 2’s defining property is that it gets physics roughly right — basketballs miss the rim and bounce off the backboard, water settles, momentum carries through a gesture. Earlier text-to-video models tend to bend the world to satisfy the prompt, deleting or warping objects to make the requested outcome happen. Sora 2 is more willing to let an action fail, which produces more usable footage for narrative content.

Synchronized audio (Pro): Sora 2 Pro generates the audio track jointly with the video, so dialogue, footsteps, and ambient sound line up with what’s happening on screen. There’s no separate foley pass. Ambient cues described in the prompt (for example “rain on a metal roof”, “crowd chatter”) are produced as part of the same generation.

Steerable via prompt: The model responds well to detailed cinematographic direction — shot type, lens, lighting, camera motion — and to multi-shot prompts that specify a sequence of scenes within a single clip.

Image-to-video animation: The img2vid job types accept a still image and a motion prompt describing how the scene should evolve. Useful for animating product shots, character portraits, or storyboard frames.

When to use Sora 2

Narrative content with dialogue or sound: Sora 2 Pro is the right choice when you need an audio track baked in
Action sequences: sports, stunts, and physics-driven scenes benefit from Sora 2’s grounded behavior
Storyboard animation: animate a still image into a short clip; with Sora 2 Pro the audio is generated in the same pass, so no separate foley step is needed
Vertical and landscape social: both 16:9 and 9:16 are supported natively

For longer clips with audio control or audio-driven generation, see Wan 2.7. For the fastest video generation on Prodia (~22s), use Wan 2.2 Lightning. For precise camera choreography (dolly, pan, zoom presets) or motion masking, see Kling. For another joint audio-visual model with last-frame transition control, see Veo.

Job types

Job type	Description	Audio	Resolution
`inference.sora-2.txt2vid.v1`	Sora 2 text-to-video	No	Fixed
`inference.sora-2.img2vid.v1`	Sora 2 image-to-video	No	Fixed
`inference.sora-2.pro.txt2vid.v1`	Sora 2 Pro text-to-video	Yes	720p or 1080p
`inference.sora-2.pro.img2vid.v1`	Sora 2 Pro image-to-video	Yes	720p or 1080p
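
The four job type strings in the table follow a regular pattern, so they can be derived from the variant and the generation mode. The sketch below illustrates that pattern; the helper name `sora2_job_type` is hypothetical and not part of any Prodia SDK.

```python
def sora2_job_type(mode: str, pro: bool = False) -> str:
    """Return the Sora 2 job type string for a generation mode.

    mode is "txt2vid" or "img2vid"; pro selects the Sora 2 Pro variant.
    Hypothetical helper: it only reproduces the strings listed in the table.
    """
    if mode not in ("txt2vid", "img2vid"):
        raise ValueError(f"unknown mode: {mode!r}")
    variant = "sora-2.pro" if pro else "sora-2"
    return f"inference.{variant}.{mode}.v1"


if __name__ == "__main__":
    for pro in (False, True):
        for mode in ("txt2vid", "img2vid"):
            print(sora2_job_type(mode, pro=pro))
```

Running the script prints the four job types from the table, standard variants first.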

Parameters

Common to all Sora 2 job types:

prompt (required) — text description, 3 to 4,096 characters
aspect_ratio — 16:9 (default) or 9:16
duration — 4 (default), 8, or 12 seconds
seed — integer 1 to 2,147,483,647 for reproducible results

Pro variants only:

resolution — 720p (default) or 1080p

Image-to-video only:

image — input image filename to animate. The image is referenced from the multipart upload.
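
The constraints above can be checked on the client before a job is submitted. The following is a minimal sketch, not Prodia's own validation: the function name `validate_sora2_config` is hypothetical, and two behaviors are assumptions rather than documented facts, namely that `image` is mandatory for img2vid job types and that `resolution` should be flagged on non-Pro job types.

```python
ASPECT_RATIOS = {"16:9", "9:16"}
DURATIONS = {4, 8, 12}
RESOLUTIONS = {"720p", "1080p"}
MAX_SEED = 2_147_483_647


def validate_sora2_config(job_type: str, config: dict) -> list:
    """Return a list of problems found in config (empty if none)."""
    errors = []
    is_pro = ".pro." in job_type
    is_img2vid = ".img2vid." in job_type

    # prompt is required: 3 to 4,096 characters
    prompt = config.get("prompt")
    if not isinstance(prompt, str) or not 3 <= len(prompt) <= 4096:
        errors.append("prompt must be a string of 3 to 4,096 characters")

    # Optional parameters fall back to the documented defaults
    if config.get("aspect_ratio", "16:9") not in ASPECT_RATIOS:
        errors.append("aspect_ratio must be 16:9 or 9:16")
    if config.get("duration", 4) not in DURATIONS:
        errors.append("duration must be 4, 8, or 12")

    seed = config.get("seed")
    if seed is not None:
        if not isinstance(seed, int) or isinstance(seed, bool) or not 1 <= seed <= MAX_SEED:
            errors.append("seed must be an integer from 1 to 2,147,483,647")

    # Assumption: resolution is treated as an error on non-Pro job types
    if "resolution" in config:
        if not is_pro:
            errors.append("resolution is only available on Pro job types")
        elif config["resolution"] not in RESOLUTIONS:
            errors.append("resolution must be 720p or 1080p")

    # Assumption: img2vid job types cannot run without an image
    if is_img2vid and not config.get("image"):
        errors.append("image is needed for img2vid job types")

    return errors
```

For example, `validate_sora2_config("inference.sora-2.txt2vid.v1", {"prompt": "hi", "duration": 5})` reports two problems: the prompt is too short and the duration is not one of the allowed values.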

Prompting tips

Lead with action, not subject: “A cyclist sprints up a steep hill, pedals out of the saddle” produces better motion than “a cyclist on a hill”
Describe the soundscape (Pro): mention diegetic sound — “tires on gravel”, “wind through pines”, “low ambient room tone” — when using Sora 2 Pro
Cinematographic direction works: terms like “handheld”, “tracking shot”, “rack focus”, “shallow depth of field”, “golden hour” are interpreted as expected
Animation prompts (img2vid): describe how the existing scene should change rather than re-describing it — “the camera dollies in slowly as the subject turns to face it”
Use seeds for iteration: hold the seed constant while you tweak the prompt to see how each phrase changes the output
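
The seed tip can be sketched as a small helper that builds one job per prompt variation while keeping every other setting, including the seed, identical. The helper name `seeded_variations` is hypothetical, and the sketch only constructs job payloads in the shape used on this page; how they are submitted is out of scope here.

```python
import json


def seeded_variations(prompts, seed, job_type="inference.sora-2.txt2vid.v1", **shared):
    """Build one job payload per prompt, all sharing the same seed and settings."""
    jobs = []
    for prompt in prompts:
        # Only the prompt differs between jobs; seed and shared settings are fixed
        config = {"prompt": prompt, "seed": seed, **shared}
        jobs.append({"type": job_type, "config": config})
    return jobs


if __name__ == "__main__":
    jobs = seeded_variations(
        [
            "A cyclist sprints up a steep hill, pedals out of the saddle",
            "A cyclist sprints up a steep hill, pedals out of the saddle, handheld tracking shot",
        ],
        seed=1234,
        aspect_ratio="16:9",
        duration=4,
    )
    print(json.dumps(jobs, indent=2))
```

Comparing the resulting clips side by side shows what the added phrase ("handheld tracking shot" in this case) contributes, since nothing else changed between the two jobs.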

Examples

Text-to-video (Sora 2 standard):

{
  "type": "inference.sora-2.txt2vid.v1",
  "config": {
    "prompt": "A close-up cinematic shot of a golden retriever puppy bounding through a field of wildflowers at sunrise, soft warm light, slow motion",
    "aspect_ratio": "16:9",
    "duration": 4
  }
}

Text-to-video with audio (Sora 2 Pro at 1080p):

{
  "type": "inference.sora-2.pro.txt2vid.v1",
  "config": {
    "prompt": "A barista in an empty cafe pulls an espresso shot at golden hour. The grinder hums, steam hisses, the milk pitcher clinks against the bar. Shallow depth of field, warm window light.",
    "resolution": "1080p",
    "aspect_ratio": "16:9",
    "duration": 8
  }
}

Image-to-video animation:

{
  "type": "inference.sora-2.img2vid.v1",
  "config": {
    "image": "product-shot.jpg",
    "prompt": "Slow turntable rotation, soft studio lighting, the product turns to reveal the back face",
    "aspect_ratio": "16:9",
    "duration": 4
  }
}

Guides

Generating Videos: Step-by-step guide for generating videos with Prodia, including text-to-video and image-to-video examples.