
Sora 2

Sora 2 is OpenAI’s video generation model. Compared to first-generation video models, Sora 2 produces noticeably stronger physical realism — cause-and-effect plays out plausibly, missed shots ricochet rather than teleport, and characters behave consistently across cuts. The Pro variant generates a synchronized audio track alongside the video.

Sora 2 ships in two variants on Prodia:

| Variant | Resolution | Audio | Best for |
| --- | --- | --- | --- |
| Sora 2 | Fixed (720p-class) | No | Rapid iteration, lower cost |
| Sora 2 Pro | 720p or 1080p | Yes (synchronized) | Final output, content with dialogue or sound design |

Both variants accept text-to-video and image-to-video inputs and produce 4-, 8-, or 12-second clips.

Physical plausibility: Sora 2’s defining property is that it gets physics roughly right — basketballs miss the rim and bounce off the backboard, water settles, momentum carries through a gesture. Earlier text-to-video models tend to bend the world to satisfy the prompt, deleting or warping objects to make the requested outcome happen. Sora 2 is more willing to let an action fail, which produces more usable footage for narrative content.

Synchronized audio (Pro): Sora 2 Pro generates the audio track jointly with the video, so dialogue, footsteps, and ambient sound line up with what’s happening on screen. There’s no separate foley pass. Ambient cues described in the prompt (for example “rain on a metal roof”, “crowd chatter”) are produced as part of the same generation.

Steerable via prompt: The model responds well to detailed cinematographic direction — shot type, lens, lighting, camera motion — and to multi-shot prompts that specify a sequence of scenes within a single clip.
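As a rough illustration of a multi-shot prompt, the sketch below joins several shot descriptions into a single prompt string. The `build_multishot_prompt` helper and the "Shot N:" labelling are assumptions made for this example, not a format the model is documented to require. The length check uses the 3 to 4,096 character limit documented for `prompt`.

```python
MIN_PROMPT_CHARS = 3
MAX_PROMPT_CHARS = 4096


def build_multishot_prompt(shots: list[str]) -> str:
    """Join shot descriptions into one prompt.

    The "Shot N:" prefix is an assumed convention for readability only.
    """
    prompt = " ".join(
        f"Shot {i}: {shot.strip()}" for i, shot in enumerate(shots, start=1)
    )
    if not (MIN_PROMPT_CHARS <= len(prompt) <= MAX_PROMPT_CHARS):
        raise ValueError(
            f"prompt must be {MIN_PROMPT_CHARS} to {MAX_PROMPT_CHARS} characters, "
            f"got {len(prompt)}"
        )
    return prompt


prompt = build_multishot_prompt([
    "Wide tracking shot of a harbor at dawn, handheld, golden hour.",
    "Close-up on a fisherman's hands coiling rope, shallow depth of field.",
])
print(prompt)
```

Keeping each shot as its own string makes it easy to reorder or swap a single shot between generations.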

Image-to-video animation: The img2vid job types accept a still image and a motion prompt describing how the scene should evolve. Useful for animating product shots, character portraits, or storyboard frames.

When to use Sora 2:

  • Narrative content with dialogue or sound: Sora 2 Pro is the right choice when you need an audio track baked in
  • Action sequences: sports, stunts, and physics-driven scenes benefit from Sora 2’s grounded behavior
  • Storyboard animation: animate a still image into a short clip without a separate foley step
  • Vertical and landscape social: both 16:9 and 9:16 are supported natively

For longer clips with audio control or audio-driven generation, see Wan 2.7. For the fastest video generation on Prodia (~22s), use Wan 2.2 Lightning. For precise camera choreography (dolly, pan, zoom presets) or motion masking, see Kling. For another joint audio-visual model with last-frame transition control, see Veo.

| Job type | Description | Audio | Resolution |
| --- | --- | --- | --- |
| inference.sora-2.txt2vid.v1 | Sora 2 text-to-video | No | Fixed |
| inference.sora-2.img2vid.v1 | Sora 2 image-to-video | No | Fixed |
| inference.sora-2.pro.txt2vid.v1 | Sora 2 Pro text-to-video | Yes | 720p or 1080p |
| inference.sora-2.pro.img2vid.v1 | Sora 2 Pro image-to-video | Yes | 720p or 1080p |
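The four job type strings follow a regular pattern, so client code can derive them from two choices instead of hard-coding each one. The helper below is a minimal sketch. `sora2_job_type` is a hypothetical name, and the pattern is inferred only from the four job types listed above.

```python
def sora2_job_type(mode: str, pro: bool = False) -> str:
    """Return the job type string for a Sora 2 variant and input mode.

    mode: "txt2vid" for text-to-video, "img2vid" for image-to-video.
    pro:  True selects the Pro variant (audio, 720p or 1080p).
    """
    if mode not in ("txt2vid", "img2vid"):
        raise ValueError(f"mode must be 'txt2vid' or 'img2vid', got {mode!r}")
    variant = "sora-2.pro" if pro else "sora-2"
    return f"inference.{variant}.{mode}.v1"


print(sora2_job_type("txt2vid"))
print(sora2_job_type("img2vid", pro=True))
```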

Common to all Sora 2 job types:

  • prompt (required): text description, 3 to 4,096 characters
  • aspect_ratio: 16:9 (default) or 9:16
  • duration: 4 (default), 8, or 12 seconds
  • seed: integer from 1 to 2,147,483,647, for reproducible results

Pro variants only:

  • resolution: 720p (default) or 1080p
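A client-side check can catch out-of-range values before a job is submitted. The sketch below encodes the limits listed above. `validate_sora2_config` is a hypothetical helper, and treating `resolution` on a standard job as an error is an assumption, since the actual API behavior for that case is not stated here.

```python
ASPECT_RATIOS = {"16:9", "9:16"}
DURATIONS = {4, 8, 12}
RESOLUTIONS = {"720p", "1080p"}
MAX_SEED = 2_147_483_647


def validate_sora2_config(config: dict, pro: bool = False) -> list[str]:
    """Return a list of problems found in a Sora 2 job config (empty if none)."""
    errors = []

    prompt = config.get("prompt")
    if not isinstance(prompt, str) or not (3 <= len(prompt) <= 4096):
        errors.append("prompt is required and must be 3 to 4,096 characters")

    # Omitted optional fields fall back to the documented defaults.
    if config.get("aspect_ratio", "16:9") not in ASPECT_RATIOS:
        errors.append("aspect_ratio must be 16:9 or 9:16")
    if config.get("duration", 4) not in DURATIONS:
        errors.append("duration must be 4, 8, or 12")

    seed = config.get("seed")
    if seed is not None:
        is_int = isinstance(seed, int) and not isinstance(seed, bool)
        if not (is_int and 1 <= seed <= MAX_SEED):
            errors.append(f"seed must be an integer from 1 to {MAX_SEED}")

    if "resolution" in config:
        if not pro:
            # Assumption: resolution is listed for Pro variants only.
            errors.append("resolution is only listed for the Pro variants")
        elif config["resolution"] not in RESOLUTIONS:
            errors.append("resolution must be 720p or 1080p")

    return errors


print(validate_sora2_config({"prompt": "A dog runs on a beach", "duration": 5}))
```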

Image-to-video only:

  • image: input image filename to animate. The image is referenced from the multipart upload.

Prompting tips:

  • Lead with action, not subject: “A cyclist sprints up a steep hill, pedals out of the saddle” produces better motion than “a cyclist on a hill”
  • Describe the soundscape (Pro): mention diegetic sound — “tires on gravel”, “wind through pines”, “low ambient room tone” — when using Sora 2 Pro
  • Cinematographic direction works: terms like “handheld”, “tracking shot”, “rack focus”, “shallow depth of field”, “golden hour” are interpreted as expected
  • Animation prompts (img2vid): describe how the existing scene should change rather than re-describing it — “the camera dollies in slowly as the subject turns to face it”
  • Use seeds for iteration: hold the seed constant while you tweak the prompt to see how each phrase changes the output
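The seed tip can be sketched as a small batch builder: every job shares one seed and one set of settings, and only the prompt varies. `prompt_variants` is a hypothetical helper, and the seed value and prompts are arbitrary examples. Submitting the jobs is left out because it depends on the client in use.

```python
import json


def prompt_variants(base_config: dict, prompts: list[str], seed: int) -> list[dict]:
    """Build one text-to-video job per prompt, all sharing the same seed."""
    jobs = []
    for prompt in prompts:
        # Copy the shared settings, then set the varying prompt and the fixed seed.
        config = {**base_config, "prompt": prompt, "seed": seed}
        jobs.append({"type": "inference.sora-2.txt2vid.v1", "config": config})
    return jobs


jobs = prompt_variants(
    {"aspect_ratio": "16:9", "duration": 4},
    [
        "A cyclist sprints up a steep hill, pedals out of the saddle",
        "A cyclist sprints up a steep hill, pedals out of the saddle, handheld tracking shot",
    ],
    seed=1234,
)
for job in jobs:
    print(json.dumps(job))
```

Comparing the resulting clips side by side shows what the added phrase changed, since the seed and other settings are held constant.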

Text-to-video (Sora 2 standard):

{
  "type": "inference.sora-2.txt2vid.v1",
  "config": {
    "prompt": "A close-up cinematic shot of a golden retriever puppy bounding through a field of wildflowers at sunrise, soft warm light, slow motion",
    "aspect_ratio": "16:9",
    "duration": 4
  }
}

Text-to-video with audio (Sora 2 Pro at 1080p):

{
  "type": "inference.sora-2.pro.txt2vid.v1",
  "config": {
    "prompt": "A barista in an empty cafe pulls an espresso shot at golden hour. The grinder hums, steam hisses, the milk pitcher clinks against the bar. Shallow depth of field, warm window light.",
    "resolution": "1080p",
    "aspect_ratio": "16:9",
    "duration": 8
  }
}

Image-to-video animation:

{
  "type": "inference.sora-2.img2vid.v1",
  "config": {
    "image": "product-shot.jpg",
    "prompt": "Slow turntable rotation, soft studio lighting, the product turns to reveal the back face",
    "aspect_ratio": "16:9",
    "duration": 4
  }
}
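The examples above do not cover the fourth job type, Pro image-to-video. The sketch below builds such a job in Python and prints it as JSON. It combines only parameters documented on this page. The filename and prompt are made-up placeholders, and the upload step that supplies the image is not shown.

```python
import json

job = {
    "type": "inference.sora-2.pro.img2vid.v1",
    "config": {
        # Hypothetical filename; it should match the file sent in the multipart upload.
        "image": "portrait.jpg",
        # Describe how the scene changes, and include the soundscape for Pro.
        "prompt": (
            "The camera dollies in slowly as the subject turns to face it. "
            "Low ambient room tone, a chair creaks."
        ),
        "resolution": "1080p",
        "aspect_ratio": "9:16",
        "duration": 8,
    },
}

print(json.dumps(job, indent=2))
```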