
Wan 2.2 Lightning

Wan 2.2 Lightning is Alibaba’s fast video generation model from the Wan family. It generates a video in around 22 seconds, making it practical for interactive applications and high-volume pipelines.

Wan 2.2 Lightning uses 14B active parameters in a Diffusion Transformer (DiT) architecture with three key components:

  • T5 text encoder — encodes multilingual text input with cross-attention in each transformer block
  • Spatiotemporal 3D VAE — compresses video frames simultaneously across space and time, dramatically reducing compute requirements
  • DiT backbone — processes the compressed latent space with shared MLP modules across transformer blocks
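To see why the VAE compression matters for compute, here is a toy NumPy sketch of the text-encoding and compression stages. The shapes, the 4× temporal and 8×8 spatial factors, and the block-average "encoder" are illustrative stand-ins only; the real components are learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def t5_encode(text):
    # Stand-in for the T5 text encoder: one embedding per token,
    # consumed via cross-attention in each transformer block.
    return rng.standard_normal((len(text.split()), 16))

def vae_encode(frames):
    # Stand-in for the spatiotemporal 3D VAE: block-averages frames
    # with illustrative 4x temporal and 8x8 spatial factors.
    t, h, w, c = frames.shape
    blocks = frames.reshape(t // 4, 4, h // 8, 8, w // 8, 8, c)
    return blocks.mean(axis=(1, 3, 5))

text_emb = t5_encode("a cat walking through a garden")
frames = rng.standard_normal((16, 64, 64, 3))   # (T, H, W, C) toy video
latent = vae_encode(frames)                     # (4, 8, 8, 3): 256x fewer values
```

Because the DiT backbone only ever sees the compressed latent, every denoising step operates on 256× fewer values than raw frames in this toy setup.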

Wan 2.2 Lightning generates video in just 4 diffusion steps without requiring classifier-free guidance (CFG), enabling fast generation while maintaining strong visual quality.
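As a rough illustration of few-step, CFG-free sampling, the loop below runs a fixed number of denoising steps with a single conditional forward pass per step. `toy_denoiser` is a stand-in for the DiT backbone, not the real model; the update rule is invented for the sketch.

```python
import numpy as np

def toy_denoiser(latent, t, cond):
    # Stand-in for the DiT backbone: nudges the latent toward the
    # conditioning. The real model predicts noise, not this update.
    return 0.5 * (cond - latent)

def sample(cond, steps=4, seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(cond.shape)      # start from pure noise
    for t in np.linspace(1.0, 0.0, steps, endpoint=False):
        # One forward pass per step; no second unconditional pass,
        # so no classifier-free guidance mixing is needed.
        latent = latent + toy_denoiser(latent, t, cond)
    return latent

cond = np.ones(8)              # stand-in text conditioning
video_latent = sample(cond)    # 4 denoising steps total
```

Skipping the unconditional pass that CFG requires halves the forward passes per step, which is part of what makes a 4-step schedule so fast.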

Common use cases:

  • Social media content — fast turnaround for short-form video at 720p
  • Rapid prototyping — quickly test video concepts before committing to longer, higher-quality generation
  • High-volume pipelines — the ~22s generation time makes batch processing practical
  • Image animation — bring product shots, illustrations, or photos to life with the img2vid mode

For higher resolution (1080p), longer duration (up to 15s), or features like audio-driven lip sync and video continuation, consider Wan 2.7 instead.

Job type                               Description
inference.wan2-2.lightning.txt2vid.v0  Generate a video from a text prompt
inference.wan2-2.lightning.img2vid.v0  Generate a video from an input image and prompt

Parameters:

  • prompt (required) — text description of the video to generate, up to 2,500 characters
  • resolution — output resolution: 720p (default, 1280x720) or 480p (832x480)
  • seed — integer seed for reproducible results
  • image (img2vid only) — filename of the input image to animate
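A small helper can assemble a job payload and enforce the documented limits before submission. The `build_txt2vid_job` function and its validation rules are illustrative, not part of any official SDK:

```python
import json

MAX_PROMPT_CHARS = 2500
VALID_RESOLUTIONS = {"720p", "480p"}

def build_txt2vid_job(prompt, resolution="720p", seed=None):
    # Illustrative helper, not an official API: builds the job
    # payload and checks the documented parameter constraints.
    if not prompt:
        raise ValueError("prompt is required")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    if resolution not in VALID_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(VALID_RESOLUTIONS)}")
    config = {"prompt": prompt, "resolution": resolution}
    if seed is not None:
        config["seed"] = int(seed)   # fixed seed for reproducible results
    return {"type": "inference.wan2-2.lightning.txt2vid.v0", "config": config}

job = build_txt2vid_job("A cat walking slowly through a garden", seed=42)
payload = json.dumps(job)
```

Validating locally catches over-long prompts or unsupported resolutions before a job is queued.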

Wan 2.2 Lightning responds well to specific, action-oriented prompts. Include details about movement, camera angle, and visual style:

  • Be specific about motion: “A cat walking slowly through a garden” works better than “a cat in a garden”
  • Include visual style cues: “cinematic lighting”, “slow motion”, “4k” help guide quality
  • Describe camera movement: “tracking shot”, “pan left”, “aerial view” improve spatial coherence
  • Keep it concise: the model performs best with focused, clear prompts rather than long descriptions
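One way to keep prompts focused is to assemble them from the elements above: subject, motion, camera, and style. The helper below is purely illustrative:

```python
def compose_prompt(subject, motion, camera=None, style=None):
    # Illustrative helper: joins the prompt elements recommended above,
    # keeping the result short and action-oriented.
    parts = [f"{subject} {motion}"]
    if camera:
        parts.append(camera)
    if style:
        parts.append(style)
    return ", ".join(parts)

prompt = compose_prompt("A cat", "walking slowly through a garden",
                        camera="tracking shot", style="cinematic lighting")
# -> "A cat walking slowly through a garden, tracking shot, cinematic lighting"
```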

Text-to-video:

{
  "type": "inference.wan2-2.lightning.txt2vid.v0",
  "config": {
    "prompt": "Two anthropomorphic cats boxing on a spotlighted stage, cinematic lighting, dynamic camera angles",
    "resolution": "720p"
  }
}

Image-to-video (animate a still image):

{
  "type": "inference.wan2-2.lightning.img2vid.v0",
  "config": {
    "prompt": "The person slowly turns their head and smiles, natural movement",
    "image": "portrait.jpg",
    "resolution": "720p"
  }
}