Wan 2.7

Wan 2.7 is the most capable release in Alibaba’s Wan model family — a single unified model that handles five generation tasks: text-to-image, image-to-image editing, text-to-video, image-to-video animation, and video-to-video continuation. It is the go-to choice when you need maximum flexibility and quality across both image and video modalities.

Wan 2.7 builds on the DiT (Diffusion Transformer) foundation established in the Wan 2.x series with a refined Flow-Matching framework. Key architectural improvements over earlier versions include:

  • Improved motion coherence — reduced temporal flickering on skin, fabric, and moving objects through better physics-based motion consistency
  • Bilingual T5 encoder — native support for both Chinese and English prompts with cross-attention in every transformer block
  • Prompt extension — an intelligent prompt rewriting system that expands short prompts into detailed scene descriptions for better results (enabled by default, can be disabled)
  • Thinking mode — for image generation, an optional reasoning step that improves composition and detail at the cost of longer generation time

Wan 2.7 is uniquely versatile among the models on Prodia:

  • Five generation modes in one model — no other model on the platform covers txt2img, img2img, txt2vid, img2vid, and vid2vid in a single model
  • Audio-driven video — provide a WAV or MP3 file to drive lip-sync and motion timing in img2vid, useful for talking-head videos and music-driven content
  • First and last frame control — specify both the starting and ending frames of a video, and the model generates everything in between. This enables loopable videos and precise scene transitions
  • Video continuation — extend an existing video clip with vid2vid, maintaining visual consistency while adding new content
  • 1080p video output — one of the few open video models supporting full HD generation
  • Up to 15-second clips — longer durations than most competing video models
| Use case | Job type | Why Wan 2.7 |
| --- | --- | --- |
| High-quality image generation | txt2img | Thinking mode produces detailed, well-composed images at up to 2K |
| Image style transfer | img2img | Edit or restyle photos with bilingual prompts |
| Cinematic video from text | txt2vid | 1080p output with 5 aspect ratios and up to 15s duration |
| Talking-head videos | img2vid with audio | Audio-driven lip-sync from a portrait photo |
| Product animations | img2vid | Animate product shots with controlled motion |
| Loopable social content | img2vid with last_frame | Set first = last frame for seamless loops |
| Scene extensions | vid2vid | Continue an existing clip naturally |

If you need faster video generation at 720p and can do without audio input, frame control, or 1080p output, consider Wan 2.2 Lightning instead: it generates in ~22 seconds versus Wan 2.7's ~200 seconds.

| Job type | Description | ETA |
| --- | --- | --- |
| inference.wan2-7.txt2img.v1 | Generate an image from text | ~40s |
| inference.wan2-7.img2img.v1 | Edit or restyle an existing image | ~40s |
| inference.wan2-7.txt2vid.v1 | Generate a video from text | ~200s |
| inference.wan2-7.img2vid.v1 | Animate an image into a video | ~200s |
| inference.wan2-7.vid2vid.v1 | Continue or extend a video clip | ~200s |

Common to all job types:

  • prompt — text description, up to 5,000 characters (required for most types)
  • seed — integer 0–2147483647 for reproducible results

Image generation (txt2img, img2img):

  • size — 1K (~1024x1024) or 2K (~2048x2048, default)
  • thinking_mode — enable reasoning for improved composition (txt2img only, default: true)
  • image — input image filename (img2img only)
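Putting the image parameters together, an img2img restyle request might look like the following sketch. The filename and prompt are illustrative placeholders, not values from this page:

```json
{
  "type": "inference.wan2-7.img2img.v1",
  "config": {
    "prompt": "Repaint this photo in a soft watercolor style with pastel tones",
    "image": "photo.jpg",
    "size": "2K"
  }
}
```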

Video generation (txt2vid, img2vid, vid2vid):

  • resolution — 720P (default) or 1080P
  • ratio — aspect ratio: 16:9 (default), 9:16, 1:1, 4:3, 3:4 (txt2vid only)
  • duration — video length in seconds, 2–15 (default: 5)
  • negative_prompt — content to exclude, up to 500 characters
  • prompt_extend — intelligent prompt rewriting (default: true)

Image-to-video additional parameters:

  • image — first-frame image to animate
  • last_frame — target last-frame image for start-end interpolation
  • audio — driving audio for lip-sync and motion timing (WAV/MP3, 2–30s, max 15 MB)
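For the loopable-content use case from the table above, one plausible setup is to pass the same image as both the first and last frame so the clip ends where it began. The filenames and prompt here are illustrative:

```json
{
  "type": "inference.wan2-7.img2vid.v1",
  "config": {
    "prompt": "Gentle camera sway over a neon city street at night, seamless loop",
    "image": "frame.jpg",
    "last_frame": "frame.jpg",
    "resolution": "720P",
    "duration": 5
  }
}
```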

Video-to-video additional parameters:

  • video — input video clip to continue (MP4/MOV, 2–10s, max 100 MB)
  • last_frame — target last-frame image to guide the continuation endpoint

Tips for best results:

  • Use prompt extension: leave prompt_extend enabled (default) for short prompts — the model will expand them into detailed scene descriptions that produce better results
  • Negative prompts matter for video: adding "low resolution, error, worst quality, deformed" as a negative_prompt noticeably improves video quality
  • Bilingual prompts: you can mix Chinese and English in the same prompt for nuanced descriptions
  • Aspect ratios: match your target platform — 9:16 for TikTok/Reels, 16:9 for YouTube, 1:1 for Instagram posts
  • Audio-driven video: for best lip-sync results, use clear speech audio without background music. Audio longer than the video duration is automatically trimmed
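Combining the vid2vid parameters with the shared video options, a continuation request might look like this sketch (the filename, prompt, and duration are illustrative):

```json
{
  "type": "inference.wan2-7.vid2vid.v1",
  "config": {
    "prompt": "The camera continues panning right to reveal the coastline at sunset",
    "video": "clip.mp4",
    "resolution": "1080P",
    "duration": 8,
    "negative_prompt": "low resolution, error, worst quality, deformed"
  }
}
```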

Text-to-video at 1080p:

```json
{
  "type": "inference.wan2-7.txt2vid.v1",
  "config": {
    "prompt": "A kitten running in the moonlight",
    "resolution": "1080P",
    "ratio": "16:9",
    "duration": 5,
    "negative_prompt": "low resolution, error, worst quality, deformed"
  }
}
```

Audio-driven talking head:

```json
{
  "type": "inference.wan2-7.img2vid.v1",
  "config": {
    "image": "portrait.jpg",
    "audio": "speech.mp3",
    "prompt": "A person speaking naturally, subtle head movement",
    "resolution": "720P",
    "duration": 10
  }
}
```

Text-to-image with thinking mode:

```json
{
  "type": "inference.wan2-7.txt2img.v1",
  "config": {
    "prompt": "A serene mountain landscape at sunrise with vibrant colors",
    "size": "2K",
    "thinking_mode": true
  }
}
```
}