# Wan 2.7
Wan 2.7 is the most capable release in Alibaba’s Wan model family — a single unified model that handles five generation tasks: text-to-image, image-to-image editing, text-to-video, image-to-video animation, and video-to-video continuation. It is the go-to choice when you need maximum flexibility and quality across both image and video modalities.
## Architecture

Wan 2.7 builds on the DiT (Diffusion Transformer) foundation established in the Wan 2.x series, with a refined Flow-Matching framework. Key architectural improvements over earlier versions include:
- Improved motion coherence — reduced temporal flickering on skin, fabric, and moving objects through better physics-based motion consistency
- Bilingual T5 encoder — native support for both Chinese and English prompts with cross-attention in every transformer block
- Prompt extension — an intelligent prompt rewriting system that expands short prompts into detailed scene descriptions for better results (enabled by default, can be disabled)
- Thinking mode — for image generation, an optional reasoning step that improves composition and detail at the cost of longer generation time
## What sets Wan 2.7 apart

Wan 2.7 is uniquely versatile among the models on Prodia:
- Five generation modes in one model — no other model covers txt2img, img2img, txt2vid, img2vid, and vid2vid in a single family
- Audio-driven video — provide a WAV or MP3 file to drive lip-sync and motion timing in img2vid, useful for talking-head videos and music-driven content
- First and last frame control — specify both the starting and ending frames of a video, and the model generates everything in between. This enables loopable videos and precise scene transitions
- Video continuation — extend an existing video clip with vid2vid, maintaining visual consistency while adding new content
- 1080p video output — one of the few open video models supporting full HD generation
- Up to 15-second clips — longer durations than most competing video models
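As a sketch of the first-and-last-frame technique above, a seamless loop can be requested by passing the same file as both the first frame and the target last frame. The filenames and prompt here are illustrative, not from the API reference:

```json
{
  "type": "inference.wan2-7.img2vid.v1",
  "config": {
    "image": "loop-frame.png",
    "last_frame": "loop-frame.png",
    "prompt": "Steam gently rising from a coffee cup",
    "resolution": "720P",
    "duration": 5
  }
}
```

Because the clip ends on the exact frame it started from, it can be played on repeat without a visible seam.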
## When to use Wan 2.7

| Use case | Job type | Why Wan 2.7 |
|---|---|---|
| High-quality image generation | txt2img | Thinking mode produces detailed, well-composed images at up to 2K |
| Image style transfer | img2img | Edit or restyle photos with bilingual prompts |
| Cinematic video from text | txt2vid | 1080p output with 5 aspect ratios and up to 15s duration |
| Talking-head videos | img2vid with audio | Audio-driven lip-sync from a portrait photo |
| Product animations | img2vid | Animate product shots with controlled motion |
| Loopable social content | img2vid with last_frame | Set first = last frame for seamless loops |
| Scene extensions | vid2vid | Continue an existing clip naturally |
If you need faster video generation and can do without audio, frame control, or 1080p output, Wan 2.2 Lightning generates 720p clips in ~22 seconds versus Wan 2.7’s ~200 seconds.
## Job types

| Job type | Description | ETA |
|---|---|---|
| `inference.wan2-7.txt2img.v1` | Generate an image from text | ~40s |
| `inference.wan2-7.img2img.v1` | Edit or restyle an existing image | ~40s |
| `inference.wan2-7.txt2vid.v1` | Generate a video from text | ~200s |
| `inference.wan2-7.img2vid.v1` | Animate an image into a video | ~200s |
| `inference.wan2-7.vid2vid.v1` | Continue or extend a video clip | ~200s |
## Parameters

Common to all job types:

- `prompt` — text description, up to 5,000 characters (required for most job types)
- `seed` — integer 0–2147483647 for reproducible results
Image generation (txt2img, img2img):

- `size` — `1K` (~1024x1024) or `2K` (~2048x2048, default)
- `thinking_mode` — enable reasoning for improved composition (txt2img only, default: true)
- `image` — input image filename (img2img only)
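A minimal img2img job using the parameters above can be sketched as follows; the input filename and prompt are placeholders, not values from the API reference:

```json
{
  "type": "inference.wan2-7.img2img.v1",
  "config": {
    "image": "photo.jpg",
    "prompt": "Restyle as a watercolor painting with a soft pastel palette",
    "size": "2K"
  }
}
```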
Video generation (txt2vid, img2vid, vid2vid):

- `resolution` — `720P` (default) or `1080P`
- `ratio` — aspect ratio: `16:9` (default), `9:16`, `1:1`, `4:3`, `3:4` (txt2vid only)
- `duration` — video length in seconds, 2–15 (default: 5)
- `negative_prompt` — content to exclude, up to 500 characters
- `prompt_extend` — intelligent prompt rewriting (default: true)
Image-to-video additional parameters:

- `image` — first-frame image to animate
- `last_frame` — target last-frame image for start-end interpolation
- `audio` — driving audio for lip-sync and motion timing (WAV/MP3, 2–30s, max 15 MB)
Video-to-video additional parameters:

- `video` — input video clip to continue (MP4/MOV, 2–10s, max 100 MB)
- `last_frame` — target last-frame image to guide the continuation endpoint
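A sketch of a continuation job combining these parameters with the common video options; the filename and prompt are illustrative:

```json
{
  "type": "inference.wan2-7.vid2vid.v1",
  "config": {
    "video": "clip.mp4",
    "prompt": "The camera continues panning across the skyline at dusk",
    "resolution": "1080P",
    "duration": 8,
    "negative_prompt": "low resolution, error, worst quality, deformed"
  }
}
```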
## Prompting tips

- Use prompt extension: leave `prompt_extend` enabled (default) for short prompts — the model will expand them into detailed scene descriptions that produce better results
- Negative prompts matter for video: adding `"low resolution, error, worst quality, deformed"` as a `negative_prompt` noticeably improves video quality
- Bilingual prompts: you can mix Chinese and English in the same prompt for nuanced descriptions
- Aspect ratios: match your target platform — `9:16` for TikTok/Reels, `16:9` for YouTube, `1:1` for Instagram posts
- Audio-driven video: for best lip-sync results, use clear speech audio without background music. Audio longer than the video duration is automatically trimmed
## Examples

Text-to-video at 1080p:

```json
{
  "type": "inference.wan2-7.txt2vid.v1",
  "config": {
    "prompt": "A kitten running in the moonlight",
    "resolution": "1080P",
    "ratio": "16:9",
    "duration": 5,
    "negative_prompt": "low resolution, error, worst quality, deformed"
  }
}
```

Audio-driven talking head:

```json
{
  "type": "inference.wan2-7.img2vid.v1",
  "config": {
    "image": "portrait.jpg",
    "audio": "speech.mp3",
    "prompt": "A person speaking naturally, subtle head movement",
    "resolution": "720P",
    "duration": 10
  }
}
```

Text-to-image with thinking mode:

```json
{
  "type": "inference.wan2-7.txt2img.v1",
  "config": {
    "prompt": "A serene mountain landscape at sunrise with vibrant colors",
    "size": "2K",
    "thinking_mode": true
  }
}
```