Wan 2.7

Wan 2.7 is the most capable release in Alibaba’s Wan model family — a single unified model that handles five generation tasks: text-to-image, image-to-image editing, text-to-video, image-to-video animation, and video-to-video continuation. It is the go-to choice when you need maximum flexibility and quality across both image and video modalities.

Wan 2.7 builds on the DiT (Diffusion Transformer) foundation established in the Wan 2.x series with a refined Flow-Matching framework. Key architectural improvements over earlier versions include:

  • Improved motion coherence — reduced temporal flickering on skin, fabric, and moving objects through better physics-based motion consistency
  • Bilingual T5 encoder — native support for both Chinese and English prompts with cross-attention in every transformer block
  • Prompt extension — an intelligent prompt rewriting system that expands short prompts into detailed scene descriptions for better results (enabled by default, can be disabled)
  • Thinking mode — for image generation, an optional reasoning step that improves composition and detail at the cost of longer generation time

Wan 2.7 is uniquely versatile among the models on Prodia:

  • Five generation modes in one model — no other model on the platform covers txt2img, img2img, txt2vid, img2vid, and vid2vid in a single model
  • Audio-driven video — provide a WAV or MP3 file to drive lip-sync and motion timing in img2vid, useful for talking-head videos and music-driven content
  • First and last frame control — specify both the starting and ending frames of a video, and the model generates everything in between. This enables loopable videos and precise scene transitions
  • Video continuation — extend an existing video clip with vid2vid, maintaining visual consistency while adding new content
  • 1080p video output — one of the few open video models supporting full HD generation
  • Up to 15-second clips — longer durations than most competing video models
| Use case | Job type | Why Wan 2.7 |
| --- | --- | --- |
| High-quality image generation | txt2img | Thinking mode produces detailed, well-composed images at up to 2K |
| Image style transfer | img2img | Edit or restyle photos with bilingual prompts |
| Cinematic video from text | txt2vid | 1080p output with 5 aspect ratios and up to 15s duration |
| Talking-head videos | img2vid with audio | Audio-driven lip-sync from a portrait photo |
| Product animations | img2vid | Animate product shots with controlled motion |
| Loopable social content | img2vid with last_frame | Set first = last frame for seamless loops |
| Scene extensions | vid2vid | Continue an existing clip naturally |

If you need faster video generation at 720p and can do without audio input, frame control, or 1080p output, consider Wan 2.2 Lightning instead: it generates in ~22 seconds versus Wan 2.7's ~200 seconds.

| Job type | Description | ETA |
| --- | --- | --- |
| inference.wan2-7.txt2img.v1 | Generate an image from text | ~40s |
| inference.wan2-7.img2img.v1 | Edit or restyle an existing image | ~40s |
| inference.wan2-7.txt2vid.v1 | Generate a video from text | ~200s |
| inference.wan2-7.img2vid.v1 | Animate an image into a video | ~200s |
| inference.wan2-7.vid2vid.v1 | Continue or extend a video clip | ~200s |

Common to all job types:

  • prompt — text description, up to 5,000 characters (required for most types)
  • seed — integer 0–2147483647 for reproducible results

Image generation (txt2img, img2img):

  • size — 1K (~1024x1024) or 2K (~2048x2048, default)
  • thinking_mode — enable reasoning for improved composition (txt2img only, default: true)
  • image — input image filename (img2img only)
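Putting the image parameters together, an img2img restyle request might look like the following sketch. The filename and prompt are illustrative placeholders, not values from this page:

```json
{
  "type": "inference.wan2-7.img2img.v1",
  "config": {
    "prompt": "Repaint this photo in a soft watercolor style with pastel tones",
    "image": "photo.jpg",
    "size": "2K"
  }
}
```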

Video generation (txt2vid, img2vid, vid2vid):

  • resolution — 720P (default) or 1080P
  • ratio — aspect ratio: 16:9 (default), 9:16, 1:1, 4:3, 3:4 (txt2vid only)
  • duration — video length in seconds, 2–15 (default: 5)
  • negative_prompt — content to exclude, up to 500 characters
  • prompt_extend — intelligent prompt rewriting (default: true)

Image-to-video additional parameters:

  • image — first-frame image to animate
  • last_frame — target last-frame image for start-end interpolation
  • audio — driving audio for lip-sync and motion timing (WAV/MP3, 2–30s, max 15 MB)
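For the loopable-content use case from the table above, one plausible setup is to pass the same image as both the first and last frame so the clip ends where it began. The filenames and prompt here are illustrative:

```json
{
  "type": "inference.wan2-7.img2vid.v1",
  "config": {
    "prompt": "Gentle camera sway over a neon city street at night, seamless loop",
    "image": "frame.jpg",
    "last_frame": "frame.jpg",
    "resolution": "720P",
    "duration": 5
  }
}
```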

Video-to-video additional parameters:

  • video — input video clip to continue (MP4/MOV, 2–10s, max 100 MB)
  • last_frame — target last-frame image to guide the continuation endpoint

Tips for best results:

  • Use prompt extension: leave prompt_extend enabled (default) for short prompts — the model will expand them into detailed scene descriptions that produce better results
  • Negative prompts matter for video: adding "low resolution, error, worst quality, deformed" as a negative_prompt noticeably improves video quality
  • Bilingual prompts: you can mix Chinese and English in the same prompt for nuanced descriptions
  • Aspect ratios: match your target platform — 9:16 for TikTok/Reels, 16:9 for YouTube, 1:1 for Instagram posts
  • Audio-driven video: for best lip-sync results, use clear speech audio without background music. Audio longer than the video duration is automatically trimmed
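Combining the vid2vid parameters with the shared video options, a continuation request might look like this sketch (the filename, prompt, and duration are illustrative):

```json
{
  "type": "inference.wan2-7.vid2vid.v1",
  "config": {
    "prompt": "The camera continues panning right to reveal the coastline at sunset",
    "video": "clip.mp4",
    "resolution": "1080P",
    "duration": 8,
    "negative_prompt": "low resolution, error, worst quality, deformed"
  }
}
```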

Text-to-video at 1080p:

```json
{
  "type": "inference.wan2-7.txt2vid.v1",
  "config": {
    "prompt": "A kitten running in the moonlight",
    "resolution": "1080P",
    "ratio": "16:9",
    "duration": 5,
    "negative_prompt": "low resolution, error, worst quality, deformed"
  }
}
```

Audio-driven talking head:

```json
{
  "type": "inference.wan2-7.img2vid.v1",
  "config": {
    "image": "portrait.jpg",
    "audio": "speech.mp3",
    "prompt": "A person speaking naturally, subtle head movement",
    "resolution": "720P",
    "duration": 10
  }
}
```

Text-to-image with thinking mode:

```json
{
  "type": "inference.wan2-7.txt2img.v1",
  "config": {
    "prompt": "A serene mountain landscape at sunrise with vibrant colors",
    "size": "2K",
    "thinking_mode": true
  }
}
```
}