Kling
Kling is a video generation model family from Kuaishou Technology, the company behind the Kwai short-video platform. Among the video models available on Prodia, Kling stands out for its precise camera control system and motion masking capabilities — features that give you fine-grained control over how subjects and camera move within generated videos.
Architecture
Kling uses a Diffusion Transformer (DiT) architecture with two proprietary innovations:
- Custom 3D VAE — Kuaishou’s self-developed spatiotemporal variational autoencoder compresses video data simultaneously across space and time. This approach achieves high reconstruction quality while keeping training efficient — a better balance than encoding spatial and temporal dimensions separately
- Full-attention spatiotemporal mechanism — instead of processing spatial and temporal features in separate passes, Kling’s attention module integrates both into a single operation. This allows the model to capture local spatial features within frames and temporal dynamics across frames simultaneously, producing more natural motion
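The cost trade-off between joint and separate attention passes can be illustrated with simple arithmetic. The sketch below is not Kling's implementation: it only counts query-key pairs for a hypothetical latent grid, and the function name and grid size are assumptions chosen for illustration.

```python
def attention_pairs(frames: int, height: int, width: int) -> dict:
    """Count query-key pairs for two attention layouts over a latent video grid."""
    tokens_per_frame = height * width
    total_tokens = frames * tokens_per_frame

    # Full spatiotemporal attention: every token attends to every token,
    # across all frames and all spatial positions at once.
    full = total_tokens ** 2

    # Factorized attention: one spatial pass inside each frame, plus one
    # temporal pass along each spatial position.
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2

    return {"full": full, "factorized": spatial + temporal}


# Hypothetical latent grid: 8 frames of 16x16 tokens.
counts = attention_pairs(frames=8, height=16, width=16)
print(counts)
```

Joint attention compares far more token pairs than the factorized layout, which is why compressing the video with a 3D VAE before attention matters for keeping the token count manageable.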
Model version guide
Kling offers five model versions, each building on the previous:
| Version | Key improvement | Best for |
|---|---|---|
| `kling-v1` | Original release | Baseline quality, fastest |
| `kling-v1-6` | Improved motion coherence | General-purpose video |
| `kling-v2-master` | Major quality leap | High-quality commercial content |
| `kling-v2-1` | Refined temporal consistency | Smooth, natural movement |
| `kling-v2-1-master` | Best overall quality | Maximum quality, commercial use |
For most use cases, start with kling-v2-1-master (the latest) and only switch to earlier versions if you need specific behavior.
Standout features
Camera choreography:
Kling provides direct control over camera movement through five preset types (simple, down_back, forward_up, right_turn_forward, left_turn_forward) and six independent axes (horizontal, vertical, pan, tilt, roll, zoom), each adjustable from -10 to 10. This gives you precise cinematic control that other video models handle only through prompt engineering.
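As a rough illustration, the preset types and axis ranges described above can be checked before a request is sent. The helper name `camera_control` and the choice to raise `ValueError` are assumptions for this sketch; how the service itself reports invalid values is not documented here.

```python
CAMERA_TYPES = {
    "simple", "down_back", "forward_up",
    "right_turn_forward", "left_turn_forward",
}
CAMERA_AXES = {"horizontal", "vertical", "pan", "tilt", "roll", "zoom"}


def camera_control(type_: str, **axes: float) -> dict:
    """Build a camera_control object, rejecting unknown presets or axes
    and any axis value outside the -10 to 10 range."""
    if type_ not in CAMERA_TYPES:
        raise ValueError(f"unknown camera type: {type_!r}")
    for name, value in axes.items():
        if name not in CAMERA_AXES:
            raise ValueError(f"unknown camera axis: {name!r}")
        if not -10 <= value <= 10:
            raise ValueError(f"{name} must be between -10 and 10, got {value}")

    control = {"type": type_}
    if axes:
        # config is optional, so it is only included when axes are given.
        control["config"] = dict(axes)
    return control


print(camera_control("simple", horizontal=3, zoom=2))
```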
Static and dynamic masks (img2vid):
- Static masks — define regions of the image that should remain still while the rest animates. Useful for keeping backgrounds stable while a subject moves.
- Dynamic masks with trajectories — define a mask for a subject and provide a sequence of (x, y) trajectory points. The model will move that subject along the specified path. This enables precise motion planning that prompt-based control can’t achieve.
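A dynamic mask entry pairs a mask with a list of trajectory points. The sketch below generates a straight-line path between two points. The helper `linear_trajectory`, the file name `subject-mask.png`, and the use of integer pixel coordinates are all assumptions for illustration; the coordinate system the service expects is not specified on this page.

```python
def linear_trajectory(start: tuple, end: tuple, steps: int) -> list:
    """Return `steps` evenly spaced {x, y} points from start to end, inclusive."""
    if steps < 2:
        raise ValueError("a trajectory needs at least two points")
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the start point, 1.0 at the end point
        points.append({
            "x": round(x0 + (x1 - x0) * t),
            "y": round(y0 + (y1 - y0) * t),
        })
    return points


# Hypothetical example: move the masked subject left to right along y = 400.
dynamic_masks = [
    {
        "mask": "subject-mask.png",
        "trajectories": linear_trajectory((100, 400), (500, 400), steps=5),
    }
]
print(dynamic_masks)
```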
Standard and Pro modes:
Each generation can use std (faster) or pro (higher quality) mode. Pro mode roughly doubles generation time but produces notably better temporal coherence and detail.
When to use Kling
- Cinematic content: camera choreography controls give you dolly shots, pans, and zooms that other models can only approximate via prompting
- Product animations: use static masks to keep the product crisp while animating the background
- Character animation: dynamic masks with trajectories give you frame-by-frame motion control
- Social media video: native 16:9, 9:16, and 1:1 aspect ratio support for all major platforms
For faster generation without camera/mask control, consider Wan 2.2 Lightning (~22s vs Kling’s ~300s). For audio-driven video or video continuation, Wan 2.7 supports those features.
Job types
| Job type | Description | ETA |
|---|---|---|
| `inference.kling.txt2vid.v1` | Generate a video from text | ~300s |
| `inference.kling.img2vid.v1` | Generate a video from an image | ~300s |
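A minimal sketch of wrapping a config in a job object with the matching job type. The helper `build_job` is hypothetical, and picking the type from the presence of an `image` key is a convenience assumed here, not documented behavior.

```python
def build_job(config: dict) -> dict:
    """Wrap a config in a job object, choosing the job type from whether
    an input image is present."""
    if "image" in config:
        job_type = "inference.kling.img2vid.v1"
    else:
        job_type = "inference.kling.txt2vid.v1"
    return {"type": job_type, "config": config}


text_job = build_job({"prompt": "Waves rolling onto a beach at sunset"})
image_job = build_job({"image": "beach.jpg", "prompt": "Waves start to move"})
print(text_job["type"], image_job["type"])
```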
Parameters
Text-to-video:
- `prompt` (required): text description, up to 2,500 characters
- `model`: model version (default: `kling-v1`)
- `mode`: quality mode, `std` (default) or `pro`
- `aspect_ratio`: `16:9` (default), `9:16`, or `1:1`
- `duration`: `5` (default) or `10` seconds
- `negative_prompt`: content to exclude
- `cfg_scale`: guidance scale, 0–1 (default: 0.5)
- `camera_control`: object with `type` and optional `config`:
  - `type`: `simple`, `down_back`, `forward_up`, `right_turn_forward`, or `left_turn_forward`
  - `config`: object with `horizontal`, `vertical`, `pan`, `tilt`, `roll`, `zoom` (each -10 to 10)
Image-to-video:
- `image`: input image filename
- `image_tail`: optional last-frame image
- `prompt`: text description to guide animation
- `model_name`: model version (default: `kling-v1`)
- `mode`: `std` or `pro`
- `duration`: `5` or `10` seconds
- `static_mask`: mask filename for areas that should remain still
- `dynamic_masks`: array of `{mask, trajectories}` objects for guided motion, where each trajectory is a sequence of `{x, y}` points
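The text-to-video constraints listed above can be checked on the client before submitting a job. This is a sketch only: `validate_txt2vid` is a hypothetical helper, and the service may apply additional or different checks.

```python
def validate_txt2vid(config: dict) -> list:
    """Return a list of problems found in a text-to-video config (empty if none)."""
    problems = []

    prompt = config.get("prompt")
    if not prompt:
        problems.append("prompt is required")
    elif len(prompt) > 2500:
        problems.append("prompt exceeds 2,500 characters")

    if config.get("mode", "std") not in {"std", "pro"}:
        problems.append("mode must be std or pro")

    if config.get("aspect_ratio", "16:9") not in {"16:9", "9:16", "1:1"}:
        problems.append("aspect_ratio must be 16:9, 9:16, or 1:1")

    # Compared as strings so that both 5 and "5" are accepted.
    if str(config.get("duration", "5")) not in {"5", "10"}:
        problems.append("duration must be 5 or 10")

    if not 0 <= config.get("cfg_scale", 0.5) <= 1:
        problems.append("cfg_scale must be between 0 and 1")

    return problems


print(validate_txt2vid({"prompt": "A paper boat drifting down a stream"}))
print(validate_txt2vid({"prompt": "A boat", "mode": "fast", "cfg_scale": 2}))
```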
Examples
Text-to-video with camera control:
    {
      "type": "inference.kling.txt2vid.v1",
      "config": {
        "prompt": "A golden retriever running through a sunlit meadow, cinematic slow motion",
        "model": "kling-v2-1-master",
        "mode": "pro",
        "aspect_ratio": "16:9",
        "duration": "5",
        "camera_control": {
          "type": "simple",
          "config": {
            "horizontal": 3,
            "zoom": 2
          }
        }
      }
    }

Image-to-video with static mask:

    {
      "type": "inference.kling.img2vid.v1",
      "config": {
        "model_name": "kling-v2-1-master",
        "image": "product-on-table.jpg",
        "static_mask": "product-mask.png",
        "prompt": "The background transitions from day to night, city lights appear",
        "mode": "pro",
        "duration": "5"
      }
    }
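A text-to-video job like the one above might be prepared for submission as follows. This page does not give the endpoint URL or the authentication scheme, so the URL, the token, the bearer-style `Authorization` header, and the helper `build_request` are all placeholder assumptions. The request is built but not sent.

```python
import json
import urllib.request


def build_request(job: dict, url: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying the job as JSON."""
    body = json.dumps(job).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; check your API documentation.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


job = {
    "type": "inference.kling.txt2vid.v1",
    "config": {
        "prompt": "A golden retriever running through a sunlit meadow",
        "model": "kling-v2-1-master",
        "mode": "pro",
        "duration": "5",
    },
}

# Placeholder URL and token: substitute the real values before sending.
request = build_request(job, "https://api.example.com/job", "YOUR_TOKEN")
print(request.get_method(), len(request.data), "bytes")
```

Image-to-video jobs reference input files by filename, so they likely need the files uploaded alongside the JSON; that part is outside the scope of this sketch.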