Kling

Kling is a video generation model family from Kuaishou Technology, the company behind the Kwai short-video platform. Among the video models available on Prodia, Kling stands out for its precise camera control system and motion masking capabilities — features that give you fine-grained control over how subjects and camera move within generated videos.

Kling uses a Diffusion Transformer (DiT) architecture with two proprietary innovations:

  • Custom 3D VAE — Kuaishou’s self-developed spatiotemporal variational autoencoder compresses video data simultaneously across space and time. This approach achieves high reconstruction quality while keeping training efficient — a better balance than encoding spatial and temporal dimensions separately
  • Full-attention spatiotemporal mechanism — instead of processing spatial and temporal features in separate passes, Kling’s attention module integrates both into a single operation. This allows the model to capture local spatial features within frames and temporal dynamics across frames simultaneously, producing more natural motion

Kling offers five model versions, each building on the previous:

| Version | Key improvement | Best for |
| --- | --- | --- |
| kling-v1 | Original release | Baseline quality, fastest |
| kling-v1-6 | Improved motion coherence | General-purpose video |
| kling-v2-master | Major quality leap | High-quality commercial content |
| kling-v2-1 | Refined temporal consistency | Smooth, natural movement |
| kling-v2-1-master | Best overall quality | Maximum quality, commercial use |

For most use cases, start with kling-v2-1-master (the latest) and only switch to earlier versions if you need specific behavior.

Camera choreography: Kling provides direct control over camera movement through five preset types (simple, down_back, forward_up, right_turn_forward, left_turn_forward) and six independent axes (horizontal, vertical, pan, tilt, roll, zoom), each adjustable from -10 to 10. This gives you precise cinematic control that other video models handle only through prompt engineering.

Static and dynamic masks (img2vid):

  • Static masks — define regions of the image that should remain still while the rest animates. Useful for keeping backgrounds stable while a subject moves.
  • Dynamic masks with trajectories — define a mask for a subject and provide a sequence of (x, y) trajectory points. The model will move that subject along the specified path. This enables precise motion planning that prompt-based control can’t achieve.

Standard and Pro modes: Each generation can use std (faster) or pro (higher quality) mode. Pro mode roughly doubles generation time but produces notably better temporal coherence and detail.

Common use cases:

  • Cinematic content — camera choreography controls give you dolly shots, pans, and zooms that other models can only approximate via prompting
  • Product animations — use static masks to keep the product crisp while animating the background
  • Character animation — dynamic masks with trajectories give you frame-by-frame motion control
  • Social media video — native 16:9, 9:16, and 1:1 aspect ratio support for all major platforms

For faster generation without camera/mask control, consider Wan 2.2 Lightning (~22s vs Kling’s ~300s). For audio-driven video or video continuation, Wan 2.7 supports those features.

| Job type | Description | ETA |
| --- | --- | --- |
| inference.kling.txt2vid.v1 | Generate a video from text | ~300s |
| inference.kling.img2vid.v1 | Generate a video from an image | ~300s |

Text-to-video:

  • prompt (required) — text description, up to 2,500 characters
  • model — model version (default: kling-v1)
  • mode — quality mode: std (default) or pro
  • aspect_ratio — 16:9 (default), 9:16, or 1:1
  • duration — 5 (default) or 10 seconds
  • negative_prompt — content to exclude
  • cfg_scale — guidance scale, 0–1 (default: 0.5)
  • camera_control — object with type and optional config:
    • type: simple, down_back, forward_up, right_turn_forward, or left_turn_forward
    • config: object with horizontal, vertical, pan, tilt, roll, zoom (each -10 to 10)
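As an illustrative sketch of how these parameters fit together (the prompt and values below are placeholders, not taken from the Kling documentation), a text-to-video job using negative_prompt and cfg_scale looks like:

```json
{
  "type": "inference.kling.txt2vid.v1",
  "config": {
    "prompt": "A lighthouse on a rocky coast at dusk, waves crashing below",
    "model": "kling-v2-1-master",
    "mode": "std",
    "aspect_ratio": "9:16",
    "duration": "5",
    "negative_prompt": "blur, distortion, text overlays",
    "cfg_scale": 0.5
  }
}
```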

Image-to-video:

  • image — input image filename
  • image_tail — optional last-frame image
  • prompt — text description to guide animation
  • model_name — model version (default: kling-v1)
  • mode — std or pro
  • duration — 5 or 10 seconds
  • static_mask — mask filename for areas that should remain still
  • dynamic_masks — array of {mask, trajectories} objects for guided motion, where each trajectory is a sequence of {x, y} points

Text-to-video with camera control:

```json
{
  "type": "inference.kling.txt2vid.v1",
  "config": {
    "prompt": "A golden retriever running through a sunlit meadow, cinematic slow motion",
    "model": "kling-v2-1-master",
    "mode": "pro",
    "aspect_ratio": "16:9",
    "duration": "5",
    "camera_control": {
      "type": "simple",
      "config": {
        "horizontal": 3,
        "zoom": 2
      }
    }
  }
}
```

Image-to-video with static mask:

```json
{
  "type": "inference.kling.img2vid.v1",
  "config": {
    "model_name": "kling-v2-1-master",
    "image": "product-on-table.jpg",
    "static_mask": "product-mask.png",
    "prompt": "The background transitions from day to night, city lights appear",
    "mode": "pro",
    "duration": "5"
  }
}
```
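Image-to-video with a dynamic mask follows the same shape. As a sketch (the filenames and trajectory coordinates below are illustrative placeholders, not values from the Kling documentation), a subject mask paired with a left-to-right trajectory looks like:

```json
{
  "type": "inference.kling.img2vid.v1",
  "config": {
    "model_name": "kling-v2-1-master",
    "image": "runner-start.jpg",
    "dynamic_masks": [
      {
        "mask": "runner-mask.png",
        "trajectories": [
          { "x": 120, "y": 340 },
          { "x": 260, "y": 310 },
          { "x": 400, "y": 300 }
        ]
      }
    ],
    "prompt": "The runner sprints from left to right across the frame",
    "mode": "pro",
    "duration": "5"
  }
}
```

Each entry in dynamic_masks pairs one mask with its own trajectory, so multiple subjects can be animated along independent paths in a single job.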