Kling

Kling is a video generation model family from Kuaishou Technology, the company behind the Kwai short-video platform. Among the video models available on Prodia, Kling stands out for its precise camera control system and motion masking capabilities — features that give you fine-grained control over how subjects and camera move within generated videos.

Kling uses a Diffusion Transformer (DiT) architecture with two proprietary innovations:

  • Custom 3D VAE — Kuaishou’s self-developed spatiotemporal variational autoencoder compresses video data simultaneously across space and time. This approach achieves high reconstruction quality while keeping training efficient — a better balance than encoding spatial and temporal dimensions separately
  • Full-attention spatiotemporal mechanism — instead of processing spatial and temporal features in separate passes, Kling’s attention module integrates both into a single operation. This allows the model to capture local spatial features within frames and temporal dynamics across frames simultaneously, producing more natural motion

Kling offers five model versions, each building on the previous:

| Version | Key improvement | Best for |
| --- | --- | --- |
| kling-v1 | Original release | Baseline quality, fastest |
| kling-v1-6 | Improved motion coherence | General-purpose video |
| kling-v2-master | Major quality leap | High-quality commercial content |
| kling-v2-1 | Refined temporal consistency | Smooth, natural movement |
| kling-v2-1-master | Best overall quality | Maximum quality, commercial use |

For most use cases, start with kling-v2-1-master (the latest) and only switch to earlier versions if you need specific behavior.

Camera choreography: Kling provides direct control over camera movement through five preset types (simple, down_back, forward_up, right_turn_forward, left_turn_forward) and six independent axes (horizontal, vertical, pan, tilt, roll, zoom), each adjustable from -10 to 10. This gives you precise cinematic control that other video models handle only through prompt engineering.

Static and dynamic masks (img2vid):

  • Static masks — define regions of the image that should remain still while the rest animates. Useful for keeping backgrounds stable while a subject moves.
  • Dynamic masks with trajectories — define a mask for a subject and provide a sequence of (x, y) trajectory points. The model will move that subject along the specified path. This enables precise motion planning that prompt-based control can’t achieve.

Standard and Pro modes: Each generation can use std (faster) or pro (higher quality) mode. Pro mode roughly doubles generation time but produces notably better temporal coherence and detail.

Common use cases:

  • Cinematic content — camera choreography controls give you dolly shots, pans, and zooms that other models can only approximate via prompting
  • Product animations — use static masks to keep the product crisp while animating the background
  • Character animation — dynamic masks with trajectories give you frame-by-frame motion control
  • Social media video — native 16:9, 9:16, and 1:1 aspect ratio support for all major platforms

For faster generation without camera/mask control, consider Wan 2.2 Lightning (~22s vs Kling’s ~300s). For audio-driven video or video continuation, Wan 2.7 supports those features.

| Job type | Description | ETA |
| --- | --- | --- |
| inference.kling.txt2vid.v1 | Generate a video from text | ~300s |
| inference.kling.img2vid.v1 | Generate a video from an image | ~300s |

Text-to-video:

  • prompt (required) — text description, up to 2,500 characters
  • model — model version (default: kling-v1)
  • mode — quality mode: std (default) or pro
  • aspect_ratio — 16:9 (default), 9:16, or 1:1
  • duration — 5 (default) or 10 seconds
  • negative_prompt — content to exclude
  • cfg_scale — guidance scale, 0–1 (default: 0.5)
  • camera_control — object with type and optional config:
    • type: simple, down_back, forward_up, right_turn_forward, or left_turn_forward
    • config: object with horizontal, vertical, pan, tilt, roll, zoom (each -10 to 10)
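As an illustrative sketch of how these parameters fit together (the prompt and values below are placeholders, not taken from the Kling documentation), a text-to-video job using negative_prompt and cfg_scale looks like:

```json
{
  "type": "inference.kling.txt2vid.v1",
  "config": {
    "prompt": "A lighthouse on a rocky coast at dusk, waves crashing below",
    "model": "kling-v2-1-master",
    "mode": "std",
    "aspect_ratio": "9:16",
    "duration": "5",
    "negative_prompt": "blur, distortion, text overlays",
    "cfg_scale": 0.5
  }
}
```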

Image-to-video:

  • image — input image filename
  • image_tail — optional last-frame image
  • prompt — text description to guide animation
  • model_name — model version (default: kling-v1)
  • mode — std or pro
  • duration — 5 or 10 seconds
  • static_mask — mask filename for areas that should remain still
  • dynamic_masks — array of {mask, trajectories} objects for guided motion, where each trajectory is a sequence of {x, y} points

Text-to-video with camera control:

```json
{
  "type": "inference.kling.txt2vid.v1",
  "config": {
    "prompt": "A golden retriever running through a sunlit meadow, cinematic slow motion",
    "model": "kling-v2-1-master",
    "mode": "pro",
    "aspect_ratio": "16:9",
    "duration": "5",
    "camera_control": {
      "type": "simple",
      "config": {
        "horizontal": 3,
        "zoom": 2
      }
    }
  }
}
```

Image-to-video with static mask:

```json
{
  "type": "inference.kling.img2vid.v1",
  "config": {
    "model_name": "kling-v2-1-master",
    "image": "product-on-table.jpg",
    "static_mask": "product-mask.png",
    "prompt": "The background transitions from day to night, city lights appear",
    "mode": "pro",
    "duration": "5"
  }
}
```
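Image-to-video with a dynamic mask follows the same shape. As a sketch (the filenames and trajectory coordinates below are illustrative placeholders, not values from the Kling documentation), a subject mask paired with a left-to-right trajectory looks like:

```json
{
  "type": "inference.kling.img2vid.v1",
  "config": {
    "model_name": "kling-v2-1-master",
    "image": "runner-start.jpg",
    "dynamic_masks": [
      {
        "mask": "runner-mask.png",
        "trajectories": [
          { "x": 120, "y": 340 },
          { "x": 260, "y": 310 },
          { "x": 400, "y": 300 }
        ]
      }
    ],
    "prompt": "The runner sprints from left to right across the frame",
    "mode": "pro",
    "duration": "5"
  }
}
```

Each entry in dynamic_masks pairs one mask with its own trajectory, so multiple subjects can be animated along independent paths in a single job.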