SDXL

SDXL (Stable Diffusion XL) is the model to reach for when you need fast, cheap image generation with fine-grained control. It produces 1024x1024 images in under two seconds and costs a fraction of a cent per image, making it well-suited to high-volume workflows, experimentation, and applications where iteration speed matters more than absolute fidelity. It also exposes more knobs than most newer models — explicit steps, guidance, seed, and negative_prompt parameters give you direct control over the diffusion process.

SDXL is a latent-diffusion model built around a U-Net (~2.6B parameters) operating in the latent space of a VAE. It uses two text encoders in parallel — OpenAI’s CLIP ViT-L and the larger OpenCLIP ViT-bigG — and concatenates their outputs to form a richer text embedding (~3.5B total parameters across the pipeline). This dual-encoder setup is the main reason SDXL produces noticeably better composition and prompt adherence than its 1.5 predecessor while remaining small enough to run on a single consumer GPU.

The model includes an optional refiner stage — a second U-Net specialised for the final denoising steps. Enabling refiner: true improves fine detail (skin texture, foliage, fabric) at the cost of extra latency.
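
The hosted job types below hide all of this, but the same architecture is published as open weights. Purely as an illustration of the base-then-refiner handoff (not part of this API), here is a minimal sketch using the open-source diffusers library and the public Stability AI model IDs:

# Illustrative only: runs SDXL base for most of the denoising schedule,
# then hands the latents to the refiner for the final steps.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a majestic snow leopard on a frozen mountain ledge"
# Base covers the first 80% of the schedule and emits latents.
latents = base(prompt=prompt, num_inference_steps=25,
               denoising_end=0.8, output_type="latent").images
# Refiner picks up at the same point and finishes the denoising.
image = refiner(prompt=prompt, num_inference_steps=25,
                denoising_start=0.8, image=latents).images[0]
image.save("snow_leopard.png")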

SDXL is a good fit for:

  • High-volume generation — sub-2s ETAs and ~$0.002 per image make SDXL viable for batch jobs, A/B tests, and user-facing prototypes
  • Style-controlled output — the style_preset parameter applies one of 17 baked-in styles (anime, photographic, neon-punk, line-art, etc.) without hand-crafting prompts
  • Negative prompting — explicit negative_prompt support is rare among newer models; useful when you need to exclude specific subjects, artifacts, or qualities
  • Image editing without retraining — img2img and inpainting variants share the same backbone, so style and quality stay consistent across the workflow
  • Reproducibility-critical work — the seed parameter combined with deterministic steps makes SDXL easy to use for caching, regression tests, and side-by-side comparisons
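
The reproducibility point is worth operationalising: with seed and the rest of the config fixed, identical configs produce identical images, so the config itself can serve as a cache key. A minimal sketch (the image store and job client are yours to supply):

# Derives a stable cache key from a job config. Because a fixed seed makes
# generation deterministic, identical configs can be served from cache.
import hashlib
import json

def cache_key(job_type, config):
    # Canonicalise so key order in the dict doesn't change the hash.
    payload = json.dumps({"type": job_type, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

config = {"prompt": "a red fox in fresh snow", "seed": 42,
          "steps": 25, "guidance": 8}
key = cache_key("inference.sdxl.txt2img.v1", config)
# Look `key` up in your image store before enqueueing a new job.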

For higher-fidelity output and stronger prompt following, consider FLUX.2. For text-in-image workloads, use Recraft V4.

Job type                          Description                                        ETA
inference.sdxl.txt2img.v1         Generate an image from a text prompt               ~1.8s
inference.sdxl.img2img.v1         Transform an existing image guided by a prompt     ~1.5s
inference.sdxl.inpainting.v1      Replace a masked region of an image                ~1.7s

Common to all job types:

  • prompt (required) — text description, 3–1024 characters
  • negative_prompt — qualities or subjects to avoid, 3–1024 characters
  • style_preset — one of: 3d-model, analog-film, anime, cinematic, comic-book, craft-clay, digital-art, enhance, fantasy-art, isometric, line-art, low-poly, neon-punk, origami, photographic, pixel-art, texture
  • guidance — classifier-free guidance scale, default 8. Lower values (5–7) give more creative results, higher values (9–12) follow the prompt more strictly
  • seed — integer for reproducible output; omit for random
  • steps — diffusion steps, 1–1000, default 25. 20–30 is the sweet spot
  • refiner — boolean, default false. Adds the SDXL refiner stage for sharper detail
  • conditioner — boolean, default false. Enables additional conditioning signals
  • noise — noise scale, default 1
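
If you assemble configs programmatically, it is worth enforcing the ranges above before submitting. A minimal sketch with the documented defaults (the helper itself is illustrative, not part of the API):

# Builds a config dict for the common parameters, enforcing the
# documented ranges and defaults. Raises early instead of burning a job.
STYLE_PRESETS = {
    "3d-model", "analog-film", "anime", "cinematic", "comic-book",
    "craft-clay", "digital-art", "enhance", "fantasy-art", "isometric",
    "line-art", "low-poly", "neon-punk", "origami", "photographic",
    "pixel-art", "texture",
}

def sdxl_config(prompt, negative_prompt=None, style_preset=None,
                guidance=8, seed=None, steps=25, refiner=False):
    if not 3 <= len(prompt) <= 1024:
        raise ValueError("prompt must be 3-1024 characters")
    if negative_prompt is not None and not 3 <= len(negative_prompt) <= 1024:
        raise ValueError("negative_prompt must be 3-1024 characters")
    if style_preset is not None and style_preset not in STYLE_PRESETS:
        raise ValueError(f"unknown style_preset: {style_preset!r}")
    if not 1 <= steps <= 1000:
        raise ValueError("steps must be 1-1000")
    config = {"prompt": prompt, "guidance": guidance,
              "steps": steps, "refiner": refiner}
    if negative_prompt is not None:
        config["negative_prompt"] = negative_prompt
    if style_preset is not None:
        config["style_preset"] = style_preset
    if seed is not None:
        config["seed"] = seed
    return config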

txt2img only:

  • width, height — output dimensions, 512–1536, default 1024. SDXL was trained at 1024x1024 and works best at that resolution

img2img and inpainting:

  • strength — how much the input image is altered, 0–1, default 0.6. Lower values stay closer to the input

Inputs:

  • img2img and inpainting accept an input image (PNG, JPEG, or WebP, 256x256 to 1536x1536, max 10 MB)
  • inpainting additionally requires a mask image — white pixels are regenerated, black pixels are preserved
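
Masks are easy to generate programmatically. A minimal Pillow sketch that marks a rectangular region for regeneration (white = regenerate, black = preserve, per the rule above; the box coordinates are placeholders):

from PIL import Image, ImageDraw

def rect_mask(size=(1024, 1024), box=(300, 400, 700, 800)):
    # Start all-black: every pixel is preserved by default.
    mask = Image.new("L", size, 0)
    # Paint the region to regenerate in white.
    ImageDraw.Draw(mask).rectangle(box, fill=255)
    return mask

rect_mask().save("mask.png")  # submit alongside the input image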

Tips:

  • Lead with the subject: SDXL pays more attention to the start of the prompt. Put the main subject and key adjectives first, scene details after
  • Combine style preset with prompt: style_preset sets the overall aesthetic — your prompt can still describe subject and composition. Prefer the preset over hand-written style words like “in the style of anime”
  • Use negative prompts surgically: terms like blurry, low quality, distorted, watermark, text are reliable defaults. Avoid stuffing too many terms or you’ll wash out the signal
  • Tune guidance for tone: drop to 6 for more creative variation, raise to 10 for tighter prompt adherence. Above 12, images often look over-saturated (a guidance-sweep sketch follows this list)
  • Stick to 1024x1024: SDXL was trained at 1024 square. Other supported sizes work but may show composition drift, especially at the extremes (512 or 1536)
  • Enable the refiner for portraits and close-ups: refiner: true notably improves skin, eyes, and fine textures. It adds latency, so leave it off for thumbnails and previews
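
As suggested in the guidance tip above, the cheapest way to find the right value is a sweep: hold prompt, seed, and steps fixed and vary only guidance. A minimal sketch (submit_job is a hypothetical stand-in for your own client):

# Guidance sweep: fixed prompt/seed/steps isolate the effect of guidance,
# so the outputs are directly comparable side by side.
BASE = {
    "prompt": "a lighthouse on a cliff at dusk, dramatic sky",
    "seed": 7,
    "steps": 25,
}

for guidance in (5, 6, 8, 10, 12):
    job = {
        "type": "inference.sdxl.txt2img.v1",
        "config": {**BASE, "guidance": guidance},
    }
    submit_job(job)  # hypothetical client call; replace with your own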

Text to image with a style preset and seed:

{
  "type": "inference.sdxl.txt2img.v1",
  "config": {
    "prompt": "a majestic snow leopard perched on a frozen mountain ledge, sweeping vista of snowy peaks behind, golden hour, sharp focus",
    "negative_prompt": "blurry, low quality, distorted, painting",
    "style_preset": "photographic",
    "width": 1024,
    "height": 1024,
    "seed": 42,
    "refiner": true
  }
}

Image-to-image transformation with low strength to preserve composition:

{
  "type": "inference.sdxl.img2img.v1",
  "config": {
    "prompt": "a watercolour painting of the same scene, soft pastel tones, visible brush strokes",
    "style_preset": "analog-film",
    "strength": 0.45,
    "guidance": 7
  }
}

Inpainting a masked region:

{
  "type": "inference.sdxl.inpainting.v1",
  "config": {
    "prompt": "a vase of fresh sunflowers on the table",
    "negative_prompt": "blurry, distorted",
    "strength": 0.85,
    "steps": 30
  }
}