# SDXL
SDXL (Stable Diffusion XL) is the model to reach for when you need fast, cheap image generation with fine-grained control. It produces 1024x1024 images in under two seconds and costs a fraction of a cent per image, making it well-suited to high-volume workflows, experimentation, and applications where iteration speed matters more than absolute fidelity. It also exposes more knobs than most newer models — explicit steps, guidance, seed, and negative_prompt parameters give you direct control over the diffusion process.
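A minimal txt2img request using those knobs might look like this (the prompt and parameter values are illustrative):

```json
{
  "type": "inference.sdxl.txt2img.v1",
  "config": {
    "prompt": "a lighthouse on a rocky coast at dusk, dramatic clouds",
    "negative_prompt": "blurry, low quality",
    "steps": 25,
    "guidance": 8,
    "seed": 1234
  }
}
```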
## Architecture
SDXL is a latent-diffusion model built around a U-Net (~2.6B parameters) operating in the latent space of a VAE. It uses two text encoders in parallel — OpenAI’s CLIP ViT-L and the larger OpenCLIP ViT-bigG — and concatenates their outputs to form a richer text embedding (~3.5B total parameters across the pipeline). This dual-encoder setup is the main reason SDXL produces noticeably better composition and prompt adherence than its Stable Diffusion 1.5 predecessor while remaining small enough to run on a single consumer GPU.
The model includes an optional refiner stage — a second U-Net specialised for the final denoising steps. Enabling `refiner: true` improves fine detail (skin texture, foliage, fabric) at the cost of extra latency.
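For example, a minimal txt2img config with the refiner switched on (prompt illustrative):

```json
{
  "type": "inference.sdxl.txt2img.v1",
  "config": {
    "prompt": "close-up portrait of a violinist under stage light, detailed skin texture",
    "refiner": true
  }
}
```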
## When to use SDXL
- High-volume generation — sub-2s ETAs and ~$0.002 per image make SDXL viable for batch jobs, A/B tests, and user-facing prototypes
- Style-controlled output — the `style_preset` parameter applies one of 17 baked-in styles (anime, photographic, neon-punk, line-art, etc.) without hand-crafting prompts
- Negative prompting — explicit `negative_prompt` support is rare among newer models; useful when you need to exclude specific subjects, artifacts, or qualities
- Image editing without retraining — `img2img` and `inpainting` variants share the same backbone, so style and quality stay consistent across the workflow
- Reproducibility-critical work — the `seed` parameter combined with deterministic `steps` makes SDXL easy to use for caching, regression tests, and side-by-side comparisons (see the sketch after this list)
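As a sketch of the reproducibility point, pinning `seed` and `steps` while varying a single parameter (here `style_preset`; prompt and values illustrative) isolates that change for side-by-side comparison:

```json
{
  "type": "inference.sdxl.txt2img.v1",
  "config": {
    "prompt": "a red bicycle leaning against a brick wall",
    "style_preset": "photographic",
    "seed": 7,
    "steps": 25
  }
}
```

Re-running this config reproduces the same image; swapping the preset for `line-art` should change only the styling, which is what makes caching and regression baselines practical.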
For higher-fidelity output and stronger prompt following, consider FLUX.2. For text-in-image workloads, use Recraft V4.
## Job types
| Job type | Description | ETA |
|---|---|---|
| `inference.sdxl.txt2img.v1` | Generate an image from a text prompt | ~1.8s |
| `inference.sdxl.img2img.v1` | Transform an existing image guided by a prompt | ~1.5s |
| `inference.sdxl.inpainting.v1` | Replace a masked region of an image | ~1.7s |
## Parameters
Common to all job types:
- `prompt` (required) — text description, 3–1024 characters
- `negative_prompt` — qualities or subjects to avoid, 3–1024 characters
- `style_preset` — one of: `3d-model`, `analog-film`, `anime`, `cinematic`, `comic-book`, `craft-clay`, `digital-art`, `enhance`, `fantasy-art`, `isometric`, `line-art`, `low-poly`, `neon-punk`, `origami`, `photographic`, `pixel-art`, `texture`
- `guidance` — classifier-free guidance scale, default `8`. Lower values (5–7) give more creative results, higher values (9–12) follow the prompt more strictly
- `seed` — integer for reproducible output; omit for random
- `steps` — diffusion steps, 1–1000, default `25`. 20–30 is the sweet spot
- `refiner` — boolean, default `false`. Adds the SDXL refiner stage for sharper detail
- `conditioner` — boolean, default `false`. Enables additional conditioning signals
- `noise` — noise scale, default `1`
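Put together, a request that sets every common parameter explicitly looks like this (prompt and values are illustrative; `refiner`, `conditioner`, and `noise` are shown at their documented defaults):

```json
{
  "type": "inference.sdxl.txt2img.v1",
  "config": {
    "prompt": "an isometric diorama of a tiny bakery, warm interior lighting",
    "negative_prompt": "blurry, watermark",
    "style_preset": "isometric",
    "guidance": 8,
    "seed": 99,
    "steps": 25,
    "refiner": false,
    "conditioner": false,
    "noise": 1
  }
}
```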
`txt2img` only:

- `width`, `height` — output dimensions, 512–1536, default `1024`. SDXL was trained at 1024x1024 and works best at that resolution
`img2img` and `inpainting`:

- `strength` — how much the input image is altered, 0–1, default `0.6`. Lower values stay closer to the input
Inputs:

- `img2img` and `inpainting` accept an input image (PNG, JPEG, or WebP, 256x256 to 1536x1536, max 10 MB)
- `inpainting` additionally requires a mask image — white pixels are regenerated, black pixels are preserved
## Prompting tips
- Lead with the subject: SDXL pays more attention to the start of the prompt. Put the main subject and key adjectives first, scene details after
- Combine style preset with prompt: `style_preset` sets the overall aesthetic — your prompt can still describe subject and composition. Prefer the preset over hand-written style words like “in the style of anime”
- Use negative prompts surgically: terms like `blurry, low quality, distorted, watermark, text` are reliable defaults. Avoid stuffing too many terms or you’ll wash out the signal
- Tune `guidance` for tone: drop to `6` for more creative variation, raise to `10` for tighter prompt adherence. Above `12`, images often look over-saturated (see the sketch after this list)
- Stick to 1024x1024: SDXL was trained at 1024 square. Other supported sizes work but may show composition drift, especially at the extremes (512 or 1536)
- Enable the refiner for portraits and close-ups: `refiner: true` notably improves skin, eyes, and fine textures. It adds latency, so leave it off for thumbnails and previews
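As a sketch of the `guidance` tip, here is one end of the useful range with everything else held fixed (prompt and seed illustrative):

```json
{
  "type": "inference.sdxl.txt2img.v1",
  "config": {
    "prompt": "a fox in a misty birch forest, morning light",
    "seed": 3,
    "guidance": 6
  }
}
```

Re-running with `"guidance": 10` and the same seed gives a directly comparable image that follows the prompt more strictly.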
## Examples
Text to image with a style preset and seed:
{ "type": "inference.sdxl.txt2img.v1", "config": { "prompt": "a majestic snow leopard perched on a frozen mountain ledge, sweeping vista of snowy peaks behind, golden hour, sharp focus", "negative_prompt": "blurry, low quality, distorted, painting", "style_preset": "photographic", "width": 1024, "height": 1024, "seed": 42, "refiner": true }}Image-to-image transformation with low strength to preserve composition:
{ "type": "inference.sdxl.img2img.v1", "config": { "prompt": "a watercolour painting of the same scene, soft pastel tones, visible brush strokes", "style_preset": "analog-film", "strength": 0.45, "guidance": 7 }}Inpainting a masked region:
{ "type": "inference.sdxl.inpainting.v1", "config": { "prompt": "a vase of fresh sunflowers on the table", "negative_prompt": "blurry, distorted", "strength": 0.85, "steps": 30 }}