SDXL

SDXL (Stable Diffusion XL) is the model to reach for when you need fast, cheap image generation with fine-grained control. It produces 1024x1024 images in under two seconds and costs a fraction of a cent per image, making it well-suited to high-volume workflows, experimentation, and applications where iteration speed matters more than absolute fidelity. It also exposes more knobs than most newer models — explicit steps, guidance, seed, and negative_prompt parameters give you direct control over the diffusion process.

SDXL is a latent-diffusion model built around a U-Net (~2.6B parameters) operating in the latent space of a VAE. It uses two text encoders in parallel — OpenAI’s CLIP ViT-L and the larger OpenCLIP ViT-bigG — and concatenates their outputs to form a richer text embedding (~3.5B total parameters across the pipeline). This dual-encoder setup is the main reason SDXL produces noticeably better composition and prompt adherence than its 1.5 predecessor while remaining small enough to run on a single consumer GPU.

The model includes an optional refiner stage — a second U-Net specialised for the final denoising steps. Enabling refiner: true improves fine detail (skin texture, foliage, fabric) at the cost of extra latency.
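
The hosted job types below hide all of this, but the same architecture is published as open weights. Purely as an illustration of the base-then-refiner handoff (not part of this API), here is a minimal sketch using the open-source diffusers library and the public Stability AI model IDs:

# Illustrative only: runs SDXL base for most of the denoising schedule,
# then hands the latents to the refiner for the final steps.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a majestic snow leopard on a frozen mountain ledge"
# Base covers the first 80% of the schedule and emits latents.
latents = base(prompt=prompt, num_inference_steps=25,
               denoising_end=0.8, output_type="latent").images
# Refiner picks up at the same point and finishes the denoising.
image = refiner(prompt=prompt, num_inference_steps=25,
                denoising_start=0.8, image=latents).images[0]
image.save("snow_leopard.png")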

SDXL is a good fit for:

  • High-volume generation — sub-2s ETAs and ~$0.002 per image make SDXL viable for batch jobs, A/B tests, and user-facing prototypes
  • Style-controlled output — the style_preset parameter applies one of 17 baked-in styles (anime, photographic, neon-punk, line-art, etc.) without hand-crafting prompts
  • Negative prompting — explicit negative_prompt support is rare among newer models; useful when you need to exclude specific subjects, artifacts, or qualities
  • Image editing without retraining — img2img and inpainting variants share the same backbone, so style and quality stay consistent across the workflow
  • Reproducibility-critical work — the seed parameter combined with deterministic steps makes SDXL easy to use for caching, regression tests, and side-by-side comparisons
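
The reproducibility point is worth operationalising: with seed and the rest of the config fixed, identical configs produce identical images, so the config itself can serve as a cache key. A minimal sketch (the image store and job client are yours to supply):

# Derives a stable cache key from a job config. Because a fixed seed makes
# generation deterministic, identical configs can be served from cache.
import hashlib
import json

def cache_key(job_type, config):
    # Canonicalise so key order in the dict doesn't change the hash.
    payload = json.dumps({"type": job_type, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

config = {"prompt": "a red fox in fresh snow", "seed": 42,
          "steps": 25, "guidance": 8}
key = cache_key("inference.sdxl.txt2img.v1", config)
# Look `key` up in your image store before enqueueing a new job.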

For higher-fidelity output and stronger prompt following, consider FLUX.2. For text-in-image workloads, use Recraft V4.

Job type                          Description                                        ETA
inference.sdxl.txt2img.v1         Generate an image from a text prompt               ~1.8s
inference.sdxl.img2img.v1         Transform an existing image guided by a prompt     ~1.5s
inference.sdxl.inpainting.v1      Replace a masked region of an image                ~1.7s

Common to all job types:

  • prompt (required) — text description, 3–1024 characters
  • negative_prompt — qualities or subjects to avoid, 3–1024 characters
  • style_preset — one of: 3d-model, analog-film, anime, cinematic, comic-book, craft-clay, digital-art, enhance, fantasy-art, isometric, line-art, low-poly, neon-punk, origami, photographic, pixel-art, texture
  • guidance — classifier-free guidance scale, default 8. Lower values (5–7) give more creative results, higher values (9–12) follow the prompt more strictly
  • seed — integer for reproducible output; omit for random
  • steps — diffusion steps, 1–1000, default 25. 20–30 is the sweet spot
  • refiner — boolean, default false. Adds the SDXL refiner stage for sharper detail
  • conditioner — boolean, default false. Enables additional conditioning signals
  • noise — noise scale, default 1
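
If you assemble configs programmatically, it is worth enforcing the ranges above before submitting. A minimal sketch with the documented defaults (the helper itself is illustrative, not part of the API):

# Builds a config dict for the common parameters, enforcing the
# documented ranges and defaults. Raises early instead of burning a job.
STYLE_PRESETS = {
    "3d-model", "analog-film", "anime", "cinematic", "comic-book",
    "craft-clay", "digital-art", "enhance", "fantasy-art", "isometric",
    "line-art", "low-poly", "neon-punk", "origami", "photographic",
    "pixel-art", "texture",
}

def sdxl_config(prompt, negative_prompt=None, style_preset=None,
                guidance=8, seed=None, steps=25, refiner=False):
    if not 3 <= len(prompt) <= 1024:
        raise ValueError("prompt must be 3-1024 characters")
    if negative_prompt is not None and not 3 <= len(negative_prompt) <= 1024:
        raise ValueError("negative_prompt must be 3-1024 characters")
    if style_preset is not None and style_preset not in STYLE_PRESETS:
        raise ValueError(f"unknown style_preset: {style_preset!r}")
    if not 1 <= steps <= 1000:
        raise ValueError("steps must be 1-1000")
    config = {"prompt": prompt, "guidance": guidance,
              "steps": steps, "refiner": refiner}
    if negative_prompt is not None:
        config["negative_prompt"] = negative_prompt
    if style_preset is not None:
        config["style_preset"] = style_preset
    if seed is not None:
        config["seed"] = seed
    return config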

txt2img only:

  • width, height — output dimensions, 512–1536, default 1024. SDXL was trained at 1024x1024 and works best at that resolution

img2img and inpainting:

  • strength — how much the input image is altered, 0–1, default 0.6. Lower values stay closer to the input

Inputs:

  • img2img and inpainting accept an input image (PNG, JPEG, or WebP, 256x256 to 1536x1536, max 10 MB)
  • inpainting additionally requires a mask image — white pixels are regenerated, black pixels are preserved
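
Masks are easy to generate programmatically. A minimal Pillow sketch that marks a rectangular region for regeneration (white = regenerate, black = preserve, per the rule above; the box coordinates are placeholders):

from PIL import Image, ImageDraw

def rect_mask(size=(1024, 1024), box=(300, 400, 700, 800)):
    # Start all-black: every pixel is preserved by default.
    mask = Image.new("L", size, 0)
    # Paint the region to regenerate in white.
    ImageDraw.Draw(mask).rectangle(box, fill=255)
    return mask

rect_mask().save("mask.png")  # submit alongside the input image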

Tips:

  • Lead with the subject: SDXL pays more attention to the start of the prompt. Put the main subject and key adjectives first, scene details after
  • Combine style preset with prompt: style_preset sets the overall aesthetic — your prompt can still describe subject and composition. Prefer the preset over hand-written style words like “in the style of anime”
  • Use negative prompts surgically: terms like blurry, low quality, distorted, watermark, text are reliable defaults. Avoid stuffing too many terms or you’ll wash out the signal
  • Tune guidance for tone: drop to 6 for more creative variation, raise to 10 for tighter prompt adherence. Above 12, images often look over-saturated (a guidance-sweep sketch follows this list)
  • Stick to 1024x1024: SDXL was trained at 1024 square. Other supported sizes work but may show composition drift, especially at the extremes (512 or 1536)
  • Enable the refiner for portraits and close-ups: refiner: true notably improves skin, eyes, and fine textures. It adds latency, so leave it off for thumbnails and previews
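
As suggested in the guidance tip above, the cheapest way to find the right value is a sweep: hold prompt, seed, and steps fixed and vary only guidance. A minimal sketch (submit_job is a hypothetical stand-in for your own client):

# Guidance sweep: fixed prompt/seed/steps isolate the effect of guidance,
# so the outputs are directly comparable side by side.
BASE = {
    "prompt": "a lighthouse on a cliff at dusk, dramatic sky",
    "seed": 7,
    "steps": 25,
}

for guidance in (5, 6, 8, 10, 12):
    job = {
        "type": "inference.sdxl.txt2img.v1",
        "config": {**BASE, "guidance": guidance},
    }
    submit_job(job)  # hypothetical client call; replace with your own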

Text to image with a style preset and seed:

{
  "type": "inference.sdxl.txt2img.v1",
  "config": {
    "prompt": "a majestic snow leopard perched on a frozen mountain ledge, sweeping vista of snowy peaks behind, golden hour, sharp focus",
    "negative_prompt": "blurry, low quality, distorted, painting",
    "style_preset": "photographic",
    "width": 1024,
    "height": 1024,
    "seed": 42,
    "refiner": true
  }
}

Image-to-image transformation with low strength to preserve composition:

{
  "type": "inference.sdxl.img2img.v1",
  "config": {
    "prompt": "a watercolour painting of the same scene, soft pastel tones, visible brush strokes",
    "style_preset": "analog-film",
    "strength": 0.45,
    "guidance": 7
  }
}

Inpainting a masked region:

{
  "type": "inference.sdxl.inpainting.v1",
  "config": {
    "prompt": "a vase of fresh sunflowers on the table",
    "negative_prompt": "blurry, distorted",
    "strength": 0.85,
    "steps": 30
  }
}