
Segmenting Videos

The video segmentation endpoint uses Meta’s SAM 3 Video Predictor to detect and track objects across an mp4. You provide a single text prompt describing what to segment — SAM 3 finds matching objects in the first frame and tracks them forwards and backwards through the video. The output is a single mp4 with the same resolution and fps as the input.

The job type is inference.sam3.segment.video.v1.

This is useful for:

  • Creating per-frame object masks for VFX or rotoscoping
  • Tracking subjects across a clip without manual keyframing
  • Generating overlays that visualize what a model is “seeing” in a video

The examples below use a sample reef clip hosted at docs.prodia.com/fish.mp4. Swap in your own mp4 by changing the input path.

The default mask mode returns a black-and-white mp4 — white pixels mark any frame region covered by a detected object, everything else is black.

main.js

```javascript
import fs from "node:fs/promises";
import { createProdia } from "prodia/v2";

const prodia = createProdia({
  token: process.env.PRODIA_TOKEN,
});

const inputBuffer = await (
  await fetch("https://docs.prodia.com/fish.mp4")
).arrayBuffer();

console.log("Segmenting video...");

const job = await prodia.job(
  {
    type: "inference.sam3.segment.video.v1",
    config: {
      prompt: "fish",
    },
  },
  { accept: "video/mp4", inputs: [new Uint8Array(inputBuffer)] }
);

const video = await job.arrayBuffer();
await fs.writeFile("mask.mp4", new Uint8Array(video));
console.log("Saved mask.mp4");
```
Terminal window

```sh
node main.js
```

overlay mode composites the colored masks, bounding boxes, and id=<N>, p=<score> labels from the SAM 3 visualization over the original video. This matches the SAM 3 README example output and is useful for previewing what the model detected.

main.js

```javascript
import fs from "node:fs/promises";
import { createProdia } from "prodia/v2";

const prodia = createProdia({
  token: process.env.PRODIA_TOKEN,
});

const inputBuffer = await (
  await fetch("https://docs.prodia.com/fish.mp4")
).arrayBuffer();

const job = await prodia.job(
  {
    type: "inference.sam3.segment.video.v1",
    config: {
      prompt: "fish",
      mode: "overlay",
      alpha: 0.5,
    },
  },
  { accept: "video/mp4", inputs: [new Uint8Array(inputBuffer)] }
);

const video = await job.arrayBuffer();
await fs.writeFile("overlay.mp4", new Uint8Array(video));
```
Terminal window

```sh
node main.js
```
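The alpha value controls how strongly the colored masks show through the original footage. Overlay compositing of this kind is standard per-pixel alpha blending; the sketch below is purely illustrative (the blendPixel helper is hypothetical, not part of the prodia SDK) and shows the math for a single RGB pixel:

```javascript
// Hypothetical per-pixel alpha blend, for illustration only:
// result = alpha * overlayColor + (1 - alpha) * inputColor
function blendPixel(inputRgb, overlayRgb, alpha) {
  return inputRgb.map((channel, i) =>
    Math.round(alpha * overlayRgb[i] + (1 - alpha) * channel)
  );
}

// A pure-red mask pixel over a mid-gray video pixel at alpha 0.5:
console.log(blendPixel([128, 128, 128], [255, 0, 0], 0.5)); // [192, 64, 64]
```

At alpha 1.0 the mask fully replaces the input pixel; at 0.0 the overlay is invisible.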

Raise confidence_threshold to suppress weakly-matched objects. SAM 3 attaches a per-object score to every detection; objects below the threshold are dropped before the mask is merged or rendered.

main.js

```javascript
const job = await prodia.job(
  {
    type: "inference.sam3.segment.video.v1",
    config: {
      prompt: "fish",
      confidence_threshold: 0.9,
    },
  },
  { accept: "video/mp4", inputs: [new Uint8Array(inputBuffer)] }
);
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prompt | string | (required) | Text describing what to segment and track across the video (1–500 characters). |
| confidence_threshold | number | 0.5 | Minimum SAM 3 score an object must reach to be kept (0.0–1.0). Lower values keep more detections. |
| mode | enum | mask | mask returns a merged black-and-white mp4. overlay returns the colored SAM 3 visualization composited over the input. |
| alpha | number | 0.5 | Mask alpha used when mode is overlay; ignored otherwise (0.0–1.0). |
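Putting the parameters together, a request body using every documented option might look like this (the specific values are illustrative, not recommendations):

```javascript
// Illustrative request combining all documented config parameters.
const request = {
  type: "inference.sam3.segment.video.v1",
  config: {
    prompt: "fish",            // required, 1–500 characters
    confidence_threshold: 0.7, // drop detections scoring below 0.7
    mode: "overlay",           // "mask" (default) or "overlay"
    alpha: 0.4,                // overlay opacity; ignored in mask mode
  },
};

console.log(request.config.mode); // overlay
```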
| Constraint | Value |
| --- | --- |
| Accepted formats | MP4 (video/mp4) |
| Maximum file size | 100 MB |
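Since uploads are capped at 100 MB, it can be worth checking the buffer size client-side before submitting a job. This guard is a local convenience sketch, not part of the prodia SDK:

```javascript
// Hypothetical client-side guard against the documented 100 MB cap.
const MAX_INPUT_BYTES = 100 * 1024 * 1024;

function assertWithinSizeLimit(buffer) {
  if (buffer.byteLength > MAX_INPUT_BYTES) {
    throw new Error(
      `Input is ${buffer.byteLength} bytes; the endpoint accepts at most ${MAX_INPUT_BYTES}.`
    );
  }
  return buffer;
}

// A small buffer passes through unchanged; an oversized one would throw.
assertWithinSizeLimit(new ArrayBuffer(1024));
console.log("within limit");
```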

Input resolution and fps are preserved in the output. Common sizes (832×480, 1280×720) are warmed into the torch.compile cache at startup, so they encode fastest.