Image Captioning - Perceptron Docs

The caption() helper produces text descriptions from images. Use captioning to create accessibility text, generate metadata, or build visual search features.

Basic usage

from perceptron import caption, image

image_path = "image.jpg"   # placeholder: replace with your own image path or URL

result = caption(
    image(image_path),     # ImageNode wrapping a path/URL/bytes
    style="concise",       # str: "concise" | "detailed"
    expects="text",        # str: "text" | "box" | "point"
    reasoning=True,        # bool: enable reasoning and include chain-of-thought
)

print(result.reasoning)    # Chain-of-thought (None when reasoning=False)
print(result.text)         # The caption

# When expects="box", access grounded snippets via result.boxes
for box in result.boxes or []:
    print(box.mention, box)

Parameters:

Parameter	Type	Default	Description
`media_obj`	`MediaNode`	-	Wrap your image (path, URL, or bytes) with `image()`.
`style`	`str`	`"concise"`	`"concise"` for short summaries, `"detailed"` for rich narratives
`expects`	`str`	`"text"`	`"text"` for caption only, `"box"` for caption + boxes, `"point"` for caption + points
`reasoning`	`bool`	`False`	Set `True` to enable reasoning and include the model’s chain-of-thought

Returns: a PerceiveResult object with the following fields:

text (str): The generated caption.
reasoning (str | None): Chain-of-thought when reasoning=True.
boxes, points (list | None): Populated according to the expects value you requested. The boxes_to_pixels() and points_to_pixels() methods convert normalized coordinates to pixel coordinates.

Example: grounded captions

In this example, we download a suburban street image and generate a concise caption with grounded bounding boxes. The model returns short prose along with boxes that correspond to specific regions mentioned in the caption. Each box includes a mention field containing the text snippet that describes that region.

from pathlib import Path
from urllib.request import urlretrieve

from perceptron import caption, configure, image
from PIL import Image as PILImage, ImageDraw

configure(
    provider="perceptron",
    model="isaac-0.2-2b-preview",
    api_key="YOUR_API_KEY",
)

# Download image if it doesn't exist
IMAGE_URL = "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/main/cookbook/_shared/assets/capabilities/caption/suburban_street.webp"
IMAGE_PATH = Path("suburban_street.webp")
ANNOTATED_PATH = Path("suburban_street_annotated.png")

if not IMAGE_PATH.exists():
    urlretrieve(IMAGE_URL, IMAGE_PATH)

# Generate concise caption with bounding boxes
result = caption(
    image(str(IMAGE_PATH)),
    style="concise",
    expects="box",
)

print(result.text)

# Draw bounding boxes on the image
img = PILImage.open(IMAGE_PATH).convert("RGB")
draw = ImageDraw.Draw(img)

pixel_boxes = result.boxes_to_pixels(width=img.width, height=img.height) or []
print(f"Found {len(pixel_boxes)} grounded regions")

for box in pixel_boxes:
    draw.rectangle(
        [int(box.top_left.x), int(box.top_left.y), int(box.bottom_right.x), int(box.bottom_right.y)],
        outline="orange",
        width=3,
    )
    label = box.mention or "region"
    draw.text((int(box.top_left.x), max(int(box.top_left.y) - 18, 0)), label, fill="orange")

img.save(ANNOTATED_PATH)
print(f"Saved annotated image to {ANNOTATED_PATH}")

All spatial outputs use a 0-1000 normalized coordinate system. Convert them with result.boxes_to_pixels(width, height) or result.points_to_pixels(width, height) before rendering overlays. See the coordinate system guide for more patterns.
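To make the 0-1000 convention concrete, here is a minimal sketch of the arithmetic such a conversion involves. It assumes a plain linear scaling with clamping and rounding to the nearest pixel; the names `normalized_to_pixel`, `box_to_pixels`, and `PixelBox` are hypothetical and are not part of the Perceptron SDK, whose own helpers may round or clamp differently.

```python
from typing import NamedTuple

NORMALIZED_MAX = 1000  # spatial outputs use a 0-1000 grid, per the text above


class PixelBox(NamedTuple):
    """Hypothetical container for a box in pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int


def normalized_to_pixel(value: float, size: int) -> int:
    """Scale one 0-1000 coordinate to a pixel offset along an axis of `size` pixels."""
    # Assumption: out-of-range values are clamped to the grid before scaling.
    clamped = min(max(value, 0), NORMALIZED_MAX)
    return round(clamped * size / NORMALIZED_MAX)


def box_to_pixels(x1: float, y1: float, x2: float, y2: float,
                  width: int, height: int) -> PixelBox:
    """Convert a normalized box to pixels: x scales by width, y by height."""
    return PixelBox(
        normalized_to_pixel(x1, width),
        normalized_to_pixel(y1, height),
        normalized_to_pixel(x2, width),
        normalized_to_pixel(y2, height),
    )


# A box covering the middle half of a 1920x1080 frame horizontally.
print(box_to_pixels(250, 100, 750, 900, width=1920, height=1080))
# PixelBox(left=480, top=108, right=1440, bottom=972)
```

In practice, prefer the result object's own conversion methods; the sketch is only meant to show why the image width and height must be supplied.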

CLI usage

perceptron caption <image_path_or_url> [--style concise|detailed] [--expects text|box|point]

Examples:

# Generate a concise caption
perceptron caption image.jpg --style concise

# Generate a detailed caption with bounding boxes
perceptron caption image.jpg --style detailed --expects box

The CLI auto-detects video paths (.mp4) and routes them to a video() node.
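When captioning many files from a script, it can help to assemble the CLI invocation programmatically. The sketch below only builds the argument list from the synopsis above and validates the two flags; `build_caption_command` is a hypothetical helper, not part of the Perceptron package, and it assumes the `perceptron` executable is on your PATH when you eventually run the command.

```python
import shlex

# Allowed values, taken from the CLI synopsis above.
STYLES = ("concise", "detailed")
EXPECTS = ("text", "box", "point")


def build_caption_command(source: str, style: str = "concise",
                          expects: str = "text") -> list[str]:
    """Return the argv list for `perceptron caption` with validated flags."""
    if style not in STYLES:
        raise ValueError(f"style must be one of {STYLES}, got {style!r}")
    if expects not in EXPECTS:
        raise ValueError(f"expects must be one of {EXPECTS}, got {expects!r}")
    return ["perceptron", "caption", source, "--style", style, "--expects", expects]


cmd = build_caption_command("image.jpg", style="detailed", expects="box")
print(shlex.join(cmd))
# perceptron caption image.jpg --style detailed --expects box
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.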

Best practices

Structured outputs: Perceptron can return formatted data when you specify the format up front. For example: “Describe the people in the image as JSON with keys hair_color, shirt_color, person_type.”
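Once a JSON-shaped response comes back as text, it is worth validating it before use. The following is a minimal sketch using only the standard library; `parse_people_json` and the sample string are hypothetical, and the sketch assumes the response may carry some surrounding prose around the JSON. How the prompt is passed to the model is not shown here.

```python
import json

# Keys requested in the example prompt above.
REQUIRED_KEYS = {"hair_color", "shirt_color", "person_type"}


def parse_people_json(text: str) -> list[dict]:
    """Extract the JSON payload from model text and check the requested keys."""
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end < min(starts):
        raise ValueError("no JSON object or array found in response")
    data = json.loads(text[min(starts):end + 1])
    # Accept either a single object or a list of objects.
    people = data if isinstance(data, list) else [data]
    for person in people:
        if not isinstance(person, dict):
            raise ValueError(f"expected an object, got {type(person).__name__}")
        missing = REQUIRED_KEYS - person.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
    return people


# Hypothetical response text, for illustration only.
sample = (
    'Here is the JSON: [{"hair_color": "brown", '
    '"shirt_color": "blue", "person_type": "pedestrian"}]'
)
for person in parse_people_json(sample):
    print(person["person_type"], person["shirt_color"])
```

Failing loudly on missing keys keeps malformed responses from silently flowing into downstream metadata or search indexes.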

Run through the full Jupyter notebook here. Reach out to Perceptron support if you have questions.
