
The caption() helper produces text descriptions from images. Use captioning to create accessibility text, generate metadata, or build visual search features.

Basic usage

from perceptron import caption, image

image_path = "photo.jpg"  # any local path, URL, or raw bytes

result = caption(
    image(image_path),     # ImageNode wrapping a path/URL/bytes
    style="concise",       # str: "concise" | "detailed"
    expects="text",        # str: "text" | "box" | "point"
    reasoning=True,        # bool: enable reasoning and include chain-of-thought
)

print(result.reasoning)    # Chain-of-thought (None when reasoning=False)
print(result.text)         # The caption

# When expects="box", access grounded snippets via result.boxes
for box in result.boxes or []:
    print(box.mention, box)
Parameters:
  • media_obj (MediaNode, required): Wrap your image (path, URL, or bytes) with image().
  • style (str, default "concise"): "concise" for short summaries, "detailed" for rich narratives.
  • expects (str, default "text"): "text" for caption only, "box" for caption + boxes, "point" for caption + points.
  • reasoning (bool, default False): Set True to enable reasoning and include the model's chain-of-thought.
Returns: PerceiveResult object:
  • text (str): The generated caption.
  • reasoning (str | None): Chain-of-thought when reasoning=True.
  • boxes, points (list | None): Populated based on the expects you requested. boxes_to_pixels / points_to_pixels convert normalized → pixel coordinates.
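Requesting points follows the same pattern as boxes; here is a minimal sketch (the image path is a placeholder, and we assume points carry the same mention field that boxes expose above):

from perceptron import caption, image

result = caption(
    image("photo.jpg"),    # placeholder path
    expects="point",       # caption + points instead of boxes
)

print(result.text)
for point in result.points or []:
    print(point.mention, point)   # mention assumed to mirror box.mention

# Points are normalized (0-1000); convert before drawing overlays
pixel_points = result.points_to_pixels(width=1920, height=1080) or []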

Example: grounded captions

In this example, we download a suburban street image and generate a grounded caption with interleaved text and bounding boxes. The model returns a detailed description along with boxes that correspond to specific regions mentioned in the caption; each box carries a mention field with the text snippet describing that region, interleaving prose and spatial annotations.
from pathlib import Path
from urllib.request import urlretrieve

from perceptron import caption, configure, image
from PIL import Image as PILImage, ImageDraw

configure(
    provider="perceptron",
    model="isaac-0.3-max",
    api_key="YOUR_API_KEY",
)

# Download image if it doesn't exist
IMAGE_URL = "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/main/cookbook/_shared/assets/capabilities/caption/suburban_street.webp"
IMAGE_PATH = Path("suburban_street.webp")
ANNOTATED_PATH = Path("suburban_street_annotated.png")

if not IMAGE_PATH.exists():
    urlretrieve(IMAGE_URL, IMAGE_PATH)

# Generate detailed caption with bounding boxes
result = caption(
    image(str(IMAGE_PATH)),
    style="detailed",
    expects="box",
)

print(result.text)

# Draw bounding boxes on the image
img = PILImage.open(IMAGE_PATH).convert("RGB")
draw = ImageDraw.Draw(img)

pixel_boxes = result.boxes_to_pixels(width=img.width, height=img.height) or []
print(f"Found {len(pixel_boxes)} grounded regions")

for box in pixel_boxes:
    draw.rectangle(
        [int(box.top_left.x), int(box.top_left.y), int(box.bottom_right.x), int(box.bottom_right.y)],
        outline="orange",
        width=3,
    )
    label = box.mention or "region"
    draw.text((int(box.top_left.x), max(int(box.top_left.y) - 18, 0)), label, fill="orange")

img.save(ANNOTATED_PATH)
print(f"Saved annotated image to {ANNOTATED_PATH}")
All spatial outputs use a 0-1000 normalized coordinate system. Convert via result.boxes_to_pixels(width, height) or result.points_to_pixels(width, height) before rendering overlays; see the coordinate system guide for more patterns.
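The scaling itself is linear interpolation from the 0-1000 range onto the image dimensions; a minimal sketch of the arithmetic (our own helper for illustration, not part of the SDK):

# Map a normalized 0-1000 coordinate onto a concrete image dimension
def to_pixels(value: float, extent: int) -> float:
    return value / 1000 * extent

print(to_pixels(500, 1920))  # 960.0: the horizontal midpoint of a 1920px-wide image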

CLI usage

perceptron caption <image_path_or_url> [--style concise|detailed] [--expects text|box|point]
Examples:
# Generate a concise caption
perceptron caption image.jpg --style concise

# Generate a detailed caption with bounding boxes
perceptron caption image.jpg --style detailed --expects box
The CLI auto-detects video paths (.mp4) and routes them to a video() node.
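For example, passing a video file (hypothetical filename) uses the same syntax as the image examples above:

# Caption a video clip; the .mp4 extension routes it to a video() node
perceptron caption clip.mp4 --style concise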

Best practices

  • Structured outputs: Perceptron can return formatted data when you specify it up front; for example, “Describe the people in the image as JSON with keys hair_color, shirt_color, person_type.”
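A rough sketch of that pattern in Python. Note that the question() entry point below is an assumption rather than something this page documents; substitute whichever helper accepts free-form instructions:

import json

from perceptron import image, question  # question() is assumed, not confirmed by this page

result = question(
    image("people.jpg"),  # placeholder image
    "Describe the people in the image as JSON with keys hair_color, shirt_color, person_type.",
)

people = json.loads(result.text)  # parse the structured payload out of the response text
print(people)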
Run through the full Jupyter notebook version of this guide in the cookbook. Reach out to Perceptron support if you have questions.