
The caption() helper produces text descriptions from images. Use captioning to create accessibility text, generate metadata, or build visual search features.

Basic usage

from perceptron import caption, image

image_path = "photo.jpg"  # any local path, URL, or raw bytes

result = caption(
    image(image_path),     # ImageNode wrapping a path/URL/bytes
    style="concise",       # str: "concise" | "detailed"
    expects="text",        # str: "text" | "box" | "point"
    reasoning=True,        # bool: enable reasoning and include chain-of-thought
)

print(result.reasoning)    # Chain-of-thought (None when reasoning=False)
print(result.text)         # The caption

# When expects="box", access grounded snippets via result.boxes
for box in result.boxes or []:
    print(box.mention, box)
Parameters:
  • media_obj (MediaNode, required): Wrap your image (path, URL, or bytes) with image().
  • style (str, default "concise"): "concise" for short summaries, "detailed" for rich narratives.
  • expects (str, default "text"): "text" for caption only, "box" for caption + boxes, "point" for caption + points.
  • reasoning (bool, default False): Set True to enable reasoning and include the model's chain-of-thought.
Returns: PerceiveResult object:
  • text (str): The generated caption.
  • reasoning (str | None): Chain-of-thought when reasoning=True.
  • boxes, points (list | None): Populated based on the expects you requested. boxes_to_pixels / points_to_pixels convert normalized → pixel coordinates.
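Requesting points follows the same pattern as boxes; here is a minimal sketch (the image path is a placeholder, and we assume points carry the same mention field that boxes expose above):

from perceptron import caption, image

result = caption(
    image("photo.jpg"),    # placeholder path
    expects="point",       # caption + points instead of boxes
)

print(result.text)
for point in result.points or []:
    print(point.mention, point)   # mention assumed to mirror box.mention

# Points are normalized (0-1000); convert before drawing overlays
pixel_points = result.points_to_pixels(width=1920, height=1080) or []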

Example: grounded captions

In this example, we download a suburban street image and generate a grounded caption with interleaved text and bounding boxes. The model returns a detailed description along with boxes that correspond to specific regions mentioned in the caption; each box carries a mention field with the text snippet describing that region, interleaving prose and spatial annotations.
from pathlib import Path
from urllib.request import urlretrieve

from perceptron import caption, configure, image
from PIL import Image as PILImage, ImageDraw

configure(
    provider="perceptron",
    model="isaac-0.3-max",
    api_key="YOUR_API_KEY",
)

# Download image if it doesn't exist
IMAGE_URL = "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/main/cookbook/_shared/assets/capabilities/caption/suburban_street.webp"
IMAGE_PATH = Path("suburban_street.webp")
ANNOTATED_PATH = Path("suburban_street_annotated.png")

if not IMAGE_PATH.exists():
    urlretrieve(IMAGE_URL, IMAGE_PATH)

# Generate detailed caption with bounding boxes
result = caption(
    image(str(IMAGE_PATH)),
    style="detailed",
    expects="box",
)

print(result.text)

# Draw bounding boxes on the image
img = PILImage.open(IMAGE_PATH).convert("RGB")
draw = ImageDraw.Draw(img)

pixel_boxes = result.boxes_to_pixels(width=img.width, height=img.height) or []
print(f"Found {len(pixel_boxes)} grounded regions")

for box in pixel_boxes:
    draw.rectangle(
        [int(box.top_left.x), int(box.top_left.y), int(box.bottom_right.x), int(box.bottom_right.y)],
        outline="orange",
        width=3,
    )
    label = box.mention or "region"
    draw.text((int(box.top_left.x), max(int(box.top_left.y) - 18, 0)), label, fill="orange")

img.save(ANNOTATED_PATH)
print(f"Saved annotated image to {ANNOTATED_PATH}")
All spatial outputs use a 0-1000 normalized coordinate system. Convert via result.boxes_to_pixels(width, height) or result.points_to_pixels(width, height) before rendering overlays; see the coordinate system guide for more patterns.
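The scaling itself is linear interpolation from the 0-1000 range onto the image dimensions; a minimal sketch of the arithmetic (our own helper for illustration, not part of the SDK):

# Map a normalized 0-1000 coordinate onto a concrete image dimension
def to_pixels(value: float, extent: int) -> float:
    return value / 1000 * extent

print(to_pixels(500, 1920))  # 960.0: the horizontal midpoint of a 1920px-wide image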

CLI usage

perceptron caption <image_path_or_url> [--style concise|detailed] [--expects text|box|point]
Examples:
# Generate a concise caption
perceptron caption image.jpg --style concise

# Generate a detailed caption with bounding boxes
perceptron caption image.jpg --style detailed --expects box
The CLI auto-detects video paths (.mp4) and routes them to a video() node.
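For example, passing a video file (hypothetical filename) uses the same syntax as the image examples above:

# Caption a video clip; the .mp4 extension routes it to a video() node
perceptron caption clip.mp4 --style concise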

Best practices

  • Structured outputs: Perceptron can return formatted data when you specify it up front; for example, “Describe the people in the image as JSON with keys hair_color, shirt_color, person_type.”
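A rough sketch of that pattern in Python. Note that the question() entry point below is an assumption rather than something this page documents; substitute whichever helper accepts free-form instructions:

import json

from perceptron import image, question  # question() is assumed, not confirmed by this page

result = question(
    image("people.jpg"),  # placeholder image
    "Describe the people in the image as JSON with keys hair_color, shirt_color, person_type.",
)

people = json.loads(result.text)  # parse the structured payload out of the response text
print(people)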
Run through the full Jupyter notebook version of this guide in the cookbook. Reach out to Perceptron support if you have questions.