The caption() helper produces text descriptions from images. Use captioning to create accessibility text, generate metadata, or build visual search features.

Basic usage

from perceptron import caption

# Basic caption
result = caption(
    image_path,           # str: Path to image file
    style="concise",      # str: "concise" | "detailed"
    expects="text",       # str: "text" | "box" | "point"
    reasoning=True        # bool: enable reasoning and include chain-of-thought (when supported)
)

# Access results
result.text              # str: The caption text
result.points            # list: Bounding boxes or points (if expects="box" or "point")
result.points_to_pixels(width, height)  # Convert coordinates to pixels
Parameters:
  • image_path (str, required): Path to the image file (JPG, PNG, WEBP)
  • style (str, default "concise"): "concise" for short summaries, "detailed" for rich narratives
  • expects (str, default "text"): "text" for caption only, "box" for caption + boxes, "point" for caption + points
  • reasoning (bool, default False): Set True to enable reasoning and include the model’s chain-of-thought
Returns: PerceiveResult object:
  • text (str): The generated caption
  • points (list): Bounding boxes or points (when expects="box" or "point")
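
The points_to_pixels helper listed above gives a quick path from the model's normalized output to pixel coordinates. The snippet below is a minimal sketch: the file name street.jpg is a placeholder, and the attributes on the converted boxes (mention, top_left, bottom_right) are assumed to mirror the scale_box_to_pixels objects used in the grounded example below; adjust if your version differs.

from PIL import Image
from perceptron import caption

IMAGE = "street.jpg"  # placeholder path

result = caption(IMAGE, style="concise", expects="box")
img = Image.open(IMAGE)

# Convert the normalized boxes returned by the model into pixel coordinates for this image
pixel_boxes = result.points_to_pixels(img.width, img.height) or []

for box in pixel_boxes:
    # mention carries the caption snippet that grounds this region (assumed field)
    print(box.mention, box.top_left.x, box.top_left.y, box.bottom_right.x, box.bottom_right.y)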

Example: grounded captions

In this example, we download a suburban street image and generate grounded captions with interleaved text and bounding boxes. The model returns a detailed description along with bounding boxes that correspond to specific regions mentioned in the caption. Each box includes a mention field containing the text snippet that describes that region, creating an interleaved representation of text and spatial annotations.
import os
from pathlib import Path
from urllib.request import urlretrieve

from perceptron import caption, configure
from perceptron.pointing.geometry import scale_box_to_pixels
from PIL import Image, ImageDraw

# Configure API key
configure(
    provider="perceptron",
    api_key=os.getenv("PERCEPTRON_API_KEY", "<your_api_key_here>"),
)

# Download image if it doesn't exist
IMAGE_URL = "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/main/cookbook/_shared/assets/capabilities/caption/suburban_street.webp"
IMAGE_PATH = Path("suburban_street.webp")
ANNOTATED_PATH = Path("suburban_street_annotated.png")

if not IMAGE_PATH.exists():
    urlretrieve(IMAGE_URL, IMAGE_PATH)

# Generate detailed caption with bounding boxes
result = caption(
    image_path=str(IMAGE_PATH),
    style="detailed",        # Rich, narrative description
    expects="box"            # Return caption + bounding boxes
)

print(result.text)
# Output: Detailed description of the image

# Access bounding boxes
boxes = result.points or []  # List of box objects
print(f"Found {len(boxes)} grounded regions")

# Draw bounding boxes on the image
img = Image.open(IMAGE_PATH).convert("RGB")
draw = ImageDraw.Draw(img)

for box in boxes:
    # Convert normalized coordinates (0-1000) to pixel coordinates
    scaled = scale_box_to_pixels(box, width=img.width, height=img.height)
    top_left = scaled.top_left
    bottom_right = scaled.bottom_right
    
    # Draw rectangle
    draw.rectangle(
        [int(top_left.x), int(top_left.y), int(bottom_right.x), int(bottom_right.y)],
        outline="orange",
        width=3
    )
    
    # Draw label (text snippet describing this region)
    label = box.mention or "region"
    draw.text((int(top_left.x), max(int(top_left.y) - 18, 0)), label, fill="orange")

img.save(ANNOTATED_PATH)
print(f"Saved annotated image to {ANNOTATED_PATH}")
Perceptron’s models use a 0–1000 normalized coordinate system for all spatial outputs. Convert to pixel coordinates before rendering overlays. See the coordinate system page for conversion helpers and best practices.
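
If you prefer not to pull in the geometry helper, the conversion itself is simple arithmetic over the 0–1000 range. The sketch below assumes a box expressed as a plain (x_min, y_min, x_max, y_max) tuple of normalized values; that tuple layout is for illustration only, not the library's box type.

def normalized_to_pixels(box, width, height):
    """Map a (x_min, y_min, x_max, y_max) box from the 0-1000 range to pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    return (
        int(x_min / 1000 * width),
        int(y_min / 1000 * height),
        int(x_max / 1000 * width),
        int(y_max / 1000 * height),
    )

# Example: a normalized box covering the center of a 1920x1080 image
print(normalized_to_pixels((250, 250, 750, 750), 1920, 1080))  # (480, 270, 1440, 810)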

CLI usage

perceptron caption <image_path> [--style concise|detailed] [--expects text|box|point]
Examples:
# Generate a concise caption
perceptron caption image.jpg --style concise

# Generate a detailed caption with bounding boxes
perceptron caption image.jpg --style detailed --expects box

Best practices

  • Targeted prompts: Ask for specific scene details instead of broad questions so the model knows exactly what to describe or point out.
  • Single intent per call: Issue one instruction at a time; chaining separate caption requests yields more reliable outputs than bundling multiple questions together (see the sketch after this list).
  • Explicit detail levels: Tell the model when you need richer prose (e.g., “Provide a detailed caption with spatial context”) to unlock longer, more descriptive answers.
  • Structured outputs: Perceptron can return formatted data when you specify it up front—for example, “Describe the people in the image as JSON with keys hair_color, shirt_color, person_type.”
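
To make the single-intent advice concrete, here is a minimal sketch that issues two separate caption() calls, one for short alt text and one for grounded boxes, rather than bundling both asks into a single request. The image path is a placeholder.

from perceptron import caption

IMAGE = "street.jpg"  # placeholder path

# Call 1: a short caption suitable for alt text
alt_text = caption(IMAGE, style="concise", expects="text").text

# Call 2: a rich, grounded description for overlays or visual search indexing
grounded = caption(IMAGE, style="detailed", expects="box")

print(alt_text)
print(f"{len(grounded.points or [])} grounded regions in the detailed pass")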
Run through the full Jupyter notebook here. Reach out to Perceptron support if you have questions.