Image Q&A

The question() helper takes an image() (or video()) node alongside a natural-language prompt and returns a textual answer plus optional grounded citations (points, boxes, or polygons). Use it for operator checklists, product audits, and narrated walkthroughs.

Basic usage

from perceptron import image, question

result = question(
    image(image_path),         # ImageNode wrapping a path/URL/bytes
    "What stands out?",        # str: Natural-language question
    expects="text",            # str: "text" | "point" | "box" | "polygon"
    reasoning=True,            # bool: enable reasoning and include chain-of-thought
)

print(result.reasoning)        # Chain-of-thought (None when reasoning=False)
print(result.text)

# Access grounded evidence (bucket depends on `expects`)
for box in result.boxes or []:
    print(box.mention, box)
Parameters:
  • media_obj (MediaNode): Wrap your image (path, URL, or bytes) with image(). For video inputs use video() and see the Video Q&A page.
  • question_text (str): The question to ask about the scene.
  • expects (str, default "text"): Desired output structure for the SDK ("text", "point", "box", "polygon").
  • reasoning (bool, default False): Set True to enable reasoning and include the model's chain-of-thought.
  • format (str, default "text"): CLI output schema; choose "text" for Rich summaries or "json" for machine-readable results.
format is available only through the CLI flag (--format text|json). The Python helper always returns a PerceiveResult.
Returns: PerceiveResult object:
  • text (str): Answer to your question.
  • reasoning (str | None): Chain-of-thought when reasoning=True.
  • boxes, points, polygons (list | None): Populated according to the expects you requested. Each bucket has a matching result.boxes_to_pixels / result.points_to_pixels / result.polygons_to_pixels helper for normalized → pixel conversion.
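
As a quick illustration of how expects steers the buckets, here is a minimal sketch that requests point citations instead of boxes. (Whether each point exposes a .mention label the way boxes do is an assumption here; verify it against your SDK version.)

from perceptron import image, question

result = question(
    image("studio_scene.webp"),
    "Point to every light source.",
    expects="point",              # fills result.points; boxes/polygons stay None
)

for point in result.points or []:
    print(point.mention, point)  # normalized 0-1000 coordinates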

Example: Studio scene walkthrough

In this example we download a photo of a studio scene, ask what stands out, and overlay the returned bounding boxes so operators can see the cited evidence.

from pathlib import Path
from urllib.request import urlretrieve

from perceptron import configure, image, question
from PIL import Image as PILImage, ImageDraw

configure(
    provider="perceptron",
    model="isaac-0.3-max",
    api_key="YOUR_API_KEY",
)

# Download reference image
IMAGE_URL = "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/main/cookbook/_shared/assets/capabilities/qna/studio_scene.webp"
IMAGE_PATH = Path("studio_scene.webp")
ANNOTATED_PATH = Path("studio_scene_annotated.png")

if not IMAGE_PATH.exists():
    urlretrieve(IMAGE_URL, IMAGE_PATH)

# Ask a grounded question
result = question(
    image(str(IMAGE_PATH)),
    "What stands out in this studio scene? Call out props or people with boxes.",
    expects="box",
)

print(result.text)

# Draw citations
img = PILImage.open(IMAGE_PATH).convert("RGB")
draw = ImageDraw.Draw(img)
pixel_boxes = result.boxes_to_pixels(width=img.width, height=img.height) or []

for box in pixel_boxes:
    draw.rectangle(
        [
            int(box.top_left.x),
            int(box.top_left.y),
            int(box.bottom_right.x),
            int(box.bottom_right.y),
        ],
        outline="cyan",
        width=3,
    )
    label = box.mention or "answer"
    draw.text((int(box.top_left.x), max(int(box.top_left.y) - 18, 0)), label, fill="cyan")

img.save(ANNOTATED_PATH)
print(f"Saved annotated image to {ANNOTATED_PATH}")
All spatial outputs use a 0-1000 normalized coordinate system. Convert with the matching helper (result.boxes_to_pixels(width, height), result.points_to_pixels(...), or result.polygons_to_pixels(...)) before rendering overlays; see the coordinate system guide for more patterns.
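
If you ever need to convert a single value by hand, here is a minimal sketch of the underlying arithmetic, assuming plain linear scaling of the 0-1000 range:

def to_pixels(norm: float, extent: int) -> int:
    # Map a 0-1000 normalized coordinate onto an axis of `extent` pixels
    # (image width for x, image height for y).
    return round(norm / 1000 * extent)

print(to_pixels(500, 1920))  # midpoint of a 1920px-wide image -> 960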

CLI usage

Run image Q&A from the CLI by passing the image, question, and desired output preferences:
perceptron question <image_path_or_url> "<prompt>" [--expects text|point|box|polygon] [--format text|json] [--stream]
Examples:
# Text-only answer
perceptron question studio_scene.webp "What is on the desk?"

# Grounded citations with JSON output
perceptron question studio_scene.webp "Which lights are on?" --expects box --format json
The CLI auto-detects video paths (.mp4) and routes them to a video() node. See Video Q&A for the video-specific walkthrough.
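For orientation, the Python side of the video path looks like this minimal sketch (see Video Q&A for the supported options and file types):

from perceptron import question, video

# video() wraps a clip the same way image() wraps a still.
result = question(video("walkthrough.mp4"), "Summarize what happens in this clip.")
print(result.text)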
Run through the full Jupyter notebook here. Reach out to Perceptron support if you have questions.