Prompting reference - Perceptron Docs

Perceptron Mk1 takes a top-level `vision_config` body field that controls thinking and grounding. See the API reference for details.

Quick reference

Task	SDK Helper	Optimal Prompt
Concise caption	`caption(style="concise")`	`Provide a concise, human-friendly caption for the upcoming image.`
Detailed caption	`caption(style="detailed")`	`Provide a detailed caption describing key objects, relationships, and context in the upcoming image.`
OCR	`ocr()`	System: `You are an OCR (Optical Character Recognition) system. Accurately detect, extract, and transcribe all readable text from the image.`
General detection	`detect()`	`Your goal is to segment out the objects in the scene`
Class detection	`detect(classes=[...])`	`Your goal is to segment out the following categories: {categories}`
Visual Q&A	`question()`	Pass your question directly as user content
Grounded Q&A	`question(expects="box")`	Same question, model returns boxes with answers
Counting	`question()`	`How many {objects} are there? Point to each.`
Video clipping	`question(video(...), expects="clip")`	`Clip the exact moment {event}.`

Caption

Style	Prompt
`concise`	`Provide a concise, human-friendly caption for the upcoming image.`
`detailed`	`Provide a detailed caption describing key objects, relationships, and context in the upcoming image.`

SDK

from perceptron import configure, caption

configure(provider="perceptron", api_key="YOUR_API_KEY")

result = caption("image.jpg", style="concise")
print(result.text)

curl

curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
  "model": "perceptron-mk1",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "<image-url>"}},
        {"type": "text", "text": "Provide a concise, human-friendly caption for the upcoming image."}
      ]
    }
  ],
  "vision_config": { "enable_thinking": true }
}'

OCR

System instruction:

You are an OCR (Optical Character Recognition) system. Accurately detect, extract, and transcribe all readable text from the image.

SDK

from perceptron import configure, ocr

configure(provider="perceptron", api_key="YOUR_API_KEY")

result = ocr("document.png")
print(result.text)

# With custom prompt
result = ocr("document.png", prompt="Extract the table data as CSV")
print(result.text)

curl

curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
  "model": "perceptron-mk1",
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "You are an OCR (Optical Character Recognition) system. Accurately detect, extract, and transcribe all readable text from the image."}
      ]
    },
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "<image-url>"}}
      ]
    }
  ],
  "vision_config": { "enable_thinking": true }
}'

Detect

Mode	Prompt
General	`Your goal is to segment out the objects in the scene`
With classes	`Your goal is to segment out the following categories: {categories}`

SDK

from perceptron import configure, detect

configure(provider="perceptron", api_key="YOUR_API_KEY")

result = detect("warehouse.jpg", classes=["forklift", "person", "pallet"])

for box in result.points or []:
    print(f"{box.mention}: ({box.top_left.x}, {box.top_left.y}) to ({box.bottom_right.x}, {box.bottom_right.y})")

curl

curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
  "model": "perceptron-mk1",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "<image-url>"}},
        {"type": "text", "text": "Your goal is to segment out the following categories: forklift, person, pallet"}
      ]
    }
  ],
  "vision_config": { "annotation_format": "box" }
}'
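
When calling the HTTP API directly, the `{categories}` placeholder has to be filled by hand. The sketch below reproduces the prompt text used in the curl request above. The helper name `detect_prompt` is hypothetical and not part of the SDK, and joining classes with a comma and a space is an assumption taken from that one curl example.

```python
from typing import Optional, Sequence

# Prompt strings copied from the Detect table above.
GENERAL_PROMPT = "Your goal is to segment out the objects in the scene"
CLASS_PROMPT = "Your goal is to segment out the following categories: {categories}"


def detect_prompt(classes: Optional[Sequence[str]] = None) -> str:
    """Hypothetical helper: build the detection prompt for a raw API call.

    Assumption: categories are joined with ", " as in the curl example.
    """
    cleaned = [c.strip() for c in (classes or []) if c.strip()]
    if not cleaned:
        # No classes given: fall back to the general detection prompt.
        return GENERAL_PROMPT
    return CLASS_PROMPT.format(categories=", ".join(cleaned))


print(detect_prompt(["forklift", "person", "pallet"]))
print(detect_prompt())
```

With the SDK, `detect(classes=[...])` takes the class list directly, so a helper like this would only be needed for raw requests.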

Question

Pass your question directly as user content. For grounded responses, set `expects="box"` or `expects="point"`.

SDK

from perceptron import configure, question

configure(provider="perceptron", api_key="YOUR_API_KEY")

# Simple Q&A
result = question("factory.jpg", "How many workers are visible?")
print(result.text)

# Grounded Q&A (with bounding boxes)
result = question("factory.jpg", "Where is the safety equipment?", expects="box")
for box in result.points or []:
    print(f"{box.mention}: ({box.top_left.x}, {box.top_left.y})")

curl

curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
  "model": "perceptron-mk1",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "<image-url>"}},
        {"type": "text", "text": "Where is the safety equipment?"}
      ]
    }
  ],
  "vision_config": { "annotation_format": "box" }
}'

Clip (video temporal segments)

Use `expects="clip"` to ask the model to localize when an event happens in a video. The model returns its answer with inline self-closing `<clip />` tags, which the SDK parses into `Clip` objects with start (and optional end) timestamps. Available on Perceptron Mk1.

Prompt shape	Example
Single event	`Clip the exact moment {event}.`
Multiple events	`Clip every {event}. Use the <clip> tag for each occurrence.`
Event + justification	`Is {condition} true? Return a clip to justify your answer. Use the <clip> tag to specify clips.`

SDK

from perceptron import configure, question, video

configure(provider="perceptron", model="perceptron-mk1", api_key="YOUR_API_KEY")

result = question(
    video("highlights.mp4"),
    "Clip the exact moment the ball passes through the hoop.",
    expects="clip",
    reasoning=True,
)

print(result.text)
for clip in result.clips or []:
    ts = clip.timestamp
    # `until` is None when the model points at a single instant rather than a span.
    window = f"@{ts.at:.2f}s" if ts.until is None else f"{ts.at:.2f}s → {ts.until:.2f}s"
    print(f"{window} — {clip.mention or '(no mention)'}")

curl

curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
  "model": "perceptron-mk1",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "video_url", "video_url": {"url": "<video-url>"}},
        {"type": "text", "text": "Clip the exact moment the ball passes through the hoop."}
      ]
    }
  ],
  "vision_config": { "annotation_format": "clip", "enable_thinking": true }
}'

The model emits self-closing `<clip />` tags. The mention is an attribute, not body text. Timestamps go in the `t` attribute as whitespace-separated values, each followed by the literal unit `seconds`:

<clip mention="made shot" t="3.2 seconds" />                  <!-- single moment -->
<clip mention="drive to the basket" t="3.2 seconds 5.1 seconds" />  <!-- range -->
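
The SDK parses these tags for you. If you call the HTTP API directly, a rough sketch of the same parsing might look like the following. `ParsedClip` and `parse_clips` are hypothetical names, not SDK types, and the regular expressions assume the tag shape shown above (double-quoted attributes, no `>` inside attribute values).

```python
import re
from typing import List, NamedTuple, Optional


class ParsedClip(NamedTuple):
    """Hypothetical container; the SDK's own Clip type may differ."""
    mention: Optional[str]
    start: float
    end: Optional[float]  # None means a single instant


CLIP_TAG = re.compile(r"<clip\b([^>]*?)/>")         # one self-closing <clip ... /> tag
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')          # name="value" pairs inside the tag
SECONDS = re.compile(r"(\d+(?:\.\d+)?)\s+seconds")  # "3.2 seconds"


def parse_clips(text: str) -> List[ParsedClip]:
    """Extract clip tags from model output text."""
    clips = []
    for tag in CLIP_TAG.finditer(text):
        attrs = dict(ATTRIBUTE.findall(tag.group(1)))
        times = [float(value) for value in SECONDS.findall(attrs.get("t", ""))]
        if not times:
            continue  # skip tags without a usable timestamp
        end = times[1] if len(times) > 1 else None
        clips.append(ParsedClip(attrs.get("mention"), times[0], end))
    return clips


sample = (
    'He scores. <clip mention="made shot" t="3.2 seconds" /> '
    'Before that: <clip mention="drive to the basket" t="3.2 seconds 5.1 seconds" />'
)
clips = parse_clips(sample)
for clip in clips:
    print(clip)
```

A regex is used here rather than an XML parser because the tags appear inline within free text, which is not guaranteed to be well-formed XML.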

Multiple clips for the same event are typically grouped in a `<collection>`, whose `mention` is inherited by any child clip that omits its own:

<collection mention="ramp trick">
  <clip t="7.6 seconds 9.7 seconds" />
</collection>
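
The inheritance rule can be sketched with the standard library's XML parser. This is an illustration only: `clips_with_inherited_mentions` is a hypothetical name, the second clip in the sample (with its own `mention`) is invented to show the override case, and the function assumes the fragment is well-formed XML containing only tags.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def clips_with_inherited_mentions(fragment: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (mention, t) pairs, letting clips inherit the collection's mention."""
    # Wrap in a synthetic root so several top-level collections parse cleanly.
    root = ET.fromstring(f"<root>{fragment}</root>")
    results = []
    for collection in root.iter("collection"):
        inherited = collection.get("mention")
        for clip in collection.iter("clip"):
            # A clip's own mention wins; otherwise fall back to the collection's.
            results.append((clip.get("mention") or inherited, clip.get("t")))
    return results


fragment = """<collection mention="ramp trick">
  <clip t="7.6 seconds 9.7 seconds" />
  <clip mention="landing" t="9.7 seconds" />
</collection>"""

pairs = clips_with_inherited_mentions(fragment)
for mention, t in pairs:
    print(mention, "|", t)
```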

When `clip.timestamp.until` is `None`, the model is pointing at an instant rather than a span.

Grounding on Perceptron Mk1 (`vision_config` body field)

Mk1 takes a top-level `vision_config` object. Pick the right `enable_thinking` value for your task: on for text Q&A and clip, off for point/box/polygon.

Example: spatial detection (thinking off)

curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
  "model": "perceptron-mk1",
  "messages": [
    { "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "<image-url>"}},
        {"type": "text", "text": "Find all the safety equipment."}
      ]
    }
  ],
  "vision_config": { "annotation_format": "box" }
}'

Example: text reasoning (thinking on)

curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
  "model": "perceptron-mk1",
  "messages": [
    { "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "<image-url>"}},
        {"type": "text", "text": "Are all workers properly equipped? Explain why each piece of gear matters."}
      ]
    }
  ],
  "vision_config": { "enable_thinking": true }
}'
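
The rule of thumb for `enable_thinking` can be captured in a small helper when building request bodies by hand. This is a sketch, not an SDK function: `vision_config_for` is a hypothetical name, and leaving `enable_thinking` out for spatial formats mirrors the "thinking off" curl example above rather than a documented default.

```python
from typing import Optional

GROUNDED_FORMATS = {"point", "box", "polygon", "clip"}


def vision_config_for(annotation_format: Optional[str] = None) -> dict:
    """Hypothetical helper: pick enable_thinking based on the task."""
    if annotation_format is None:
        # Plain text Q&A: thinking on.
        return {"enable_thinking": True}
    if annotation_format not in GROUNDED_FORMATS:
        raise ValueError(f"unknown annotation_format: {annotation_format!r}")
    config = {"annotation_format": annotation_format}
    if annotation_format == "clip":
        # Clip localization: thinking on.
        config["enable_thinking"] = True
    # point/box/polygon: thinking off, so the field is omitted as in the example.
    return config


print(vision_config_for())
print(vision_config_for("box"))
print(vision_config_for("clip"))
```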

Field reference for `vision_config`:

Field	Values	Purpose
`annotation_format`	`point` / `box` / `polygon` / `clip`	Grounded output format. `clip` is video-only.
`enable_thinking`	`true` / `false`	Chain-of-thought reasoning.
`internal_tools.focus`	`true` / `false`	Let the model zoom into a region and call itself again on that crop.
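
A client-side check can catch mistakes in `vision_config` before a request is sent. The sketch below covers only the constraints stated in the table (allowed `annotation_format` values, boolean `enable_thinking`, and `clip` being video-only). `check_request` is a hypothetical name, and the server may apply further rules not modeled here.

```python
from typing import List

ANNOTATION_FORMATS = {"point", "box", "polygon", "clip"}


def check_request(body: dict) -> List[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    config = body.get("vision_config", {})

    fmt = config.get("annotation_format")
    if fmt is not None and fmt not in ANNOTATION_FORMATS:
        problems.append(f"unknown annotation_format: {fmt!r}")

    thinking = config.get("enable_thinking")
    if thinking is not None and not isinstance(thinking, bool):
        problems.append("enable_thinking must be true or false")

    if fmt == "clip":
        # `clip` is video-only: look for at least one video_url content part.
        has_video = any(
            part.get("type") == "video_url"
            for message in body.get("messages", [])
            for part in message.get("content", [])
            if isinstance(part, dict)
        )
        if not has_video:
            problems.append("annotation_format 'clip' requires a video_url content part")
    return problems


image_part = {"type": "image_url", "image_url": {"url": "<image-url>"}}
video_part = {"type": "video_url", "video_url": {"url": "<video-url>"}}

bad = {"messages": [{"role": "user", "content": [image_part]}],
       "vision_config": {"annotation_format": "clip"}}
good = {"messages": [{"role": "user", "content": [video_part]}],
        "vision_config": {"annotation_format": "clip", "enable_thinking": True}}

print(check_request(bad))
print(check_request(good))
```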

Advanced: `@perceive` decorator

Use the `@perceive` decorator for full control over prompts, reasoning, and structured output.

With reasoning

from perceptron import configure, perceive, image, text

configure(provider="perceptron", api_key="YOUR_API_KEY")

@perceive(model="perceptron-mk1", max_tokens=4096, reasoning=True)
def count_objects(img_url: str, query: str):
    return image(img_url) + text(query)

result = count_objects(
    "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/main/cookbook/_shared/assets/capabilities/caption/suburban_street.webp",
    "Count the number of cars, excluding buses. Return JSON."
)
print(result.text)

With structured output (Pydantic)

from pydantic import BaseModel, Field
from typing import Literal
from perceptron import configure, perceive, image, text, pydantic_format

configure(provider="perceptron", api_key="YOUR_API_KEY")

class SceneAnalysis(BaseModel):
    scene_type: Literal["urban", "nature"]
    main_subjects: list[str] = Field(description="Primary objects in the scene")
    mood: Literal["energetic", "peaceful", "tense"]
    time_of_day: Literal["day", "night", "unknown"]

@perceive(model="perceptron-mk1", response_format=pydantic_format(SceneAnalysis))
def analyze_scene(img_path: str):
    return image(img_path) + text("Analyze this scene. Output in JSON with scene type, subjects, mood and time of day.")

result = analyze_scene("photo.jpg")
analysis = SceneAnalysis.model_validate_json(result.text)
print(f"Scene type: {analysis.scene_type}")
print(f"Subjects: {analysis.main_subjects}")
print(f"Mood: {analysis.mood}")
print(f"Time: {analysis.time_of_day}")
