Perceptron Mk1 takes a top-level vision_config body field to trigger thinking and grounding. See the API reference for details.
Quick reference
| Task | SDK Helper | Optimal Prompt |
|---|---|---|
| Concise caption | caption(style="concise") | Provide a concise, human-friendly caption for the upcoming image. |
| Detailed caption | caption(style="detailed") | Provide a detailed caption describing key objects, relationships, and context in the upcoming image. |
| OCR | ocr() | System: You are an OCR system. Accurately detect, extract, and transcribe all readable text from the image. |
| General detection | detect() | Your goal is to segment out the objects in the scene |
| Class detection | detect(classes=[...]) | Your goal is to segment out the following categories: {categories} |
| Visual Q&A | question() | Pass your question directly as user content |
| Grounded Q&A | question(expects="box") | Same question, model returns boxes with answers |
| Counting | question() | How many {objects} are there? Point to each. |
| Video Clipping | question(video(...), expects="clip") | Clip the moment {event}. |
Caption
| Style | Prompt |
|---|---|
| concise | Provide a concise, human-friendly caption for the upcoming image. |
| detailed | Provide a detailed caption describing key objects, relationships, and context in the upcoming image. |
SDK
from perceptron import configure, caption
configure(provider="perceptron", api_key="YOUR_API_KEY")
result = caption("image.jpg", style="concise")
print(result.text)
curl
curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
    "model": "perceptron-mk1",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "<image-url>"}},
          {"type": "text", "text": "Provide a concise, human-friendly caption for the upcoming image."}
        ]
      }
    ],
    "vision_config": { "enable_thinking": true }
  }'
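For callers who want the same request from Python without the SDK, the curl call above can be sketched with the standard library alone. This is a minimal sketch, not the SDK's implementation: `build_caption_request` is a hypothetical helper name, the image URL is a placeholder, and the request is only built, not sent.

```python
import json
import os
import urllib.request

API_URL = "https://api.perceptron.inc/v1/chat/completions"


def build_caption_request(image_url, prompt, api_key, enable_thinking=True):
    """Build (but do not send) the POST request from the curl example.

    Hypothetical helper for illustration; it is not part of the perceptron SDK.
    """
    payload = {
        "model": "perceptron-mk1",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        # Top-level vision_config, mirroring the curl body.
        "vision_config": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_caption_request(
    "https://example.com/image.jpg",  # placeholder image URL (assumption)
    "Provide a concise, human-friendly caption for the upcoming image.",
    os.environ.get("PERCEPTRON_API_KEY", "YOUR_API_KEY"),
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it; the response shape is not covered here, so consult the API reference before parsing it.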
OCR
System instruction:
You are an OCR (Optical Character Recognition) system. Accurately detect, extract, and transcribe all readable text from the image.
SDK
from perceptron import configure, ocr
configure(provider="perceptron", api_key="YOUR_API_KEY")
result = ocr("document.png")
print(result.text)
# With custom prompt
result = ocr("document.png", prompt="Extract the table data as CSV")
print(result.text)
curl
curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
    "model": "perceptron-mk1",
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are an OCR (Optical Character Recognition) system. Accurately detect, extract, and transcribe all readable text from the image."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "<image-url>"}}
        ]
      }
    ],
    "vision_config": { "enable_thinking": true }
  }'
Detect
| Mode | Prompt |
|---|---|
| General | Your goal is to segment out the objects in the scene |
| With classes | Your goal is to segment out the following categories: {categories} |
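When calling the HTTP API directly, the `{categories}` placeholder has to be filled in by hand. The sketch below assumes a comma-separated list, which is what the curl example in this section uses; `build_detect_prompt` is a hypothetical helper, and whether the SDK's `detect()` formats the prompt the same way internally is not confirmed here.

```python
GENERAL_PROMPT = "Your goal is to segment out the objects in the scene"
CLASS_PROMPT = "Your goal is to segment out the following categories: {categories}"


def build_detect_prompt(classes=None):
    """Return the detection prompt for the given classes.

    Hypothetical helper: with no classes it falls back to the general prompt,
    otherwise it joins the class names with ", " (an assumption based on the
    curl example).
    """
    if not classes:
        return GENERAL_PROMPT
    return CLASS_PROMPT.format(categories=", ".join(classes))


print(build_detect_prompt(["forklift", "person", "pallet"]))
print(build_detect_prompt())
```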
SDK
from perceptron import configure, detect
configure(provider="perceptron", api_key="YOUR_API_KEY")
result = detect("warehouse.jpg", classes=["forklift", "person", "pallet"])
# result.points may be None when nothing is detected, hence the "or []"
for box in result.points or []:
    print(f"{box.mention}: ({box.top_left.x}, {box.top_left.y}) to ({box.bottom_right.x}, {box.bottom_right.y})")
curl
curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
    "model": "perceptron-mk1",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "<image-url>"}},
          {"type": "text", "text": "Your goal is to segment out the following categories: forklift, person, pallet"}
        ]
      }
    ],
    "vision_config": { "annotation_format": "box" }
  }'
Question
Pass your question directly as user content. For grounded responses, set expects="box" or expects="point".
SDK
from perceptron import configure, question
configure(provider="perceptron", api_key="YOUR_API_KEY")
# Simple Q&A
result = question("factory.jpg", "How many workers are visible?")
print(result.text)
# Grounded Q&A (with bounding boxes)
result = question("factory.jpg", "Where is the safety equipment?", expects="box")
for box in result.points or []:
    print(f"{box.mention}: ({box.top_left.x}, {box.top_left.y})")
curl
curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
    "model": "perceptron-mk1",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "<image-url>"}},
          {"type": "text", "text": "Where is the safety equipment?"}
        ]
      }
    ],
    "vision_config": { "annotation_format": "box" }
  }'
Clip (video temporal segments)
Use expects="clip" to ask the model to localize when an event happens in a video. The model returns its answer with inline <clip t="..." mention="..."> tags, which the SDK parses into Clip objects with start (and optional end) timestamps. Available on Perceptron Mk1.
| Prompt shape | Example |
|---|---|
| Single event | Clip the exact moment {event}. |
| Multiple events | Clip every {event}. Use the <clip> tag for each occurrence. |
| Event + justification | Is {condition} true? Return a clip to justify your answer. Use the <clip> tag to specify clips. |
SDK
from perceptron import configure, question, video
configure(provider="perceptron", model="perceptron-mk1", api_key="YOUR_API_KEY")
result = question(
    video("highlights.mp4"),
    "Clip the exact moment the ball passes through the hoop.",
    expects="clip",
    reasoning=True,
)
print(result.text)
for clip in result.clips or []:
    ts = clip.timestamp
    # ts.until is None for a single instant; otherwise the clip spans ts.at to ts.until
    window = f"@{ts.at:.2f}s" if ts.until is None else f"{ts.at:.2f}s → {ts.until:.2f}s"
    print(f"{window} - {clip.mention or '(no mention)'}")
curl
curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
    "model": "perceptron-mk1",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "video_url", "video_url": {"url": "<video-url>"}},
          {"type": "text", "text": "Clip the exact moment the ball passes through the hoop."}
        ]
      }
    ],
    "vision_config": { "annotation_format": "clip", "enable_thinking": true }
  }'
The model emits <clip t="3.2s">made shot</clip> (single moment) or <clip t="3.2s,5.1s">drive to the basket</clip> (range). When clip.timestamp.until is None, the model is pointing at an instant rather than a span.
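The SDK parses these tags for you. When working with the raw HTTP response text instead, they can be extracted with a regular expression. The following is a minimal sketch that assumes the tag grammar is exactly what the two examples above show (a `t` attribute holding one or two comma-separated values in seconds with an `s` suffix, and the mention as the tag body); `ParsedClip` and `parse_clips` are hypothetical names, not SDK types.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches <clip t="...">mention</clip>; any extra attributes are ignored.
CLIP_RE = re.compile(r'<clip\s+t="([^"]*)"[^>]*>(.*?)</clip>', re.DOTALL)


@dataclass
class ParsedClip:
    start: float          # seconds
    end: Optional[float]  # None when the model points at a single instant
    mention: str


def _seconds(value: str) -> float:
    """Convert a value such as '3.2s' to 3.2 (assumes a trailing 's' unit)."""
    return float(value.strip().rstrip("s"))


def parse_clips(text: str) -> List[ParsedClip]:
    """Extract every <clip> tag from the model's text output."""
    clips = []
    for t_attr, mention in CLIP_RE.findall(text):
        parts = t_attr.split(",")
        start = _seconds(parts[0])
        end = _seconds(parts[1]) if len(parts) > 1 else None
        clips.append(ParsedClip(start, end, mention.strip()))
    return clips


sample = (
    'The shot goes in: <clip t="3.2s">made shot</clip> after '
    '<clip t="3.2s,5.1s">drive to the basket</clip>.'
)
for clip in parse_clips(sample):
    print(clip)
```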
Grounding on Perceptron Mk1 (vision_config body field)
Mk1 takes a top-level vision_config object.
Pick the right enable_thinking value for your task: on for text Q&A and clip, off for point/box/polygon.
Example: spatial detection (thinking off)
curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
    "model": "perceptron-mk1",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "<image-url>"}},
          {"type": "text", "text": "Find all the safety equipment."}
        ]
      }
    ],
    "vision_config": { "annotation_format": "box" }
  }'
Example: text reasoning (thinking on)
curl -X POST "https://api.perceptron.inc/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PERCEPTRON_API_KEY" \
  -d '{
    "model": "perceptron-mk1",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "<image-url>"}},
          {"type": "text", "text": "Are all workers properly equipped? Explain why each piece of gear matters."}
        ]
      }
    ],
    "vision_config": { "enable_thinking": true }
  }'
Field reference for vision_config:
| Field | Values | Purpose |
|---|---|---|
| annotation_format | point / box / polygon / clip | Grounded output format. clip is video-only. |
| enable_thinking | true / false | Chain-of-thought reasoning. |
| internal_tools.focus | true / false | Let the model zoom into a region and call itself again on that crop. |
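The guidance in this section (thinking on for text Q&A and clip, off for point/box/polygon, and clip being video-only) can be encoded in a small helper. This is a sketch under assumptions: `build_vision_config` is a hypothetical function, and it sets `enable_thinking` to `false` explicitly for spatial formats, whereas the curl examples simply omit the field.

```python
SPATIAL_FORMATS = {"point", "box", "polygon"}


def build_vision_config(annotation_format=None, is_video=False):
    """Return a vision_config dict following the guidance in this section.

    Hypothetical helper, not part of the SDK or the API.
    """
    if annotation_format is None:
        # Plain text Q&A: no grounding, thinking on.
        return {"enable_thinking": True}
    if annotation_format == "clip":
        if not is_video:
            raise ValueError("annotation_format='clip' is video-only")
        return {"annotation_format": "clip", "enable_thinking": True}
    if annotation_format in SPATIAL_FORMATS:
        # Spatial grounding: thinking off.
        return {"annotation_format": annotation_format, "enable_thinking": False}
    raise ValueError(f"unknown annotation_format: {annotation_format!r}")


print(build_vision_config("box"))
print(build_vision_config("clip", is_video=True))
print(build_vision_config())
```

The returned dict goes at the top level of the request body, next to `model` and `messages`.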
Advanced: @perceive decorator
Use the @perceive decorator when you need full control over prompts, reasoning, and structured output.
With reasoning
from perceptron import configure, perceive, image, text
configure(provider="perceptron", api_key="YOUR_API_KEY")
@perceive(model="perceptron-mk1", max_tokens=4096, reasoning=True)
def count_objects(img_url: str, query: str):
    # The decorated function returns the prompt: an image followed by text
    return image(img_url) + text(query)

result = count_objects(
    "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/main/cookbook/_shared/assets/capabilities/caption/suburban_street.webp",
    "Count the number of cars, excluding buses. Return JSON."
)
print(result.text)
With structured output (Pydantic)
from pydantic import BaseModel, Field
from typing import Literal
from perceptron import configure, perceive, image, text, pydantic_format
configure(provider="perceptron", api_key="YOUR_API_KEY")
class SceneAnalysis(BaseModel):
    scene_type: Literal["urban", "nature"]
    main_subjects: list[str] = Field(description="Primary objects in the scene")
    mood: Literal["energetic", "peaceful", "tense"]
    time_of_day: Literal["day", "night", "unknown"]

@perceive(model="perceptron-mk1", response_format=pydantic_format(SceneAnalysis))
def analyze_scene(img_path: str):
    return image(img_path) + text("Analyze this scene. Output in JSON with scene type, subjects, mood and time of day.")
result = analyze_scene("photo.jpg")
analysis = SceneAnalysis.model_validate_json(result.text)
print(f"Scene type: {analysis.scene_type}")
print(f"Subjects: {analysis.main_subjects}")
print(f"Mood: {analysis.mood}")
print(f"Time: {analysis.time_of_day}")