Quick reference
| Task | SDK Helper | Optimal Prompt |
|---|---|---|
| Concise caption | caption(style="concise") | Provide a concise, human-friendly caption for the upcoming image. |
| Detailed caption | caption(style="detailed") | Provide a detailed caption describing key objects, relationships, and context in the upcoming image. |
| OCR | ocr() | System: You are an OCR system. Accurately detect, extract, and transcribe all readable text from the image. |
| General detection | detect() | Your goal is to segment out the objects in the scene |
| Class detection | detect(classes=[...]) | Your goal is to segment out the following categories: {categories} |
| Visual Q&A | question() | Pass your question directly as user content |
| Grounded Q&A | question(expects="box") | Same question, model returns boxes with answers |
| Counting | question() | How many {objects} are there? Point to each. |
Caption
| Style | Prompt |
|---|---|
concise | Provide a concise, human-friendly caption for the upcoming image. |
detailed | Provide a detailed caption describing key objects, relationships, and context in the upcoming image. |
SDK
curl
OCR
System instruction:SDK
curl
Detect
| Mode | Prompt |
|---|---|
| General | Your goal is to segment out the objects in the scene |
| With classes | Your goal is to segment out the following categories: {categories} |
SDK
curl
Question
Pass your question directly as user content. For grounded responses, setexpects="box" or expects="point".
SDK
curl
Grounding hints
When using the API directly, you can request specific output geometry using hint tags in the system message:| Hint | Output Type | Use Case |
|---|---|---|
<hint>BOX</hint> | Bounding boxes | Object detection, region selection |
<hint>POINT</hint> | Single points | Pointing, counting |
<hint>POLYGON</hint> | Polygons | Segmentation, irregular shapes |
<hint>THINK</hint> | Reasoning traces | Chain-of-thought, complex analysis |