Tips

Little prompt tweaks go a long way:

Input & output selection:
- Explicitly specify intended response style (text, points, boxes, polygons) instead of letting the model decide.
- Supply multiple images when in-context examples are useful.
Prompt engineering:
- Tighten language for single- vs. multi-target pointing.
- Trim verbosity.
- Ask for rationale or step-by-step thinking.

Examples of prompts by task type

Object detection (grounding)

Identify objects, people, animals—anything you can describe.

Bound the homes.

Object attribute detection (grounding)

Target roles, colors, or actions (e.g., “the person in a yellow vest”).

Bound the blue books.

Counting (grounding)

Tally every instance of a specified object or entity.

How many boxes are there?

Scene captioning (VQA)

Ask open-ended questions about the scene for narrated summaries.

Describe what’s happening in detail.

OCR (VQA)

Read signage, labels, and handwriting while preserving spatial context.

What does this say?

Prompt engineering tips

Use these snippets when you want to tighten instructions and better control the model’s output.

Single-target pointing

Natural language usually works (e.g., “Where is the forklift?”). If precision drops, anchor the task:

Your goal is to segment out the following category: INSERT_TARGET.

Multi-target pointing

Request multiple targets with a comma-separated list:

Your goal is to segment out the following categories: TARGET_1, TARGET_2, TARGET_3.

Different orderings can change results—experiment when classes overlap.

Concise responses

Trim verbose answers by appending one of these lines:

Respond in 1–2 sentences.

Be concise.

Accurate rationale

If the model’s reasoning drifts, nudge it toward step-by-step thinking:

Think step by step.

Counting with grounding

For precise counts, include grounded instructions such as:

How many crates are there? Point to each.

In-context learning

When words fall short, show Isaac what you mean—attach exemplars of defects, rare components, or tricky textures so the model can mimic your labels.

Leveraging the demo to iterate on prompts

Use the demo controls to trial new data combinations and rapidly compare prompt variations.

Adding multiple images as input

Use the “+” button in the image carousel to supply extra reference shots or even multiple carousels. This unlocks in-context learning when text alone can’t describe the target.

Choosing response style

Explicitly pick the response style (text-only vs. grounded outputs) so results match your downstream UI.

Get Started

Capabilities

Developer Guides

Scaling & deployment

Best practices

Prompting guide

Tips

Examples of prompts by task type

Object detection (grounding)

Object attribute detection (grounding)

Counting (grounding)

Scene captioning (VQA)

OCR (VQA)

Prompt engineering tips

Single-target pointing

Multi-target pointing

Concise responses

Accurate rationale

Counting with grounding

In-context learning

Leveraging the demo to iterate on prompts

Adding multiple images as input

Choosing response style

Get Started

Capabilities

Developer Guides

Scaling & deployment

Best practices

​Tips

​Examples of prompts by task type

​Object detection (grounding)

​Object attribute detection (grounding)

​Counting (grounding)

​Scene captioning (VQA)

​OCR (VQA)

​Prompt engineering tips

​Single-target pointing

​Multi-target pointing

​Concise responses

​Accurate rationale

​Counting with grounding

​In-context learning

​Leveraging the demo to iterate on prompts

​Adding multiple images as input

​Choosing response style

Tips

Examples of prompts by task type

Object detection (grounding)

Object attribute detection (grounding)

Counting (grounding)

Scene captioning (VQA)

OCR (VQA)

Prompt engineering tips

Single-target pointing

Multi-target pointing

Concise responses

Accurate rationale

Counting with grounding

In-context learning

Leveraging the demo to iterate on prompts

Adding multiple images as input

Choosing response style