The Perceptron Vision MCP Server enables AI assistants to access Perceptron’s powerful vision-language capabilities directly within their workflows. Built on the Model Context Protocol, it gives your agents the ability to see and reason about images — captioning, object detection, OCR, and visual Q&A — without writing any integration code. With native local file support, your agent can point to any local image; the server handles the rest, reading the file and sending it directly to the Perceptron API.

Before you begin

You need the following to use the MCP server:
  • Perceptron API key — to authenticate requests from the MCP server.
  • Node.js (LTS) — required to run the MCP server package via npx.

Create an API key

Get your key from the Perceptron platform

Quick setup

Get started instantly with one-click installers:

Install in Cursor

Install in VS Code

Or follow the manual setup steps below.

Manual setup

Run the following command (recommended):
claude mcp add perceptron -e PERCEPTRON_API_KEY=YOUR_API_KEY -- npx -y @perceptron-ai/mcp-server@latest
Replace YOUR_API_KEY with your actual Perceptron API key.
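For MCP clients configured through a JSON file rather than a CLI (for example Cursor or VS Code), an equivalent entry looks like the sketch below. The exact file location and schema depend on your client, and the server name `perceptron` is just a label:

```json
{
  "mcpServers": {
    "perceptron": {
      "command": "npx",
      "args": ["-y", "@perceptron-ai/mcp-server@latest"],
      "env": {
        "PERCEPTRON_API_KEY": "YOUR_API_KEY"
      }
    }
  }
}
```

As with the CLI command, replace YOUR_API_KEY with your actual Perceptron API key.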

Available tools

The MCP server provides four tools that give your AI agent vision capabilities:

caption

Generate a natural-language caption for an image. Ideal for describing screenshots, photos, or any visual content your agent encounters.

detect

Detect and locate objects in an image. Returns bounding boxes and labels for identified objects — perfect for analyzing UI mockups, counting items, or understanding scene composition.

ocr

Extract text from an image using optical character recognition. Use it to read receipts, documents, signs, or any image containing text.

question

Ask a question about an image and get an answer. Great for visual Q&A tasks like identifying colors, reading labels, or understanding context in a photo.
Each tool works directly with local image files — no need to upload or host images. The MCP server reads files locally and sends them directly to the API, avoiding large base64 payloads in the conversation context for fast, lightweight processing. Results include text responses and optional grounded geometry (points, boxes, or polygons) on a normalized 0-1000 coordinate system.
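Because grounded geometry comes back on a normalized 0-1000 scale, mapping it onto the source image only requires a rescale by the image dimensions. A minimal Python sketch (the `(x0, y0, x1, y1)` box layout here is illustrative, not necessarily the server's exact response schema):

```python
def box_to_pixels(box, image_width, image_height):
    """Convert a bounding box on the normalized 0-1000 grid
    (x0, y0, x1, y1) to pixel coordinates for a given image size."""
    x0, y0, x1, y1 = box
    return (
        round(x0 / 1000 * image_width),
        round(y0 / 1000 * image_height),
        round(x1 / 1000 * image_width),
        round(y1 / 1000 * image_height),
    )

# Example: a box spanning the left half of a 1920x1080 image
print(box_to_pixels((0, 0, 500, 1000), 1920, 1080))  # (0, 0, 960, 1080)
```

The same rescale applies to points and polygon vertices, since all grounded geometry shares the 0-1000 coordinate system regardless of the image's actual resolution.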
See the Models section for all available model IDs, or call list_resources at runtime.

Example usage

Once connected, your AI agent can call Perceptron tools directly. Here are some example prompts:
  • “Caption this screenshot” — the agent calls caption and returns a description
  • “Find all the buttons in this UI mockup” — the agent calls detect with the relevant classes
  • “Read the text from this receipt” — the agent calls ocr to extract structured text
  • “What color is the car in this photo?” — the agent calls question with your query
For troubleshooting and additional details, visit our GitHub repository. Reach out to Perceptron support or join our Discord if you have questions.