Understand how Perceptron Mk1 counts tokens for images and video so you can estimate costs and optimize preprocessing pipelines.
Image Token Counting
The gateway smart-resizes any image whose native patch count exceeds the 6,144-patch cap before tokenization, preserving aspect ratio.

- Native resolution: Processes images at their original resolution; supports a wide range of aspect ratios.
- Patch size: 16 × 16 pixels.
- Spatial merge size: 2 × 2 (4 patches merge into a single token).
- Token formula: `⌈width / 32⌉ × ⌈height / 32⌉`.
- Minimum: 256 patches → 64 tokens (smaller images are auto-upscaled).
- Maximum: 6,144 patches → ~1,508 tokens for 16:9 inputs. The nominal ceiling is 6,144 / 4 = 1,536 tokens, but resized dimensions must be divisible by 32, which puts the practical ceiling for 16:9 at ~1,508.
- Smart-resize is silent: no warning in the response. To keep deterministic control over input quality, pre-resize client-side before uploading.
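
The rules above can be sketched as a small estimator. This is a minimal sketch, not the gateway's implementation: the name `estimate_image_tokens` is hypothetical, and for images that trigger a resize it returns only the nominal bounds (64 tokens at the floor, 6,144 / 4 = 1,536 at the cap), because the exact smart-resize rounding is not specified here.

```python
import math

PATCH = 16            # patch edge in pixels
MERGE = 2             # 2 x 2 spatial merge
CELL = PATCH * MERGE  # one token covers 32 x 32 pixels
MIN_PATCHES = 256
MAX_PATCHES = 6144


def estimate_image_tokens(width, height):
    """Return (tokens, resized) for an image of the given pixel size.

    `resized` is True when the native patch count falls outside
    [MIN_PATCHES, MAX_PATCHES]; in that case `tokens` is only the
    nominal bound, not the exact post-resize count.
    """
    patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    if patches > MAX_PATCHES:
        return MAX_PATCHES // MERGE**2, True   # upper bound: 1,536
    if patches < MIN_PATCHES:
        return MIN_PATCHES // MERGE**2, True   # floor: 64
    tokens = math.ceil(width / CELL) * math.ceil(height / CELL)
    return tokens, False


print(estimate_image_tokens(1024, 1024))  # (1024, False)
print(estimate_image_tokens(3840, 2160))  # (1536, True)
```

For 16:9 inputs above the cap, expect the billed count to land slightly below the 1,536 bound, at about 1,508 tokens.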
Video Token Counting
Dynamic resolution and frame rate: video is sampled at `target_fps = 2`, with the sampled frame count clamped between `min_frames = 2` and `max_frames = 256`. Frames are smart-resized at the original aspect ratio so that all sampled frames together fit within the total video patch budget `max_patches_per_video = 131072`.
- Patch size: 16 × 16 pixels.
- Spatial merge size: 2 × 2 = 4 spatial patches per token.
- Temporal patch size: 2 frames.
- Effective token cell: 32 × 32 pixels across 2 frames.
- Sampled frames: `clamp(duration_seconds × 2, 2, 256)`, rounded to a multiple of 2.
- Token formula: `ceil(sampled_frames / 2) × ceil(width / 32) × ceil(height / 32)`.
- Max video tokens: `131072` is the per-video patch budget; the effective cap is about 16K video tokens per video.
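
The sampling and token formulas above can be sketched as follows. The function names are hypothetical, and the rounding is an assumption: the spec says only "rounded to a multiple of 2", so this sketch rounds to the nearest frame and then up to an even count. The result is the native count, before any smart-resize; videos over the patch budget will be billed less.

```python
import math

TARGET_FPS = 2
MIN_FRAMES = 2
MAX_FRAMES = 256
TEMPORAL_PATCH = 2  # frames per temporal patch
CELL = 32           # one token cell covers 32 x 32 pixels


def sampled_frames(duration_seconds):
    """Frames sampled at 2 fps, clamped to [2, 256], forced even."""
    n = round(duration_seconds * TARGET_FPS)
    n = max(MIN_FRAMES, min(n, MAX_FRAMES))
    n += n % TEMPORAL_PATCH  # assumption: odd counts round up
    return min(n, MAX_FRAMES)


def native_video_tokens(duration_seconds, width, height):
    """Token count at native resolution, ignoring the patch budget."""
    frame_pairs = math.ceil(sampled_frames(duration_seconds) / TEMPORAL_PATCH)
    return frame_pairs * math.ceil(width / CELL) * math.ceil(height / CELL)


print(sampled_frames(5))                    # 10
print(native_video_tokens(5, 1280, 720))    # 4600
print(native_video_tokens(5, 1920, 1080))   # 10200
```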
Constraints
- Context window: 32K tokens (image + video + text + reasoning + answer all share the same budget).
- Supported MIME types: `image/png`, `image/jpeg`, `image/webp`, `video/mp4`, `video/webm`.
Pricing
Pricing for Perceptron Mk1:

- Input: $0.15 per million tokens ($0.15/MT)
- Output: $1.50 per million tokens ($1.50/MT)
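
As a quick illustration of the arithmetic, the rates above translate into a per-request cost like this. The helper name `request_cost_usd` is hypothetical; only the two rates come from this page.

```python
INPUT_USD_PER_MTOK = 0.15   # input rate from this page
OUTPUT_USD_PER_MTOK = 1.50  # output rate from this page


def request_cost_usd(input_tokens, output_tokens=0):
    """Cost in USD for one request at the listed per-million-token rates."""
    total = (input_tokens * INPUT_USD_PER_MTOK
             + output_tokens * OUTPUT_USD_PER_MTOK)
    return total / 1_000_000


# A 512 x 512 image is 256 input tokens.
print(f"{request_cost_usd(256):.7f}")  # 0.0000384
```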
Common Image Sizes
Token counts and costs for common image resolutions at native resolution.

| Resolution | Dimensions | Native Patches | Tokens billed | Cost (Input) | Per 1K Images |
|---|---|---|---|---|---|
| 512×512 | 512×512 | 1,024 | 256 | $0.0000384 | $0.04 |
| VGA | 640×480 | 1,200 | 300 | $0.000045 | $0.05 |
| HD (720p) | 1280×720 | 3,600 | 920 | $0.000138 | $0.14 |
| 1024×1024 | 1024×1024 | 4,096 | 1,024 | $0.0001536 | $0.15 |
| Full HD (1080p) | 1920×1080 | 8,160 | 1,508* | $0.000226 | $0.23 |
| 2K | 2560×1440 | 14,400 | 1,508* | $0.000226 | $0.23 |
| 4K | 3840×2160 | 32,400 | 1,508* | $0.000226 | $0.23 |
| 8K | 7680×4320 | 129,600 | 1,508* | $0.000226 | $0.23 |
*Anything at or above ~Full HD exceeds the 6,144-patch cap and is server-side smart-resized down to ~1,508 tokens before tokenization (16:9 aspect ratio). The native counts in the “Native Patches” column are what the image would produce at original resolution — you’re billed for the post-resize tokens. Pre-resize client-side if you want to control resize quality or use a specific target resolution.
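
If you pre-resize client-side, one way to pick a target size is sketched below. This is an assumption-laden sketch, not the gateway's smart-resize algorithm: the helper `fit_to_patch_cap` is hypothetical, and it simply scales the 32-pixel token grid down uniformly and floors each side, so the result stays within the cap and has dimensions divisible by 32. Aspect ratio is preserved only approximately because of the rounding.

```python
import math

CELL = 32                 # one token covers 32 x 32 pixels
MAX_PATCHES = 6144
MAX_CELLS = MAX_PATCHES // 4  # 1,536 token cells


def fit_to_patch_cap(width, height):
    """Return a (width, height) that stays within the 6,144-patch cap.

    Images already under the cap are returned unchanged.
    """
    cols = math.ceil(width / CELL)
    rows = math.ceil(height / CELL)
    if cols * rows <= MAX_CELLS:
        return width, height
    scale = math.sqrt(MAX_CELLS / (cols * rows))
    new_cols = max(1, math.floor(cols * scale))
    new_rows = max(1, math.floor(rows * scale))
    return new_cols * CELL, new_rows * CELL


print(fit_to_patch_cap(1920, 1080))  # (1664, 928)
print(fit_to_patch_cap(1024, 1024))  # (1024, 1024)
```

For 1920×1080 this yields a 52 × 29 token grid, or 1,508 tokens, which matches the ceiling quoted for 16:9 inputs. The returned size can be passed to whatever image library you use for the actual resampling.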
Common Video Costs
At `target_fps = 2`, two frames are sampled per second of input video. Tokens per video are computed with the video token formula above.
| Duration | Sampled Frames | Resolution | Tokens | Cost (Input) |
|---|---|---|---|---|
| 5 s | 10 | 720p (1280×720) | 4,600 | $0.000690 |
| 5 s | 10 | 1080p (1920×1080) | 10,200 | $0.001530 |
| 10 s | 20 | 720p | 9,200 | $0.001380 |
| 10 s | 20 | 1080p | 20,400 | $0.003060 |
| 30 s | 60 | 720p | 27,600 | $0.004140 |
| ≥128 s* | 256 (capped) | 720p smart-resized | ~16,000 | ~$0.00240 |
| 10 min* | 256 (capped) | smart-resized to fit budget | ~16,000 | ~$0.00240 |
*Beyond ~128 seconds, the sampler caps at 256 frames. Frames are smart-resized at the original aspect ratio so total patch count fits within `max_patches_per_video = 131072`. The effective token cap is ~16K per video regardless of source duration or resolution. The 30 s × 1080p combination (61,200 tokens at native) exceeds the patch budget and triggers smart-resize in practice.

Optimization Guidance
Recommended Resolutions
We recommend passing in the original resolution of the image. If the resolution approaches Mk1’s context budget, we recommend client-side preprocessing. Lower resolution can erode quality but may improve latency and reduce token counts.

Video preprocessing
Practical implications of the Video Token Counting spec above:

- Sampler caps clip length to ~128 seconds: at `target_fps = 2` and `max_frames = 256`, anything longer is truncated.
- Frames are smart-resized at the original aspect ratio so the total patch count across all sampled frames fits within `max_patches_per_video = 131072`. The more frames sampled, the lower each frame’s effective resolution.
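
To make the frames-versus-resolution trade-off concrete, here is a rough sketch of the per-frame budget. It assumes the budget is split evenly across sampled frames, which this page does not state explicitly; the helper name `per_frame_budget` is hypothetical.

```python
MAX_PATCHES_PER_VIDEO = 131072
PATCHES_PER_CELL = 4  # 2 x 2 spatial merge


def per_frame_budget(frames):
    """Return (patches, token_cells) available to each sampled frame."""
    patches = MAX_PATCHES_PER_VIDEO // frames
    return patches, patches // PATCHES_PER_CELL


print(per_frame_budget(10))   # (13107, 3276)
print(per_frame_budget(256))  # (512, 128)
```

At the 256-frame cap each frame gets 512 patches, or 128 token cells. With 128 frame pairs that comes to 128 × 128 = 16,384 tokens, consistent with the ~16K cap. Trimming a clip or lowering its sampled frame count before upload leaves more of the budget for each frame.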