Understand how Perceptron’s vision-language models count tokens for images and video so you can estimate costs and optimize preprocessing pipelines.
Isaac 0.1 and 0.2 Token Counting
Isaac 0.1, Isaac 0.2 1B, and Isaac 0.2 2B (Preview) all use the same patch-based image tokenizer:
- Native resolution: Processes images at their original resolution; supports a wide range of aspect ratios.
- Patch size: 16 × 16 pixels.
- Spatial merge size: 2 × 2 (4 patches into a single token).
- Token formula: `⌈width / 32⌉ × ⌈height / 32⌉`.
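As a quick illustration, the token formula above can be written as a small Python helper. The name `image_tokens` is chosen here for illustration and is not part of any Perceptron SDK; the function applies only the formula and does not model the resize constraints.

```python
import math

CELL = 32  # 16-pixel patch side x 2 spatial merge = 32 pixels per token side


def image_tokens(width: int, height: int) -> int:
    """Token count from ceil(width / 32) * ceil(height / 32)."""
    return math.ceil(width / CELL) * math.ceil(height / CELL)


print(image_tokens(640, 480))   # 20 * 15 = 300
print(image_tokens(1280, 720))  # 40 * 23 = 920
```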
Constraints
- Minimum: 256 patches → 64 tokens.
- Maximum: 6,144 patches → 1,536 tokens.
- Images outside these bounds are automatically resized while maintaining aspect ratio.
- Due to the resize algorithm (dimensions must be divisible by 32), the practical maximum is typically around 1,508 tokens for common aspect ratios.
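The exact resize algorithm is not spelled out on this page. The sketch below assumes a common "scale by the square root of the pixel ratio, then snap to a multiple of 32" scheme. That assumption happens to reproduce the 1920×1080 → 1664×928 example on this page, but it should be treated as an estimate rather than the actual implementation. The function names are hypothetical, and extreme aspect ratios are not handled.

```python
import math

PATCH = 16
CELL = 32                                  # token cell side in pixels
MIN_PIXELS = 256 * PATCH * PATCH           # 256 patches
MAX_PIXELS = 6144 * PATCH * PATCH          # 6,144 patches


def estimate_resized_size(width: int, height: int) -> tuple[int, int]:
    """Estimate post-resize dimensions (assumed algorithm, see lead-in)."""
    # Start from the dimensions rounded up to multiples of 32.
    w = math.ceil(width / CELL) * CELL
    h = math.ceil(height / CELL) * CELL
    if w * h > MAX_PIXELS:
        # Too large: shrink, rounding down so the result stays under the cap.
        scale = math.sqrt(MAX_PIXELS / (width * height))
        w = max(CELL, math.floor(width * scale / CELL) * CELL)
        h = max(CELL, math.floor(height * scale / CELL) * CELL)
    elif w * h < MIN_PIXELS:
        # Too small: grow, rounding up so the result reaches the minimum.
        scale = math.sqrt(MIN_PIXELS / (width * height))
        w = math.ceil(width * scale / CELL) * CELL
        h = math.ceil(height * scale / CELL) * CELL
    return w, h


def estimate_tokens(width: int, height: int) -> int:
    w, h = estimate_resized_size(width, height)
    return (w // CELL) * (h // CELL)


print(estimate_resized_size(1920, 1080))  # (1664, 928)
print(estimate_tokens(1920, 1080))        # 1508
print(estimate_tokens(640, 480))          # 300
```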
Calculation Examples
Example 1: 640×480 (VGA), No Resize Needed
- Round dimensions to nearest multiple of 32: 640×480 (already divisible).
- Calculate patches: (640 ÷ 16) × (480 ÷ 16) = 40 × 30 = 1,200 patches.
- Calculate tokens: 1,200 patches ÷ 4 = 300 tokens.
- Check constraints: 256 ≤ 1,200 ≤ 6,144 ✓ (no resize needed).
- Cost (Isaac 0.x at $0.15/M input): 300 × ($0.15 / 1,000,000) = $0.000045.
Example 2: 1920×1080 (Full HD), Resize Needed
- Calculate original patches: (1920 ÷ 16) × (1080 ÷ 16) = 120 × 68 = 8,160 patches (1080 ÷ 16 = 67.5, rounded up to 68).
- Check constraints: 8,160 > 6,144 (exceeds maximum, resize needed).
- Resize to 1664×928 (maintains ~16:9 aspect ratio, divisible by 32).
- Calculate new patches: (1664 ÷ 16) × (928 ÷ 16) = 104 × 58 = 6,032 patches.
- Calculate tokens: 6,032 patches ÷ 4 = 1,508 tokens.
- Cost (Isaac 0.x at $0.15/M input): 1,508 × ($0.15 / 1,000,000) = $0.000226.
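The cost arithmetic in both examples can be checked with a short helper. The function name and default argument below are illustrative only; the price is the $0.15 per million input tokens quoted on this page.

```python
PRICE_PER_MILLION_USD = 0.15  # Isaac 0.x input price quoted on this page


def input_cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Input cost in USD for a given number of image tokens."""
    return tokens * price_per_million / 1_000_000


print(f"${input_cost_usd(300):.6f}")    # $0.000045
print(f"${input_cost_usd(1508):.6f}")   # $0.000226
# Cost per 1,000 images at the practical maximum of 1,508 tokens each:
print(f"${1000 * input_cost_usd(1508):.2f}")  # $0.23
```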
Perceptron Mk1 Token Counting
Image tokenization
- Native resolution: Processes images at their original resolution; supports a wide range of aspect ratios.
- Patch size: 16 × 16 pixels.
- Spatial merge size: 2 × 2 (4 patches into a single token).
- Token formula: `⌈width / 32⌉ × ⌈height / 32⌉`.
Video tokenization
- Dynamic resolution and frame rate: Samples video at `target_fps = 2`, with the sampled frame count clamped between `min_frames = 2` and `max_frames = 256`. Frames are smart-resized at the original aspect ratio so that the total patch count across all sampled frames fits within the video patch budget, `max_patches_per_video = 131072`.
- Patch size: 16 × 16 pixels.
- Spatial merge size: 2 × 2 = 4 spatial patches per token.
- Temporal patch size: 2 frames.
- Effective token cell: 32 × 32 pixels across 2 frames.
- Sampled frames: `clamp(duration_seconds × 2, 2, 256)`, rounded to a multiple of 2.
- Token formula: `ceil(sampled_frames / 2) × ceil(width / 32) × ceil(height / 32)`.
- Max video tokens: `131072` is the per-video patch budget; the effective cap is about 16K video tokens per video.
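The sampling and token formulas above can be sketched in Python as follows. Two points are assumptions rather than documented behavior: the page does not say whether the frame count rounds up or down to a multiple of 2 (this sketch rounds up), and `width` and `height` are taken to be the per-frame dimensions after any smart-resize. The function names are hypothetical.

```python
import math

TARGET_FPS = 2
MIN_FRAMES = 2
MAX_FRAMES = 256
TEMPORAL_PATCH = 2   # frames per token along the time axis
CELL = 32            # 16-pixel patch x 2 spatial merge


def sampled_frames(duration_seconds: float) -> int:
    """clamp(duration_seconds * 2, 2, 256), rounded to a multiple of 2."""
    clamped = min(max(duration_seconds * TARGET_FPS, MIN_FRAMES), MAX_FRAMES)
    # Assumption: round UP to the next even count; the direction is not documented.
    frames = math.ceil(clamped / TEMPORAL_PATCH) * TEMPORAL_PATCH
    return min(frames, MAX_FRAMES)


def video_tokens(frames: int, width: int, height: int) -> int:
    """ceil(frames / 2) * ceil(width / 32) * ceil(height / 32)."""
    return (
        math.ceil(frames / TEMPORAL_PATCH)
        * math.ceil(width / CELL)
        * math.ceil(height / CELL)
    )


frames = sampled_frames(10)             # 10-second clip
print(frames)                           # 20
print(video_tokens(frames, 640, 480))   # 10 * 20 * 15 = 3000
```

In this example the clip uses 20 × 1,200 = 24,000 patches, well under the 131,072-patch budget, so no per-frame downscaling would be expected.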
Constraints
- Context window: 32K tokens (image + video + text + reasoning + answer all share the same budget).
- Output tokens: 8K maximum.
- Supported MIME types: `image/png`, `image/jpeg`, `image/webp`, `video/mp4`, `video/webm`.
Common Image Sizes (Isaac 0.x)
Token counts and costs for common image resolutions on the 0.x family. Pricing: $0.15 per million input tokens.

| Resolution | Dimensions | Tokens | Cost (Input) | Per 1K Images |
|---|---|---|---|---|
| 512×512 | 512×512 | 256 | $0.000038 | $0.04 |
| VGA | 640×480 | 300 | $0.000045 | $0.05 |
| HD (720p) | 1280×720 | 920 | $0.000138 | $0.14 |
| 1024×1024 | 1024×1024 | 1,024 | $0.000154 | $0.15 |
| Full HD (1080p) | 1920×1080 | 1,508* | $0.000226 | $0.23 |
| 2K | 2560×1440 | 1,508* | $0.000226 | $0.23 |
| 4K | 3840×2160 | 1,508* | $0.000226 | $0.23 |
| 8K | 7680×4320 | 1,508* | $0.000226 | $0.23 |
*Isaac 0.x automatically resizes images exceeding 6,144 patches to fit within this limit while maintaining aspect ratio. Due to the resize algorithm (dimensions must be divisible by 32), the practical maximum is 1,508 tokens (6,032 patches at 1664×928 for 16:9 aspect ratio).
Optimization Guidance
Recommended Resolutions
We recommend sending images at their original resolution. If the resolution exceeds the supported maximum, we recommend client-side preprocessing. Lower resolutions can reduce quality but may improve latency and reduce token counts.
Client-Side Preprocessing
You can resize images before sending them to reduce token usage and costs.
When to Resize:
- Below minimum (Isaac 0.x): If your images are smaller than 256 patches, resize them yourself to avoid automatic upscaling.
- Above maximum (Isaac 0.x): If your images exceed 6,144 patches, resize them yourself to maintain control over quality.
How to Resize:
- Resize to multiples of 32: When resizing, aim for dimensions divisible by 32 (e.g., 1280×736, 1024×1024, 1920×1088) to avoid additional processing overhead. Note that 1280×720 does not qualify, because 720 is not a multiple of 32.
- Maintain aspect ratio: Preserve original proportions to avoid distortion.
Benefits:
- Faster uploads: Pre-resized images reduce bandwidth usage.
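The checks above can be combined into a small pre-flight helper that decides whether an image needs client-side work before upload. This is a minimal sketch using only the Isaac 0.x bounds stated on this page; the function names and the returned labels are made up for illustration.

```python
import math

PATCH = 16
MIN_PATCHES = 256    # Isaac 0.x minimum
MAX_PATCHES = 6144   # Isaac 0.x maximum


def patch_count(width: int, height: int) -> int:
    """Number of 16x16 patches, rounding partial patches up."""
    return math.ceil(width / PATCH) * math.ceil(height / PATCH)


def resize_advice(width: int, height: int) -> str:
    """Return a hypothetical label describing the suggested client-side action."""
    n = patch_count(width, height)
    if n < MIN_PATCHES:
        return "upscale"          # below minimum: would be upscaled automatically
    if n > MAX_PATCHES:
        return "downscale"        # above maximum: would be downscaled automatically
    if width % 32 or height % 32:
        return "snap-to-32"       # in range, but not a multiple of 32
    return "ok"


print(resize_advice(640, 480))     # ok
print(resize_advice(1280, 720))    # snap-to-32
print(resize_advice(1920, 1080))   # downscale
print(resize_advice(200, 200))     # upscale
```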
Video preprocessing (Perceptron Mk1)
Practical implications of the video tokenization spec above:
- Sampler caps clip length to ~128 seconds: at `target_fps = 2` and `max_frames = 256`, anything longer is truncated.
- Frames are smart-resized at the original aspect ratio so the total patch count across all sampled frames fits within `max_patches_per_video = 131072`. The more frames sampled, the lower each frame’s effective resolution.
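To make the trade-off between clip length and per-frame resolution concrete, the sketch below computes the longest clip the sampler covers and a rough per-frame patch allowance. It assumes the budget is shared evenly across sampled frames and counted in 16×16 patches per frame, which this page does not state explicitly; the function names are hypothetical.

```python
TARGET_FPS = 2
MAX_FRAMES = 256
MAX_PATCHES_PER_VIDEO = 131072


def max_clip_seconds() -> float:
    """Longest duration covered before the frame cap is reached."""
    return MAX_FRAMES / TARGET_FPS


def patches_per_frame(frames: int) -> int:
    """Rough per-frame patch allowance, assuming an even split across frames."""
    return MAX_PATCHES_PER_VIDEO // frames


print(max_clip_seconds())        # 128.0
print(patches_per_frame(64))     # 2048
print(patches_per_frame(256))    # 512
```

Under these assumptions, a clip that hits the 256-frame cap leaves roughly 512 patches per frame, so trimming long videos before upload preserves more spatial detail per frame.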