Tokenization guide

Understand how Perceptron’s vision-language models count tokens for images and video so you can estimate costs and optimize preprocessing pipelines.

Isaac 0.1 and 0.2 Token Counting

Isaac 0.1, Isaac 0.2 1B, and Isaac 0.2 2B (Preview) all use the same patch-based image tokenizer:

Native resolution: Processes images at their original resolution; supports a wide range of aspect ratios.
Patch size: 16 × 16 pixels.
Spatial merge size: 2 × 2 (4 patches into a single token).
Token formula: ⌈width / 32⌉ × ⌈height / 32⌉.
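
The token formula can be checked with a few lines of Python. This is a minimal sketch: the function name `image_tokens` is illustrative rather than part of any Perceptron SDK, and it assumes the image is already within the patch bounds described under Constraints.

```python
import math

CELL_PX = 32  # 16 px patch x 2 (spatial merge) along each side


def image_tokens(width: int, height: int) -> int:
    """Estimate tokens for an image that needs no automatic resize."""
    return math.ceil(width / CELL_PX) * math.ceil(height / CELL_PX)


print(image_tokens(640, 480))   # 20 x 15 = 300
print(image_tokens(1280, 720))  # 40 x 23 = 920
```

Because each dimension is rounded up separately, a height of 720 occupies 23 cells, not 22.5.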

Constraints

Minimum: 256 patches → 64 tokens.
Maximum: 6,144 patches → 1,536 tokens.
Images outside these bounds are automatically resized while maintaining aspect ratio.
Due to the resize algorithm (dimensions must be divisible by 32), the practical maximum is typically around 1,508 tokens for common aspect ratios.
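
To see which side of these bounds an image falls on before sending it, a small check like the following may help. The names `patch_count` and `resize_action` are hypothetical helpers, and partial patches are assumed to round up.

```python
import math

PATCH_PX = 16
MIN_PATCHES = 256
MAX_PATCHES = 6144


def patch_count(width: int, height: int) -> int:
    """Number of 16x16 patches, counting partial patches as whole ones."""
    return math.ceil(width / PATCH_PX) * math.ceil(height / PATCH_PX)


def resize_action(width: int, height: int) -> str:
    """Return 'upscale', 'downscale', or 'none' for the Isaac 0.x bounds."""
    patches = patch_count(width, height)
    if patches < MIN_PATCHES:
        return "upscale"
    if patches > MAX_PATCHES:
        return "downscale"
    return "none"


print(resize_action(640, 480))    # none (1,200 patches)
print(resize_action(1920, 1080))  # downscale (8,160 patches)
```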

Calculation Examples

Example 1: 640×480 (VGA) — No Resize Needed

Round dimensions to nearest multiple of 32: 640×480 (already divisible).
Calculate patches: (640 ÷ 16) × (480 ÷ 16) = 40 × 30 = 1,200 patches.
Calculate tokens: 1,200 patches ÷ 4 = 300 tokens.
Check constraints: 256 ≤ 1,200 ≤ 6,144 ✓ (no resize needed).
Cost (Isaac 0.x at $0.15/M input): 300 × ($0.15 / 1,000,000) = $0.000045.
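
The cost step scales linearly with token count, so batch estimates are straightforward. A minimal sketch, assuming the $0.15 per million input-token price quoted above; `input_cost_usd` is an illustrative name.

```python
PRICE_PER_MILLION_USD = 0.15  # Isaac 0.x input price quoted in this guide


def input_cost_usd(tokens: int, images: int = 1,
                   price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Input cost in USD for `images` images of `tokens` tokens each."""
    return tokens * images * price_per_million / 1_000_000


print(f"{input_cost_usd(300):.6f}")               # 0.000045
print(f"{input_cost_usd(300, images=1000):.3f}")  # 0.045
```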

Example 2: 1920×1080 (Full HD) — Requires Resize

Calculate original patches: (1920 ÷ 16) × ⌈1080 ÷ 16⌉ = 120 × 68 = 8,160 patches (1080 ÷ 16 = 67.5, rounded up).
Check constraints: 8,160 > 6,144 (exceeds maximum, resize needed).
Resize to 1664×928 (maintains ~16:9 aspect ratio, divisible by 32).
Calculate new patches: (1664 ÷ 16) × (928 ÷ 16) = 104 × 58 = 6,032 patches.
Calculate tokens: 6,032 patches ÷ 4 = 1,508 tokens.
Cost (Isaac 0.x at $0.15/M input): 1,508 × ($0.15 / 1,000,000) = $0.000226.
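
The resize step in Example 2 can be approximated as follows. This guide does not specify the exact server-side algorithm, so the sketch assumes an aspect-preserving scale followed by rounding down to multiples of 32. The function name `fit_to_patch_budget` and the rounding choices are assumptions; under them, the sketch reproduces the 1664×928 result above.

```python
import math

CELL_PX = 32                 # dimensions snap to multiples of 32
MIN_PIXELS = 256 * 16 * 16   # 256 patches
MAX_PIXELS = 6144 * 16 * 16  # 6,144 patches


def fit_to_patch_budget(width: int, height: int) -> tuple[int, int]:
    """Hypothetical aspect-preserving resize into the Isaac 0.x patch bounds."""
    # Assumption: dimensions are first rounded up to the next multiple of 32,
    # matching the ceil() in the token formula.
    w = math.ceil(width / CELL_PX) * CELL_PX
    h = math.ceil(height / CELL_PX) * CELL_PX
    if w * h > MAX_PIXELS:
        # Shrink both sides by the same factor, rounding down to stay under the cap.
        scale = math.sqrt(w * h / MAX_PIXELS)
        w = max(CELL_PX, math.floor(w / scale / CELL_PX) * CELL_PX)
        h = max(CELL_PX, math.floor(h / scale / CELL_PX) * CELL_PX)
    elif w * h < MIN_PIXELS:
        # Grow both sides by the same factor, rounding up to reach the minimum.
        scale = math.sqrt(MIN_PIXELS / (w * h))
        w = math.ceil(w * scale / CELL_PX) * CELL_PX
        h = math.ceil(h * scale / CELL_PX) * CELL_PX
    return w, h


w, h = fit_to_patch_budget(1920, 1080)
print(w, h, (w // CELL_PX) * (h // CELL_PX))  # 1664 928 1508
```

Under these assumptions, 2560×1440, 3840×2160, and 7680×4320 also land on 1664×928, which is why every 16:9 input above the maximum costs the same 1,508 tokens.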

Perceptron Mk1 Token Counting

Image tokenization

Native resolution: Processes images at their original resolution; supports a wide range of aspect ratios.
Patch size: 16 × 16 pixels.
Spatial merge size: 2 × 2 (4 patches into a single token).
Token formula: ⌈width / 32⌉ × ⌈height / 32⌉.

Video tokenization

Dynamic resolution and frame rate: Samples video at target_fps = 2, with the sampled frame count clamped between min_frames = 2 and max_frames = 256. Frames are smart-resized at the original aspect ratio so that the total patch count across all sampled frames fits within the video patch budget, max_patches_per_video = 131072.

Patch size: 16 × 16 pixels.
Spatial merge size: 2 × 2 = 4 spatial patches per token.
Temporal patch size: 2 frames.
Effective token cell: 32 × 32 pixels across 2 frames.
Sampled frames: clamp(duration_seconds × 2, 2, 256), rounded to a multiple of 2.
Token formula: ceil(sampled_frames / 2) × ceil(width / 32) × ceil(height / 32).
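Max video tokens: max_patches_per_video = 131072 is a patch budget, not a token count. Dividing by the 4 spatial patches and 2 frames merged into each token gives 131072 ÷ 8 = 16,384, so the effective cap is about 16K video tokens per video.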
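
The video formula can be sketched the same way. The helper names below are illustrative. The spec does not say in which direction frame counts are rounded, so the sketch assumes odd counts round up to the next multiple of 2. The `width` and `height` arguments are the per-frame dimensions after any smart resize.

```python
import math

TARGET_FPS = 2
MIN_FRAMES = 2
MAX_FRAMES = 256
TEMPORAL_PATCH = 2  # frames merged into each token
CELL_PX = 32        # 16 px patch x 2 (spatial merge) along each side
MAX_PATCHES_PER_VIDEO = 131072


def sampled_frames(duration_seconds: float) -> int:
    """Frames sampled at 2 fps, clamped to [2, 256] and made even."""
    frames = round(duration_seconds * TARGET_FPS)
    frames = max(MIN_FRAMES, min(MAX_FRAMES, frames))
    # Assumption: odd counts are rounded up to the next multiple of 2.
    return frames + (frames % TEMPORAL_PATCH)


def video_tokens(duration_seconds: float, width: int, height: int) -> int:
    """Token estimate for frames of the given post-resize dimensions."""
    temporal_cells = math.ceil(sampled_frames(duration_seconds) / TEMPORAL_PATCH)
    return temporal_cells * math.ceil(width / CELL_PX) * math.ceil(height / CELL_PX)


# Patch budget divided by (4 spatial patches x 2 frames) per token.
TOKEN_CAP = MAX_PATCHES_PER_VIDEO // (4 * TEMPORAL_PATCH)

print(sampled_frames(10))         # 20
print(video_tokens(10, 640, 480)) # 10 x 20 x 15 = 3000
print(TOKEN_CAP)                  # 16384
```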

Constraints

Context window: 32K tokens (image + video + text + reasoning + answer all share the same budget).
Output tokens: 8K maximum.
Supported MIME types: image/png, image/jpeg, image/webp, video/mp4, video/webm.

Common Image Sizes (Isaac 0.x)

Token counts and costs for common image resolutions on the 0.x family. Pricing: $0.15 per million input tokens.

Resolution	Dimensions	Tokens	Cost (Input)	Per 1K Images
512×512	512×512	256	$0.000038	$0.04
VGA	640×480	300	$0.000045	$0.05
HD (720p)	1280×720	920	$0.000138	$0.14
1024×1024	1024×1024	1,024	$0.000154	$0.15
Full HD (1080p)	1920×1080	1,508*	$0.000226	$0.23
2K	2560×1440	1,508*	$0.000226	$0.23
4K	3840×2160	1,508*	$0.000226	$0.23
8K	7680×4320	1,508*	$0.000226	$0.23

*Isaac 0.x automatically resizes images exceeding 6,144 patches to fit within this limit while maintaining aspect ratio. Due to the resize algorithm (dimensions must be divisible by 32), the practical maximum is 1,508 tokens (6,032 patches at 1664×928 for 16:9 aspect ratio).

Optimization Guidance

Recommended Resolutions

We recommend sending images at their original resolution. If an image exceeds the maximum supported size, resize it client-side rather than relying on automatic resizing. Lower resolutions reduce token counts and may improve latency, but can degrade quality.

Client-Side Preprocessing

You can resize images before sending them to reduce token usage and costs.

When to Resize:

Below minimum (Isaac 0.x): If your images are smaller than 256 patches, resize them yourself to avoid automatic upscaling.
Above maximum (Isaac 0.x): If your images exceed 6,144 patches, resize them yourself to maintain control over quality.

Recommendations:

Resize to multiples of 32: When resizing, aim for dimensions divisible by 32 (e.g., 1024×1024, 1280×736, 1920×1088) to avoid additional processing overhead. Note that common sizes such as 1280×720 and 1920×1080 are not divisible by 32 in height.
Maintain aspect ratio: Preserve original proportions to avoid distortion.
Faster uploads: Pre-resized images reduce bandwidth usage.

Video preprocessing (Perceptron Mk1)

Practical implications of the Video tokenization spec above:

The sampler caps clip length at ~128 seconds: at target_fps = 2 and max_frames = 256, only 256 ÷ 2 = 128 seconds of video can be sampled, and anything longer is truncated.
Frames are smart-resized at the original aspect ratio so the total patch count across all sampled frames fits within max_patches_per_video = 131072. The more frames sampled, the lower each frame’s effective resolution.
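
The trade-off between clip length and per-frame resolution can be estimated with simple division. This is a rough sketch that assumes the patch budget is split evenly across sampled frames; `per_frame_patch_budget` and `max_clip_seconds` are hypothetical helper names.

```python
MAX_PATCHES_PER_VIDEO = 131072
TARGET_FPS = 2
MAX_FRAMES = 256


def max_clip_seconds() -> float:
    """Longest duration the sampler covers before hitting max_frames."""
    return MAX_FRAMES / TARGET_FPS


def per_frame_patch_budget(sampled_frames: int) -> int:
    """Patches available to each frame if the budget is shared evenly."""
    return MAX_PATCHES_PER_VIDEO // sampled_frames


print(max_clip_seconds())           # 128.0
print(per_frame_patch_budget(20))   # 6553
print(per_frame_patch_budget(256))  # 512
```

At the full 256 frames, each frame gets roughly 512 patches, which is about a 362×362 pixel square. Trimming clips before upload is one way to keep more detail per frame.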

For batch processing, consider pre-resizing all images to a consistent resolution to optimize both quality and cost at scale.
