Tokenization guide - Perceptron Docs

Understand how Perceptron Mk1 counts tokens for images and video so you can estimate costs and optimize preprocessing pipelines.

Image Token Counting

The gateway smart-resizes any image whose native patch count exceeds the 6,144-patch cap before tokenization, preserving aspect ratio.

Native resolution: Processes images at their original resolution; supports a wide range of aspect ratios.
Patch size: 16 × 16 pixels.
Spatial merge size: 2 × 2 (4 patches into a single token).
Token formula: ⌈width / 32⌉ × ⌈height / 32⌉.
Minimum: 256 patches → 64 tokens (smaller images are auto-upscaled).
Maximum: 6,144 patches → 1,536 tokens in theory. Because resized dimensions must be divisible by 32, the practical ceiling is ~1,508 tokens for 16:9 inputs.
Smart-resize is silent: no warning in the response. To keep deterministic control over input quality, pre-resize client-side before uploading.

The Common Image Sizes table below lists native patch counts. Any image above 6,144 patches is resized down to ~1,508 tokens before the model sees it.
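The image formula above can be sketched in a few lines of Python. This is a minimal estimate, not the gateway's implementation: the function names are hypothetical, and the sketch only computes native counts. It does not model the auto-upscaling of small images or the smart-resize of large ones.

```python
import math

PATCH = 16             # patch edge, in pixels
CELL = PATCH * 2       # 2 x 2 spatial merge -> one token covers 32 x 32 pixels
MAX_PATCHES = 6144     # patch cap before smart-resize kicks in


def native_patches(width: int, height: int) -> int:
    """Patch count of the image at its original resolution."""
    return math.ceil(width / PATCH) * math.ceil(height / PATCH)


def native_tokens(width: int, height: int) -> int:
    """Token formula: ceil(width / 32) * ceil(height / 32)."""
    return math.ceil(width / CELL) * math.ceil(height / CELL)


def exceeds_patch_cap(width: int, height: int) -> bool:
    """True if the image would be smart-resized before tokenization."""
    return native_patches(width, height) > MAX_PATCHES


print(native_tokens(640, 480))         # 300
print(exceeds_patch_cap(1920, 1080))   # True (8,160 native patches)
```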

Video Token Counting

Dynamic resolution and frame rate: video is sampled at target_fps = 2, with the sampled frame count clamped to min_frames = 2 and max_frames = 256. Frames are smart-resized at the original aspect ratio so that all sampled frames together fit within the total video patch budget max_patches_per_video = 131072.

Patch size: 16 × 16 pixels.
Spatial merge size: 2 × 2 = 4 spatial patches per token.
Temporal patch size: 2 frames.
Effective token cell: 32 × 32 pixels across 2 frames.
Sampled frames: clamp(duration_seconds × 2, 2, 256), rounded to a multiple of 2.
Token formula: ceil(sampled_frames / 2) × ceil(width / 32) × ceil(height / 32).
Max video tokens: the per-video budget is 131072 patches. At 8 patches per token (2 × 2 spatial × 2 temporal), that gives an effective cap of 16,384 (about 16K) video tokens per video.
Budget is shared across frames: the patch budget is split across all sampled frames, so more frames means lower per-frame resolution. Any clip whose native token count would exceed 16K is smart-resized (and, past 256 frames, frame-sampled) down to fit.
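The video rules above can be combined into a rough estimator. This is a sketch under stated assumptions: the function names are hypothetical, odd frame counts are assumed to round up to the next even number (the spec only says "rounded to a multiple of 2"), and the cap is taken as 131072 / 8 = 16384 tokens. The post-resize count for a capped clip may land slightly below that figure, so treat the capped value as an upper bound.

```python
import math

TARGET_FPS = 2
MIN_FRAMES, MAX_FRAMES = 2, 256
MAX_PATCHES_PER_VIDEO = 131072
PATCHES_PER_TOKEN = 2 * 2 * 2                            # 2 x 2 spatial, 2 temporal
TOKEN_CAP = MAX_PATCHES_PER_VIDEO // PATCHES_PER_TOKEN   # 16384


def sampled_frames(duration_seconds: float) -> int:
    """Frames sampled at 2 fps, clamped to [2, 256] and kept even."""
    frames = math.ceil(duration_seconds * TARGET_FPS)
    frames = max(MIN_FRAMES, min(frames, MAX_FRAMES))
    # Assumption: an odd count rounds up to the next even number.
    return min(frames + frames % 2, MAX_FRAMES)


def native_video_tokens(duration_seconds: float, width: int, height: int) -> int:
    """ceil(sampled_frames / 2) * ceil(width / 32) * ceil(height / 32)."""
    frames = sampled_frames(duration_seconds)
    return math.ceil(frames / 2) * math.ceil(width / 32) * math.ceil(height / 32)


def estimated_video_tokens(duration_seconds: float, width: int, height: int) -> int:
    """Native count, bounded above by the per-video token cap."""
    return min(native_video_tokens(duration_seconds, width, height), TOKEN_CAP)


print(estimated_video_tokens(5, 1280, 720))     # 4600
print(estimated_video_tokens(10, 1920, 1080))   # 16384 (native count is 20400)
```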

Constraints

Context window: 32K tokens (image + video + text + reasoning + answer all share the same budget).
Supported MIME types: image/png, image/jpeg, image/webp, video/mp4, video/webm.

Pricing

Pricing for Perceptron Mk1:

Input: $0.15 per million tokens ($0.15/MT)
Output: $1.50 per million tokens ($1.50/MT)
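As a quick illustration of the rates above, the following sketch converts token counts to dollars. The function name is hypothetical; only the two rates come from this page.

```python
INPUT_USD_PER_MTOK = 0.15    # input rate from this page
OUTPUT_USD_PER_MTOK = 1.50   # output rate from this page


def request_cost_usd(input_tokens: int, output_tokens: int = 0) -> float:
    """Cost in USD for a request with the given token counts."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


# A 512 x 512 image is 256 input tokens.
print(f"{request_cost_usd(256):.7f}")   # 0.0000384
```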

Common Image Sizes

Token counts and input costs for common image resolutions. Patch counts are native; for sizes above the patch cap, the token count shown is the post-resize count.

Resolution	Dimensions	Native Patches	Tokens billed	Cost (Input)	Per 1K Images
512×512	512×512	1,024	256	$0.0000384	$0.04
VGA	640×480	1,200	300	$0.000045	$0.05
HD (720p)	1280×720	3,600	920	$0.000138	$0.14
1024×1024	1024×1024	4,096	1,024	$0.0001536	$0.15
Full HD (1080p)	1920×1080	8,160	1,508*	$0.000226	$0.23
2K	2560×1440	14,400	1,508*	$0.000226	$0.23
4K	3840×2160	32,400	1,508*	$0.000226	$0.23
8K	7680×4320	129,600	1,508*	$0.000226	$0.23

*Anything at or above ~Full HD exceeds the 6,144-patch cap and is server-side smart-resized down to ~1,508 tokens before tokenization (16:9 aspect ratio). The native counts in the “Native Patches” column are what the image would produce at original resolution — you’re billed for the post-resize tokens. Pre-resize client-side if you want to control resize quality or use a specific target resolution.

Common Video Costs

At target_fps = 2, two frames are sampled per second of input video. Tokens per video are computed with the token formula above, then capped at ~16K per video.

Duration	Sampled Frames	Resolution	Tokens	Cost (Input)
5 s	10	720p (1280×720)	4,600	$0.000690
5 s	10	1080p (1920×1080)	10,200	$0.001530
10 s	20	720p	9,200	$0.001380
10 s	20	1080p	~16,000*	~$0.00240
30 s	60	720p	~16,000*	~$0.00240
30 s	60	1080p	~16,000*	~$0.00240
≥128 s	256 (capped)	any	~16,000*	~$0.00240
10 min	256 (capped)	any	~16,000*	~$0.00240

*Clips whose native token count would exceed the ~16K per-video cap are smart-resized at the original aspect ratio (and, past 256 frames, frame-sampled) down to fit. The cap holds regardless of source duration or resolution, so budget for at most ~16K tokens per video.

Optimization Guidance

Recommended Resolutions

We recommend passing in the image at its original resolution. If the resulting token count approaches Mk1’s context budget, we recommend client-side preprocessing. Lower resolution can erode quality but may improve latency and reduce token counts.
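For client-side preprocessing, one possible way to pick target dimensions is sketched below. It is an assumption-laden illustration, not the gateway's smart-resize algorithm: the helper name is hypothetical, and it simply scales the image to fit the 6,144-patch cap, then rounds each side down to a multiple of 32.

```python
import math

PATCH = 16
CELL = 32            # resized dimensions must be divisible by 32
MAX_PATCHES = 6144


def fit_within_patch_cap(width: int, height: int) -> tuple[int, int]:
    """Target size that keeps the aspect ratio (approximately) under the cap."""
    patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    if patches <= MAX_PATCHES:
        return width, height   # already under the cap; leave untouched
    # Uniform scale so the pixel area matches the patch budget.
    scale = math.sqrt(MAX_PATCHES * PATCH * PATCH / (width * height))
    # Round each side down to a multiple of 32, with a floor of one cell.
    # Note: very elongated images can hit that floor and still exceed the cap.
    new_w = max(CELL, math.floor(width * scale / CELL) * CELL)
    new_h = max(CELL, math.floor(height * scale / CELL) * CELL)
    return new_w, new_h


print(fit_within_patch_cap(1920, 1080))   # (1664, 928) -> 52 x 29 = 1508 tokens
```

The returned size can then be passed to any image library's resize call (for example Pillow's `Image.resize`) before upload, which lets you choose the resampling filter yourself.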

Video preprocessing

Practical implications of the Video Token Counting spec above:

The sampler caps clip length at ~128 seconds: at target_fps = 2 and max_frames = 256, anything longer is truncated.
Frames are smart-resized at the original aspect ratio so the total patch count across all sampled frames fits within max_patches_per_video = 131072. The more frames sampled, the lower each frame’s effective resolution.
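To make the frame-count versus resolution trade-off concrete, the sketch below divides the patch budget evenly across sampled frames and converts the result to an approximate frame size. The function names are hypothetical, the even split is an assumption, and the result ignores the rounding of dimensions to multiples of 32, so read it as an order-of-magnitude guide.

```python
import math

MAX_PATCHES_PER_VIDEO = 131072
PATCH = 16


def per_frame_patch_budget(sampled_frames: int) -> int:
    """Patches available to each frame if the budget is split evenly."""
    return MAX_PATCHES_PER_VIDEO // sampled_frames


def approx_frame_size(sampled_frames: int, aspect_w: int = 16, aspect_h: int = 9):
    """Approximate largest (width, height) per frame at the given aspect ratio."""
    pixels = per_frame_patch_budget(sampled_frames) * PATCH * PATCH
    height = math.sqrt(pixels * aspect_h / aspect_w)
    width = height * aspect_w / aspect_h
    return round(width), round(height)


print(per_frame_patch_budget(256))   # 512 patches per frame
print(approx_frame_size(256))        # (483, 272)
```

At the 256-frame cap, each 16:9 frame is left with roughly 483 × 272 pixels, so trimming a long clip client-side is one way to keep more detail per frame.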

For batch processing, consider pre-resizing all images to a consistent resolution to optimize both quality and cost at scale.
