Understand how Perceptron Mk1 counts tokens for images and video so you can estimate costs and optimize preprocessing pipelines.

Image Token Counting

The gateway smart-resizes any image whose native patch count exceeds the 6,144-patch cap before tokenization, preserving aspect ratio.
  • Native resolution: Processes images at their original resolution; supports a wide range of aspect ratios.
  • Patch size: 16 × 16 pixels.
  • Spatial merge size: 2 × 2 (4 patches into a single token).
  • Token formula: ⌈width / 32⌉ × ⌈height / 32⌉.
  • Minimum: 256 patches → 64 tokens (smaller images auto-upscaled).
  • Maximum: 6,144 patches → 1,536 tokens in theory (6,144 / 4). Because resized dimensions must be divisible by 32, the practical ceiling is ~1,508 tokens for 16:9 inputs.
  • Smart-resize is silent: no warning in the response. To keep deterministic control over input quality, pre-resize client-side before uploading.
The Common Image Sizes table below lists native patch counts. Any image above 6,144 patches is resized down before the model sees it, to ~1,508 tokens for 16:9 inputs.
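
The arithmetic above can be sketched in a few lines of Python. The patch and token formulas come straight from the spec. The `smart_resize` function is an assumption: the gateway's exact resize algorithm is not documented here, so this sketch uses a common scheme (snap to multiples of 32, scale by the square root of the area ratio, round down). It reproduces the ~1,508-token figure for 16:9 inputs but is not confirmed to match the gateway on every input. All function names are illustrative.

```python
import math

PATCH = 16          # patch edge in pixels
CELL = 32           # 2 x 2 merged patches: one token covers 32 x 32 pixels
MIN_PATCHES = 256   # smaller images are upscaled
MAX_PATCHES = 6144  # larger images are smart-resized


def native_patches(width: int, height: int) -> int:
    """Patch count at the original resolution."""
    return math.ceil(width / PATCH) * math.ceil(height / PATCH)


def native_tokens(width: int, height: int) -> int:
    """Token formula from the spec: ceil(w / 32) * ceil(h / 32)."""
    return math.ceil(width / CELL) * math.ceil(height / CELL)


def smart_resize(width: int, height: int) -> tuple[int, int]:
    """ASSUMED resize scheme, not the gateway's documented algorithm."""
    min_pixels = MIN_PATCHES * PATCH * PATCH
    max_pixels = MAX_PATCHES * PATCH * PATCH
    # Snap both sides to the nearest multiple of 32.
    w = max(CELL, round(width / CELL) * CELL)
    h = max(CELL, round(height / CELL) * CELL)
    if w * h > max_pixels:
        # Shrink, preserving aspect ratio, and round down to stay under the cap.
        scale = math.sqrt(width * height / max_pixels)
        w = max(CELL, math.floor(width / scale / CELL) * CELL)
        h = max(CELL, math.floor(height / scale / CELL) * CELL)
    elif w * h < min_pixels:
        # Grow, preserving aspect ratio, and round up to reach the minimum.
        scale = math.sqrt(min_pixels / (width * height))
        w = math.ceil(width * scale / CELL) * CELL
        h = math.ceil(height * scale / CELL) * CELL
    return w, h


def billed_tokens(width: int, height: int) -> int:
    """Tokens after the assumed smart-resize step."""
    w, h = smart_resize(width, height)
    return (w // CELL) * (h // CELL)
```

Under these assumptions, a 1920×1080 input resizes to 1664×928, which is 52 × 29 = 1,508 tokens. The same target can be used for client-side pre-resizing when deterministic control over quality matters.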

Video Token Counting

Dynamic resolution and frame rate: video is sampled at target_fps = 2, with the sampled frame count clamped to min_frames = 2 and max_frames = 256. Frames are smart-resized at the original aspect ratio so that the total patch count across all sampled frames fits within the video patch budget, max_patches_per_video = 131072.
  • Patch size: 16 × 16 pixels.
  • Spatial merge size: 2 × 2 = 4 spatial patches per token.
  • Temporal patch size: 2 frames.
  • Effective token cell: 32 × 32 pixels across 2 frames.
  • Sampled frames: clamp(duration_seconds × 2, 2, 256), rounded to a multiple of 2.
  • Token formula: ceil(sampled_frames / 2) × ceil(width / 32) × ceil(height / 32).
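  • Max video tokens: max_patches_per_video = 131072 is the per-video patch budget. Each token covers 4 spatial patches across 2 frames (8 patches), so the effective cap is about 16K tokens per video (131072 / 8 = 16,384).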
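
As a rough illustration, the sampling and token formulas above translate to the following Python sketch. The function names are illustrative. Rounding the frame count up to the next multiple of 2 is an assumption, since the spec only says "rounded to a multiple of 2". The results are native counts, before any smart-resize.

```python
import math

TARGET_FPS = 2
MIN_FRAMES = 2
MAX_FRAMES = 256
MAX_PATCHES_PER_VIDEO = 131072


def sampled_frames(duration_seconds: float) -> int:
    """clamp(duration * 2, 2, 256), then rounded to a multiple of 2."""
    frames = min(max(duration_seconds * TARGET_FPS, MIN_FRAMES), MAX_FRAMES)
    # ASSUMPTION: round up to the next even number, never past MAX_FRAMES.
    return min(MAX_FRAMES, 2 * math.ceil(frames / 2))


def native_video_tokens(duration_seconds: float, width: int, height: int) -> int:
    """ceil(sampled_frames / 2) * ceil(w / 32) * ceil(h / 32)."""
    frames = sampled_frames(duration_seconds)
    return math.ceil(frames / 2) * math.ceil(width / 32) * math.ceil(height / 32)


def native_video_patches(duration_seconds: float, width: int, height: int) -> int:
    """Total 16 x 16 patches across all sampled frames."""
    frames = sampled_frames(duration_seconds)
    return frames * math.ceil(width / 16) * math.ceil(height / 16)


def exceeds_patch_budget(duration_seconds: float, width: int, height: int) -> bool:
    """True when the native patch count would trigger smart-resize."""
    return native_video_patches(duration_seconds, width, height) > MAX_PATCHES_PER_VIDEO
```

For example, `native_video_tokens(5, 1280, 720)` evaluates to 5 × 40 × 23 = 4,600, and `exceeds_patch_budget(30, 1920, 1080)` is true.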

Constraints

  • Context window: 32K tokens (image + video + text + reasoning + answer all share the same budget).
  • Supported MIME types: image/png, image/jpeg, image/webp, video/mp4, video/webm.

Pricing

Pricing for Perceptron Mk1:
  • Input: $0.15 per million tokens ($0.15/MT)
  • Output: $1.50 per million tokens ($1.50/MT)
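
A minimal cost estimator based on the rates above might look like this. The function name is illustrative, and `Decimal` is used only to avoid floating-point noise at these small dollar amounts.

```python
from decimal import Decimal

INPUT_USD_PER_MILLION = Decimal("0.15")
OUTPUT_USD_PER_MILLION = Decimal("1.50")
MILLION = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int = 0) -> Decimal:
    """Estimated cost in USD for one request at the listed Mk1 rates."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / MILLION
```

A 256-token image (512×512) costs `request_cost(256)`, which equals $0.0000384 of input.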

Common Image Sizes

Token counts and costs for common image resolutions at native resolution.
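| Resolution | Dimensions | Native Patches | Tokens billed | Cost (Input) | Per 1K Images |
| --- | --- | --- | --- | --- | --- |
| 512×512 | 512×512 | 1,024 | 256 | $0.0000384 | $0.04 |
| VGA | 640×480 | 1,200 | 300 | $0.000045 | $0.05 |
| HD (720p) | 1280×720 | 3,600 | 900 | $0.000135 | $0.14 |
| 1024×1024 | 1024×1024 | 4,096 | 1,024 | $0.0001536 | $0.15 |
| Full HD (1080p) | 1920×1080 | 8,160 | 1,508* | $0.000226 | $0.23 |
| 2K | 2560×1440 | 14,400 | 1,508* | $0.000226 | $0.23 |
| 4K | 3840×2160 | 32,400 | 1,508* | $0.000226 | $0.23 |
| 8K | 7680×4320 | 129,600 | 1,508* | $0.000226 | $0.23 |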
*Anything at or above ~Full HD exceeds the 6,144-patch cap and is server-side smart-resized down to ~1,508 tokens before tokenization (16:9 aspect ratio). The native counts in the “Native Patches” column are what the image would produce at original resolution — you’re billed for the post-resize tokens. Pre-resize client-side if you want to control resize quality or use a specific target resolution.

Common Video Costs

At target_fps = 2, two frames are sampled per second of input video. Tokens per video are computed with the video token formula above.
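| Duration | Sampled Frames | Resolution | Tokens | Cost (Input) |
| --- | --- | --- | --- | --- |
| 5 s | 10 | 720p (1280×720) | 4,600 | $0.000690 |
| 5 s | 10 | 1080p (1920×1080) | 10,200 | $0.001530 |
| 10 s | 20 | 720p | 9,200 | $0.001380 |
| 10 s | 20 | 1080p | 20,400 | $0.003060 |
| 30 s | 60 | 720p | 27,600 | $0.004140 |
| ≥128 s* | 256 (capped) | 720p smart-resized | ~16,000 | ~$0.00240 |
| 10 min* | 256 (capped) | smart-resized to fit budget | ~16,000 | ~$0.00240 |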
*Beyond ~128 seconds, the sampler caps at 256 frames. Frames are smart-resized at the original aspect ratio so the total patch count fits within max_patches_per_video = 131072, which makes the effective token cap ~16K per video regardless of source duration or resolution. Token counts in the uncapped rows are native: 30 s at 1080p (61,200 tokens at native) exceeds the patch budget and triggers smart-resize in practice, and the same applies to any row whose native count is above ~16K tokens (10 s at 1080p, 30 s at 720p).

Optimization Guidance

We recommend sending images at their original resolution. If the resulting token count approaches Mk1’s context budget, preprocess client-side instead. Lower resolution can erode quality but may improve latency and reduce token counts.

Video preprocessing

Practical implications of the Video Token Counting spec above:
  • The sampler caps clip length at ~128 seconds: at target_fps = 2 and max_frames = 256 (256 / 2 = 128), anything longer is truncated.
  • Frames are smart-resized at the original aspect ratio so the total patch count across all sampled frames fits within max_patches_per_video = 131072. The more frames sampled, the lower each frame’s effective resolution.
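
The trade-off between frame count and per-frame resolution can be made concrete with a small sketch. It assumes the patch budget is split evenly across sampled frames, which the spec implies but does not state outright. The function names are illustrative.

```python
MAX_PATCHES_PER_VIDEO = 131072


def per_frame_patch_budget(frames: int) -> int:
    """Patches available to each frame, ASSUMING an even split of the budget."""
    return MAX_PATCHES_PER_VIDEO // frames


def approx_token_ceiling(frames: int) -> int:
    """Approximate token cap: frame pairs times merged (2 x 2) patches per frame."""
    return (frames // 2) * (per_frame_patch_budget(frames) // 4)
```

At the 256-frame cap, each frame gets 512 patches (128 merged tokens per frame pair), for a ceiling of 128 × 128 = 16,384 tokens. At 20 frames, each frame gets 6,553 patches, so short clips keep much more spatial detail per frame.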
For batch processing, consider pre-resizing all images to a consistent resolution to optimize both quality and cost at scale.