Scale performance and cost

Optimize Perceptron Mk1 for your SLA—whether you need low latency or high throughput.

Rate limits

Endpoint              Limit
Chat completions      300 requests/min
Models                30 requests/min
Media upload URL      150 requests/min
Media download URL    150 requests/min
Payload limits: 20 MB per request body, 20 GB media upload per 48 hours.
Need higher limits? Contact support@perceptron.inc.
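
To stay under a per-minute limit proactively rather than reacting to 429 responses, calls can be paced on the client. The sketch below is one way to do that and is not part of the Perceptron SDK: Pacer is a hypothetical helper that schedules calls at least 60 / per_minute seconds apart and can be shared across threads.

```python
import threading
import time


class Pacer:
    """Hypothetical client-side pacer; spaces calls evenly over each minute."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute  # e.g. 300/min -> 0.2 s between calls
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        """Block until this caller's reserved slot arrives."""
        # Reserve the next free slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
```

Calling wait() on a shared Pacer(300) before each chat completion request would space requests about 0.2 s apart. Choosing a budget somewhat below the published limit leaves headroom for any other processes drawing on the same limit.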

Low latency pipelines

Use tight timeouts and config() scopes so interactive paths fail fast, stream minimal tokens, and never block your UI.
from perceptron import config, detect

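def low_latency_detect(image_path):
    # Scope the timeout and token cap to this block, so a slow
    # request fails fast instead of blocking the caller.
    with config(timeout=8, max_tokens=256):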
        return detect(
            image=image_path,
            classes=["scratch"],
            expects="box",
        )

result = low_latency_detect("part.jpg")

Parallel inference lanes

For bulk jobs, fan out API calls with a small worker pool and a shared runner so each task just swaps in the frame path.
import concurrent.futures
from functools import partial
from perceptron import detect

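def process_images(images, classes):
    # Bind the shared arguments once; each task supplies only the image path.
    runner = partial(detect, classes=classes, expects="box")
    # A small pool keeps concurrency modest; map() returns results in input order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        return list(executor.map(runner, images))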
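
One caveat with executor.map: if any task raises, the exception is re-raised when its result is reached, and the results after it are lost. For bulk jobs it can be more convenient to record failures per item and keep going. The sketch below shows that pattern with the standard library only; map_collecting_errors is a hypothetical helper name, and fn stands in for whatever callable you fan out (for example, a partial over detect).

```python
import concurrent.futures


def map_collecting_errors(fn, items, max_workers=4):
    """Run fn over items in a thread pool.

    Returns (results, errors): two dicts keyed by the item's input index.
    """
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which input index each future belongs to.
        futures = {executor.submit(fn, item): i for i, item in enumerate(items)}
        for future in concurrent.futures.as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as exc:  # keep going; record the failure
                errors[i] = exc
    return results, errors
```

The indices in errors identify which inputs to retry, so a single bad frame does not discard the rest of the batch.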

Throughput guardrails

Handle RateLimitError (429) by backing off and retrying. Use the Retry-After response header to determine when to retry; it gives the delay in seconds (e.g., Retry-After: 120). Add jitter so that clients do not all retry at the same instant (the thundering-herd problem). Keep max_tokens tight so requests finish quickly. Resize images client-side to stay under the 20 MB request limit.
For the tightest latency budgets, combine tight max_tokens settings, aggressive image resizing, and on-prem endpoints; together these target a p99 latency under 120 ms while minimizing bandwidth usage.
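
The backoff guidance can be sketched as a small retry wrapper. This is a minimal illustration under stated assumptions, not the SDK's own retry logic: RateLimited is a hypothetical stand-in exception carrying the Retry-After value in seconds, since how the SDK's RateLimitError exposes that header is not shown here. When the server supplies a delay, the wrapper honors it and adds up to 10% jitter; otherwise it falls back to capped exponential backoff with full jitter.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a 429 error (not the SDK's RateLimitError)."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds from Retry-After, if known


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        # Honor the server's delay, plus up to 10% jitter so that
        # clients released at the same moment do not retry in lockstep.
        return retry_after + rng() * 0.1 * retry_after
    # No server hint: exponential backoff with full jitter, capped.
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retry(fn, *args, max_attempts=5, sleep=time.sleep, **kwargs):
    """Call fn, retrying on RateLimited up to max_attempts times in total."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            sleep(backoff_delay(attempt, err.retry_after))
```

To adapt this, catch the SDK's RateLimitError in place of RateLimited and read the Retry-After value from wherever the SDK exposes it.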