Scale performance and cost

Optimize Isaac for your SLA, whether you need sub-100 ms latency or high throughput.

Rate limits

Endpoint              Limit
Chat completions      300 requests/min
Models                30 requests/min
Media upload URL      150 requests/min
Media download URL    150 requests/min
Payload limits: 20 MB per request body, 20 GB media upload per 48 hours.
Need higher limits? Contact [email protected].
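To stay under a per-minute limit on the client side, you can space requests out with a simple throttle. The sketch below is illustrative, not part of the SDK; MinuteThrottle is a hypothetical helper that enforces a minimum gap between request starts.

```python
import threading
import time

class MinuteThrottle:
    """Block callers so at most `limit` requests start per 60-second window."""

    def __init__(self, limit):
        self.interval = 60.0 / limit  # minimum spacing between request starts
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self):
        # Reserve the next start slot under the lock, then sleep outside it
        with self.lock:
            now = time.monotonic()
            wait = self.next_slot - now
            self.next_slot = max(self.next_slot, now) + self.interval
        if wait > 0:
            time.sleep(wait)
```

Call `throttle.acquire()` before each chat-completions request with `MinuteThrottle(300)` to match the limit above; the throttle is thread-safe, so it also works with the worker-pool pattern shown later on this page.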

Low latency pipelines

Use tight timeouts and config() scopes so interactive paths fail fast, stream minimal tokens, and never block your UI.
from perceptron import config, detect

def low_latency_detect(image_path):
    # Scope a tight timeout and a small token budget to this call only,
    # so slow requests fail fast instead of blocking the UI
    with config(timeout=8, max_tokens=256):
        return detect(
            image=image_path,
            classes=["scratch"],
            expects="box",
        )

result = low_latency_detect("part.jpg")

Parallel inference lanes

For bulk jobs, fan out API calls with a small worker pool and a shared runner so each task just swaps in the frame path.
import concurrent.futures
from functools import partial
from perceptron import detect

def process_images(images, classes):
    # Bind the shared arguments once; each worker only supplies the image
    runner = partial(detect, classes=classes, expects="box")
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        # map preserves input order in the returned results
        return list(executor.map(runner, images))

Throughput guardrails

Handle RateLimitError (429) by backing off and retrying:

- Use the Retry-After response header to determine when to retry; it returns a delay in seconds (e.g., Retry-After: 120).
- Add jitter to retry delays to avoid a thundering herd.
- Keep max_tokens tight so requests finish quickly.
- Resize images client-side to stay under the 20 MB request limit.
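The backoff-and-retry guardrail can be sketched as a generic wrapper. This is illustrative, not SDK code: pass the SDK's RateLimitError as `retry_on`, and the `retry_after` attribute read below is an assumption about how the exception might surface the Retry-After header.

```python
import random
import time

def call_with_backoff(fn, *args, retry_on=Exception, max_retries=5,
                      base_delay=1.0, max_delay=60.0, **kwargs):
    """Retry fn on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except retry_on as err:
            # Prefer the server's Retry-After delay when the exception
            # exposes it (attribute name is assumed here); otherwise
            # back off exponentially, capped at max_delay
            delay = getattr(err, "retry_after", None)
            if delay is None:
                delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids herds
    return fn(*args, **kwargs)  # final attempt; errors propagate to caller
```

For example, `call_with_backoff(detect, image="part.jpg", classes=["scratch"], retry_on=RateLimitError)` retries a detect call up to five times before giving up.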
Combine tight max_tokens settings, aggressive image resizing, and on-prem endpoints to keep p99 latency under 120 ms while minimizing bandwidth usage.
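Client-side resizing can be sketched with Pillow. The helper name and thresholds below are illustrative choices, not SDK API; adjust the size cap and JPEG quality to your accuracy needs.

```python
from io import BytesIO

from PIL import Image

def shrink_for_upload(path, max_bytes=20 * 1024 * 1024, max_side=2048):
    """Downscale and re-encode an image so it fits the 20 MB request limit."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # in-place resize, keeps aspect ratio
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    if buf.tell() > max_bytes:
        raise ValueError("image still exceeds the request size limit")
    return buf.getvalue()
```

Re-encoding to JPEG at quality 85 typically cuts payloads by an order of magnitude versus raw PNG frames, which also reduces upload latency on the interactive path.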