When you need answers grounded in when — not just what — pass expects="clip" and Perceptron Mk1 will return one or more Clip objects with start/end timestamps citing the moments that justify the answer. Use it for sports highlights, robot-task success/failure labeling, compliance event detection, and any workflow that turns long video into structured temporal signal.

Basic usage

from perceptron import question, video

result = question(
    video(video_path),         # str: Local path or URL to MP4 or WebM
    "Clip when the event happens.",  # str: Natural-language question
    reasoning=True,            # bool: enable reasoning
    expects="clip",            # str: parse <clip> tags into structured Clip objects
)

print(result.text)             # Natural-language answer with inline <clip> tags
for clip in result.clips or []:
    print(clip.timestamp.at, clip.timestamp.until, clip.mention)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| media_obj | VideoNode | - | Wrap your MP4 or WebM (URL or local file path) with video() |
| question_text | str | - | Prompt describing what to clip |
| reasoning | bool | False | Set True to let the model think through the video before localizing |
| expects | str | "text" | Set "clip" to parse <clip> tags emitted by the model into Clip objects |
Returns: a PerceiveResult object with the following fields:
  • text (str): Natural-language answer with inline <clip> tags as the model emitted them.
  • reasoning (str | None): Chain-of-thought when reasoning=True.
  • clips (list[Clip] | None): Parsed temporal segments. Each Clip has:
    • timestamp.at (float): start in seconds.
    • timestamp.until (float | None): end in seconds, or None for a single moment.
    • mention (str | None): optional label the model attached.
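
The fields listed above can be flattened into plain records for logging or export. The following is a minimal sketch, not part of the Perceptron SDK: clips_to_records is a hypothetical helper, and the SimpleNamespace objects are stand-ins that only mimic the timestamp.at, timestamp.until, and mention attributes so the snippet runs without an API call.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Optional


def clips_to_records(clips: Optional[Iterable[Any]]) -> List[Dict[str, Any]]:
    """Flatten Clip-like objects into dictionaries (hypothetical helper)."""
    records = []
    # result.clips may be None, so fall back to an empty list.
    for clip in clips or []:
        at = clip.timestamp.at
        until = clip.timestamp.until
        records.append(
            {
                "start_s": at,
                "end_s": until,  # None means a single moment
                "duration_s": None if until is None else until - at,
                "mention": clip.mention,
            }
        )
    return records


# Stand-in objects for illustration only; real code would pass result.clips.
fake_clips = [
    SimpleNamespace(
        timestamp=SimpleNamespace(at=3.0, until=5.5),
        mention="drive to the basket",
    ),
    SimpleNamespace(timestamp=SimpleNamespace(at=3.25, until=None), mention=None),
]
records = clips_to_records(fake_clips)
for record in records:
    print(record)
```

In real use, replace fake_clips with result.clips from a question(...) call made with expects="clip".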

Example: Find the shot

In this example we download a short basketball clip, ask Perceptron Mk1 to clip the moment the ball passes through the hoop, and inspect the returned timestamps.
from pathlib import Path
from urllib.request import urlretrieve

from perceptron import configure, question, video

configure(
    provider="perceptron",
    model="perceptron-mk1",
    api_key="YOUR_API_KEY",
)

# Download reference video
VIDEO_URL = "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/main/cookbook/_shared/assets/capabilities/video-clipping/mj_shot_short.mp4"
VIDEO_PATH = Path("mj_shot_short.mp4")

if not VIDEO_PATH.exists():
    urlretrieve(VIDEO_URL, VIDEO_PATH)

# Ask the model to clip the moment
result = question(
    video(str(VIDEO_PATH)),
    "Clip the exact moment the ball passes through the hoop.",
    reasoning=True,
    expects="clip",
)

print(result.text)

clips = result.clips or []
for idx, clip in enumerate(clips, start=1):
    ts = clip.timestamp
    window = f"{ts.at:.2f}s" if ts.until is None else f"{ts.at:.2f}s - {ts.until:.2f}s"
    label = clip.mention or "(no mention)"
    print(f"Clip {idx}: {window} - {label}")

Output format

The model emits self-closing <clip /> tags inline in the response. mention is an attribute (not body text), and timestamps are whitespace-separated with the literal unit seconds:
<clip mention="ball through hoop" t="3.2 seconds" />              <!-- single moment -->
<clip mention="drive to the basket" t="3.2 seconds 5.1 seconds" /> <!-- range -->
Multiple clips that share an event are typically wrapped in a <collection>, and child clips inherit the collection’s mention when their own is omitted:
<collection mention="ramp trick">
  <clip t="7.6 seconds 9.7 seconds" />
</collection>
Passing expects="clip" parses these tags into Clip objects exposed on result.clips, so you can iterate timestamps directly instead of parsing the tag text yourself. The full text — including any prose around the tags — remains available on result.text.
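
To make the tag format concrete, here is a rough sketch of how such tags could be read by hand with regular expressions. It is not the SDK's parser, and it assumes only the grammar shown above (self-closing <clip /> tags, a t attribute with one or two "N seconds" values, and an optional enclosing <collection>). ParsedClip and parse_clip_tags are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParsedClip:
    """Hypothetical stand-in for a parsed clip; not the SDK's Clip class."""
    at: float
    until: Optional[float]
    mention: Optional[str]


_COLLECTION_RE = re.compile(r"<collection\b([^>]*)>(.*?)</collection>", re.DOTALL)
_CLIP_RE = re.compile(r"<clip\b([^>]*?)/>")
_ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
_SECONDS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*seconds")


def _clips_in(fragment: str, inherited: Optional[str] = None) -> List[ParsedClip]:
    """Parse every <clip /> tag in a fragment, inheriting a mention if needed."""
    clips = []
    for match in _CLIP_RE.finditer(fragment):
        attrs = dict(_ATTR_RE.findall(match.group(1)))
        times = [float(v) for v in _SECONDS_RE.findall(attrs.get("t", ""))]
        if not times:
            continue  # skip tags without a usable timestamp
        until = times[1] if len(times) > 1 else None
        clips.append(ParsedClip(times[0], until, attrs.get("mention") or inherited))
    return clips


def parse_clip_tags(text: str) -> List[ParsedClip]:
    """Return clips in document order, applying collection mentions to children."""
    clips: List[ParsedClip] = []
    pos = 0
    for match in _COLLECTION_RE.finditer(text):
        clips += _clips_in(text[pos:match.start()])  # clips before the collection
        coll_attrs = dict(_ATTR_RE.findall(match.group(1)))
        clips += _clips_in(match.group(2), coll_attrs.get("mention"))
        pos = match.end()
    clips += _clips_in(text[pos:])  # clips after the last collection
    return clips


sample = (
    'The shot drops <clip mention="ball through hoop" t="3.2 seconds" /> and later '
    '<collection mention="ramp trick">\n  <clip t="7.6 seconds 9.7 seconds" />\n</collection>.'
)
parsed = parse_clip_tags(sample)
for clip in parsed:
    print(clip)
```

In practice result.clips already provides this structure; the sketch only illustrates what the tags encode, and a regex approach like this would not cover nested collections or unusual attribute quoting.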

Best practices

  • Be specific about the event: “Clip the moment the ball passes through the hoop” works better than “find interesting moments.” Tight, observable predicates produce tight clips.
  • A single moment vs. a range: When clip.timestamp.until is None, the model is pointing at a single instant rather than a span. Both are valid; treat the moment case as “approximate point in time” rather than “zero-length range.”
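
One way to act on the moment-versus-range distinction is to widen a single moment into a small window before cutting or reviewing footage. The sketch below is an assumption-laden illustration: clip_window and the 0.5-second default padding are hypothetical choices, not SDK behavior, and the right padding depends on your footage.

```python
from typing import Optional, Tuple


def clip_window(at: float, until: Optional[float], pad: float = 0.5) -> Tuple[float, float]:
    """Return a (start, end) window in seconds (hypothetical helper).

    Ranges are returned unchanged. A single moment (until is None) is treated
    as an approximate point in time and padded on both sides, with the start
    clamped so it never falls before 0.
    """
    if until is not None:
        return (at, until)
    return (max(0.0, at - pad), at + pad)


# A range passes through untouched; a moment becomes a padded window.
print(clip_window(3.0, 5.5))
print(clip_window(3.25, None))
print(clip_window(0.25, None))
```

With the SDK, the arguments would come from clip.timestamp.at and clip.timestamp.until for each entry in result.clips.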
Run through the full Jupyter notebook here. Reach out to Perceptron support if you have questions.