If we translate that idea into the LLM world, the closest existing analogues are “LLM observability” and “prompt tracing.” In production, the unit of work is no longer a video frame but an LLM interaction span: a single model call, a chain step, or an agent action. Modern platforms already treat each of these as a structured event with rich attributes. LaunchDarkly, for example, records each LLM call as a span with model name, prompt and response content, token usage, request duration, and provider metadata, and exposes them in a traces view specifically marked as “LLM spans.” Elastic does something similar: it ingests metrics and logs from LLM APIs and uses OpenTelemetry-based APM tracing to capture the model used, request duration, errors, token consumption, and the relationship between prompts and responses. Open-source SDKs like genai telemetry push this further by auto-instrumenting LLM calls and exporting traces, token usage, latency, errors, and cost to arbitrary backends (Splunk, Elasticsearch, Datadog, Prometheus, etc.).
These systems turn each model interaction into a high-dimensional event that can be sliced, traced, and correlated. The “commentary” in this context is both the raw prompt and completion and a structured envelope around them: model id, temperature, system prompt, user segment, application feature, tool calls, safety filters triggered, evaluation scores, and so on. The LLM span is the observability primitive, and the prompt/response pair is just one field inside it.
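To make the “wide event” idea concrete, here is a minimal sketch of an LLM span as a single structured record. The class name LLMSpan, its fields, and the "attr." prefix are our own illustrative assumptions, not the schema of LaunchDarkly, Elastic, or any SDK mentioned above.

```python
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class LLMSpan:
    """One model interaction as a wide, structured event (hypothetical schema)."""
    span_id: str
    model: str
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    duration_ms: float
    temperature: float = 0.0
    # Free-form, high-cardinality dimensions: feature, user segment, and so on.
    attributes: dict[str, Any] = field(default_factory=dict)

    def to_event(self) -> dict[str, Any]:
        """Flatten the span into one flat dict that an event backend could ingest."""
        event = asdict(self)
        extra = event.pop("attributes")
        event.update({f"attr.{key}": value for key, value in extra.items()})
        event["total_tokens"] = self.input_tokens + self.output_tokens
        return event


span = LLMSpan(
    span_id="s-001",
    model="example-model",  # placeholder model name
    prompt="Summarize the ticket.",
    response="The customer reports a login failure.",
    input_tokens=12,
    output_tokens=9,
    duration_ms=840.0,
    attributes={"feature": "ticket_summary", "user_segment": "enterprise"},
)
event = span.to_event()
```

The prompt and response are ordinary fields next to everything else, which is the point: the pair is one column in the event, not the event itself.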
If commentary is a semantic compression layer (“Raw video → High entropy, low accessibility; Commentary → Lower entropy, high semantic interpretability”), the LLM world presents an interesting inversion. The model’s output is already natural language, but it is still too unstructured to drive reliable analytics or agentic control at scale. So the industry is converging on a second layer of “commentary on the commentary”: annotations and custom metrics attached to each LLM span. These include things like:
• quality and correctness scores from automatic evaluators or human labels
• safety and policy scores (toxicity, PII, jailbreak likelihood, etc.)
• hallucination or grounding scores for RAG flows
• reasoning-step metadata for agents (which tools were called, what state changed, which branch was taken)
• user-level and session-level context (tenant, feature flag, experiment bucket, business outcome)
In practice, these annotations are implemented as span attributes and child events in tracing systems. OpenTelemetry semantic conventions for AI/LLM spans (and vendor-specific extensions) define standard attributes for model name, input/output token counts, latency, error type, and sometimes prompt/response hashes. On top of that, teams add arbitrary, high-cardinality dimensions such as feature name, experiment id, user cohort, and guardrail outcome, very much in the spirit of letting users add arbitrary new dimensions without redesigning the system.
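The attribute-and-event pattern can be sketched with plain dictionaries standing in for a tracing SDK. The gen_ai.* keys below follow the OpenTelemetry generative AI semantic conventions as we understand them and should be checked against the current specification; every app.*, guardrail.*, and eval.* key, along with the helper names new_span and add_event, is a hypothetical team-specific extension.

```python
import time
from typing import Any


def new_span(name: str, attributes: dict[str, Any]) -> dict[str, Any]:
    """Create a minimal in-memory span record (a stand-in for a tracing SDK span)."""
    return {"name": name, "attributes": dict(attributes), "events": []}


def add_event(span: dict[str, Any], name: str, attributes: dict[str, Any]) -> None:
    """Attach a timestamped child event, e.g. a guardrail or evaluator result."""
    span["events"].append(
        {"name": name, "time_ns": time.time_ns(), "attributes": dict(attributes)}
    )


span = new_span(
    "chat example-model",
    {
        # Keys assumed to match the OpenTelemetry GenAI conventions.
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 412,
        "gen_ai.usage.output_tokens": 96,
        # Hypothetical high-cardinality dimensions added by the team.
        "app.feature": "support_chat",
        "app.experiment_id": "exp-42",
        "app.user_cohort": "beta",
    },
)
add_event(span, "guardrail.check", {"guardrail.name": "pii_filter", "guardrail.outcome": "pass"})
add_event(span, "eval.score", {"eval.metric": "grounding", "eval.value": 0.87})

# Promote each evaluator score to a span attribute so it can be filtered on directly.
for ev in span["events"]:
    if ev["name"] == "eval.score":
        metric = ev["attributes"]["eval.metric"]
        span["attributes"][f"eval.{metric}"] = ev["attributes"]["eval.value"]
```

Keeping scores as events preserves when each evaluator ran, while copying them onto the span makes queries such as “all spans with eval.grounding below 0.5” cheap.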
Work under the LLMOps / GenAIOps umbrella focuses on telemetry and evaluation pipelines for LLM applications: logging every prompt/response pair, attaching automatic evaluation scores (helpfulness, factuality, safety), and using those logs as a substrate for debugging and continuous improvement. Other papers on “LLM traces” and “agent trajectories” treat multi-step agent runs as traces, where each step is a structured event with fields for the thought, the tool call, the observation, and the next action. Those trajectories are then mined for failure patterns, cost hotspots, and behavioral anomalies. This mirrors the video pipeline, where each stage (ingest, decode, inference, commentary generation, alerting) becomes a span in a trace.
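As a rough illustration of mining a trajectory, the sketch below models each agent step as a structured record and looks for two simple signals: repeated identical tool calls (a loop indicator) and the tool that accounts for the most cost. The AgentStep fields, the tool names, and the cost figures are all invented for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentStep:
    """One step of an agent run (hypothetical schema)."""
    thought: str
    tool: str
    tool_input: str
    observation: str
    cost_usd: float


def repeated_calls(steps: list[AgentStep]) -> dict[tuple[str, str], int]:
    """Return (tool, input) pairs issued more than once, a simple loop signal."""
    counts = Counter((s.tool, s.tool_input) for s in steps)
    return {call: n for call, n in counts.items() if n > 1}


def cost_by_tool(steps: list[AgentStep]) -> dict[str, float]:
    """Sum the cost of all steps per tool to expose cost hotspots."""
    totals: dict[str, float] = {}
    for s in steps:
        totals[s.tool] = totals.get(s.tool, 0.0) + s.cost_usd
    return totals


trajectory = [
    AgentStep("Need order status", "search_orders", "id=17", "timeout", 0.002),
    AgentStep("Retry the lookup", "search_orders", "id=17", "timeout", 0.002),
    AgentStep("Ask a human", "escalate", "order 17 lookup failing", "ticket opened", 0.001),
]
loops = repeated_calls(trajectory)
costs = cost_by_tool(trajectory)
hotspot = max(costs, key=costs.get)  # tool with the highest total cost
```

Real trajectory mining would use richer signals than exact-match repeats, but even this catches the common case of an agent retrying the same failing call.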
There is also a growing body of work on automatic LLM evaluation frameworks that effectively define a vocabulary of custom metrics intrinsic to LLM behavior: coherence, consistency with retrieved documents, instruction adherence, style similarity, and so on. These frameworks often emit per-interaction scores that can be logged alongside the raw prompts and completions. When those scores are treated as first-class metrics in an observability backend, we get the same kind of semantic analytics that we envision for drone video: “anomaly density per time window” becomes “hallucination density per feature per release,” “behavior transition rates” become “tool usage transition rates across agent steps,” and “path reconstruction statistics” become “agent trajectory statistics” (how often agents loop, backtrack, or escalate to humans).
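Two of these derived metrics are simple enough to sketch directly. The example below assumes each logged event carries a feature, a release, and a grounding score in [0, 1], and that a score below 0.5 counts as a hallucination; the field names, the threshold, and the sample data are our assumptions, not the output format of any particular evaluation framework.

```python
from collections import Counter, defaultdict


def hallucination_rate(events: list[dict], threshold: float = 0.5) -> dict:
    """Fraction of spans per (feature, release) whose grounding score is below threshold."""
    totals, flagged = Counter(), Counter()
    for e in events:
        key = (e["feature"], e["release"])
        totals[key] += 1
        if e["grounding"] < threshold:
            flagged[key] += 1
    return {key: flagged[key] / totals[key] for key in totals}


def transition_rates(tool_sequence: list[str]) -> dict[str, dict[str, float]]:
    """Empirical probability of moving from one tool to the next across agent steps."""
    pair_counts = Counter(zip(tool_sequence, tool_sequence[1:]))
    out_counts = Counter(tool_sequence[:-1])  # steps that have a successor
    rates: dict[str, dict[str, float]] = defaultdict(dict)
    for (src, dst), n in pair_counts.items():
        rates[src][dst] = n / out_counts[src]
    return dict(rates)


events = [
    {"feature": "search", "release": "v1", "grounding": 0.9},
    {"feature": "search", "release": "v1", "grounding": 0.3},
    {"feature": "search", "release": "v2", "grounding": 0.8},
    {"feature": "chat", "release": "v2", "grounding": 0.2},
]
rates = hallucination_rate(events)
transitions = transition_rates(["retrieve", "llm", "retrieve", "llm", "escalate"])
```

Here rates is keyed by (feature, release), so comparing ("search", "v1") with ("search", "v2") shows whether a release moved the hallucination rate for that feature.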
If we put it all together, the LLM analogue of our proposal looks like this:
• the unit of work is an LLM span (or agent step), not a frame
• each span is a wide, structured event containing prompt, response, model parameters, and context
• annotations are added as semantic labels and scores: quality, safety, grounding, reasoning steps, tool calls
• custom metrics are derived from those annotations: cost per outcome, hallucination rate per feature, escalation rate per cohort, latency vs. quality trade-offs
• traces stitch spans into end-to-end flows: user request → retrieval → LLM calls → tools → final answer, enabling root cause analysis and optimization
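The last bullet, stitching spans into a trace for root cause analysis, can be sketched as follows. We assume each span records its parent id, its duration, and its cost, and that the children of a span run sequentially; the span names, durations, and costs are made-up sample data.

```python
from collections import defaultdict

# Hypothetical spans for one request: user request -> retrieval -> LLM call -> tool call.
spans = [
    {"id": "a", "parent": None, "name": "user_request", "ms": 1900, "cost_usd": 0.0},
    {"id": "b", "parent": "a", "name": "retrieval", "ms": 300, "cost_usd": 0.0},
    {"id": "c", "parent": "a", "name": "llm_call", "ms": 1200, "cost_usd": 0.012},
    {"id": "d", "parent": "c", "name": "tool_call", "ms": 700, "cost_usd": 0.001},
]


def build_tree(spans: list[dict]) -> dict:
    """Index spans by parent id so the trace can be walked top-down."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent"]].append(s)
    return children


def self_time(span: dict, children: dict) -> int:
    """Duration not explained by direct children (assumes sequential children)."""
    return span["ms"] - sum(c["ms"] for c in children[span["id"]])


children = build_tree(spans)
root = children[None][0]  # the span with no parent is the trace root
self_times = {s["name"]: self_time(s, children) for s in spans}
slowest = max(self_times, key=self_times.get)  # where the time actually goes
total_cost = sum(s["cost_usd"] for s in spans)  # cost of the whole request
```

Self time rather than raw duration is what points at the root cause: the root span is always the longest, but most of its time belongs to its children.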
Industry observability stacks for LLMs—LaunchDarkly’s LLM spans, Elastic’s LLM APM and dashboards, and SDKs like genai telemetry—are already implementing large parts of this pattern in production. Academic proposals around LLMOps, agent traces, and automatic evaluation are filling in the semantics of the annotations and metrics that matter.