Observability & Metrics

The Semantic Observability extension provides structured tracing of agent operations, Prometheus-format metrics export, and an aggregation query engine. All trace events are stored in Cortex as RDF triples for causal analysis.

TraceEmitter

The TraceEmitter buffers trace events in memory and flushes them to the Cortex /api/v1/events endpoint in batches.

Emit methods

| Method | Event type | Key fields |
| --- | --- | --- |
| emitToolCall() | tool_call | toolName, input, output, durationMs |
| emitLLMCall() | llm_call | model, promptTokens, completionTokens, durationMs |
| emitDecision() | decision | description, alternatives, chosen, reasoning |
| emitDelegation() | delegation | parentId, childId, task, runId |
| emitError() | error | error, context |
| emitRaw() | any | Push a pre-built event (used by SessionForkManager) |

Buffer management

  • Maximum buffer size: 5,000 events (when the buffer is full, the oldest events are dropped first, FIFO)
  • Event TTL: 5 minutes; stale events are discarded during flush
  • Flush interval: configurable (the default depends on the plugin configuration)
  • Backoff: on flush failure, the flush interval doubles, up to a maximum of 60 seconds
  • Final flush: on stop(), a final flush of the remaining events is attempted; any events that still cannot be flushed are dropped with a warning
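
The buffer rules above can be sketched roughly as follows. This is an illustration only, not the TraceEmitter source: the BoundedEventBuffer class, its methods, and the nextFlushInterval helper are hypothetical names. Only the limits (5,000 events, 5-minute TTL, 60-second backoff cap) come from the list above.

```typescript
// Hypothetical sketch of the buffer policy; not the real TraceEmitter.
type BufferedEvent = { id: string; type: string; createdAtMs: number };

const MAX_BUFFER_SIZE = 5_000;       // documented: oldest events dropped first
const EVENT_TTL_MS = 5 * 60 * 1_000; // documented: 5-minute TTL
const MAX_BACKOFF_MS = 60_000;       // documented: backoff cap of 60 seconds

class BoundedEventBuffer {
  private events: BufferedEvent[] = [];
  private readonly maxSize: number;

  constructor(maxSize: number = MAX_BUFFER_SIZE) {
    this.maxSize = maxSize;
  }

  // Append an event; when over capacity, drop from the front (oldest first).
  push(event: BufferedEvent): void {
    this.events.push(event);
    while (this.events.length > this.maxSize) {
      this.events.shift();
    }
  }

  // Empty the buffer and return only events still within the TTL.
  drainFresh(nowMs: number, ttlMs: number = EVENT_TTL_MS): BufferedEvent[] {
    const fresh = this.events.filter((e) => nowMs - e.createdAtMs <= ttlMs);
    this.events = [];
    return fresh;
  }

  get size(): number {
    return this.events.length;
  }
}

// After a failed flush, double the interval, capped at 60 seconds.
function nextFlushInterval(currentMs: number): number {
  return Math.min(currentMs * 2, MAX_BACKOFF_MS);
}
```

Passing the current time into drainFresh, rather than reading the clock inside it, keeps the TTL check deterministic and easy to test.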

RDF serialization

Events are serialized with namespace-prefixed subjects:

subject: {ns}:event:{uuid}
type: tool_call
agentId: orchestrator
timestamp: 2026-03-08T12:00:00.000Z
fields: { toolName: "Read", input: "...", output: "..." }
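
As a rough illustration of how such an event could be flattened into triples, the sketch below derives the subject from the namespace and UUID as shown above. The toTriples helper, the TraceEvent and Triple types, and the predicate naming ({ns}:{key}) are assumptions made for this example; the actual predicate vocabulary used by Cortex is not specified here.

```typescript
// Hypothetical sketch: flatten one trace event into subject/predicate/object triples.
type TraceEvent = {
  id: string; // UUID
  type: string;
  agentId: string;
  timestamp: string; // ISO 8601
  fields: Record<string, string | number>;
};

type Triple = { subject: string; predicate: string; object: string };

function toTriples(ns: string, event: TraceEvent): Triple[] {
  // Subject format taken from the docs: {ns}:event:{uuid}
  const subject = `${ns}:event:${event.id}`;

  const base: Array<[string, string]> = [
    ["type", event.type],
    ["agentId", event.agentId],
    ["timestamp", event.timestamp],
  ];
  const fieldPairs = Object.entries(event.fields).map(
    ([key, value]): [string, string] => [key, String(value)],
  );

  // Assumed predicate naming: {ns}:{key}. The real vocabulary may differ.
  return base.concat(fieldPairs).map(([key, value]) => ({
    subject,
    predicate: `${ns}:${key}`,
    object: value,
  }));
}
```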

MetricsExporter

The MetricsExporter collects counters and gauges, exporting them in Prometheus text exposition format with no external dependencies.

Registered metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| mayros_tool_calls_total | counter | tool_name | Total tool calls by tool |
| mayros_llm_calls_total | counter | model | Total LLM calls by model |
| mayros_llm_tokens_total | counter | direction | Tokens by prompt/completion |
| mayros_skill_queries_total | counter | tool | Skill graph queries |
| mayros_cortex_requests_total | counter | status | Cortex requests by success/error |
| mayros_active_skills | gauge | (none) | Number of active skills |

Prometheus output

# HELP mayros_tool_calls_total Total tool calls by tool name
# TYPE mayros_tool_calls_total counter
mayros_tool_calls_total{tool_name="Read"} 42
mayros_tool_calls_total{tool_name="Write"} 15

# HELP mayros_llm_tokens_total Total LLM tokens by direction
# TYPE mayros_llm_tokens_total counter
mayros_llm_tokens_total{direction="prompt"} 125000
mayros_llm_tokens_total{direction="completion"} 45000

The metrics endpoint is registered at the configured path (e.g., /metrics) when metrics.enabled is true.
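
Because the text exposition format is line-oriented, it can be produced without a client library. The sketch below renders one counter in the shape shown above. The CounterMetric type and the renderCounter and escapeLabelValue functions are hypothetical names for illustration, not the MetricsExporter API. The escaping of backslash, double quote, and newline in label values follows the Prometheus text format rules.

```typescript
// Hypothetical sketch of dependency-free Prometheus text rendering.
type CounterSample = { labels: Record<string, string>; value: number };
type CounterMetric = { name: string; help: string; samples: CounterSample[] };

// Label values must escape backslash, double quote, and line feed.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function renderCounter(metric: CounterMetric): string {
  const lines: string[] = [
    `# HELP ${metric.name} ${metric.help}`,
    `# TYPE ${metric.name} counter`,
  ];
  for (const sample of metric.samples) {
    const labelPairs = Object.entries(sample.labels)
      .map(([key, value]) => `${key}="${escapeLabelValue(value)}"`)
      .join(",");
    // Omit the braces entirely when a sample has no labels.
    const labelPart = labelPairs.length > 0 ? `{${labelPairs}}` : "";
    lines.push(`${metric.name}${labelPart} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}
```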

ObservabilityQueryEngine

The query engine provides aggregation and analysis over stored trace events.

AgentStats

typescript
type AgentStats = {
  agentId: string;
  totalEvents: number;
  toolCalls: number;
  llmCalls: number;
  decisions: number;
  delegations: number;
  errors: number;
  avgToolDurationMs: number;
  avgLLMDurationMs: number;
};

Query methods

| Method | Description |
| --- | --- |
| aggregateStats(agentId, timeRange?) | Aggregate event counts and average durations |
| findSlowOps(agentId, thresholdMs) | Find tool/LLM calls exceeding a duration |
| findErrors(agentId, limit?) | Group and rank error patterns by frequency |
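
To make the aggregation concrete, here is a minimal in-memory sketch of what aggregateStats and findSlowOps compute. It is not the engine's implementation: the real engine queries Cortex, while this version works on a plain array. The StoredEvent type and the function names ending in Sketch are hypothetical, the timeRange filter is omitted, and returning 0 for an average with no samples is an assumption.

```typescript
// Hypothetical in-memory sketch; the real engine reads events from Cortex.
type EventType = "tool_call" | "llm_call" | "decision" | "delegation" | "error";
type StoredEvent = { agentId: string; type: EventType; durationMs?: number };

type AgentStats = {
  agentId: string;
  totalEvents: number;
  toolCalls: number;
  llmCalls: number;
  decisions: number;
  delegations: number;
  errors: number;
  avgToolDurationMs: number;
  avgLLMDurationMs: number;
};

// Assumption: an empty sample set averages to 0 rather than NaN.
function average(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function aggregateStatsSketch(events: StoredEvent[], agentId: string): AgentStats {
  const own = events.filter((e) => e.agentId === agentId);
  const count = (type: EventType): number => own.filter((e) => e.type === type).length;
  const durations = (type: EventType): number[] =>
    own
      .filter((e) => e.type === type && e.durationMs !== undefined)
      .map((e) => e.durationMs as number);

  return {
    agentId,
    totalEvents: own.length,
    toolCalls: count("tool_call"),
    llmCalls: count("llm_call"),
    decisions: count("decision"),
    delegations: count("delegation"),
    errors: count("error"),
    avgToolDurationMs: average(durations("tool_call")),
    avgLLMDurationMs: average(durations("llm_call")),
  };
}

// Tool and LLM calls whose duration exceeds the threshold.
function findSlowOpsSketch(events: StoredEvent[], agentId: string, thresholdMs: number): StoredEvent[] {
  return events.filter(
    (e) =>
      e.agentId === agentId &&
      (e.type === "tool_call" || e.type === "llm_call") &&
      e.durationMs !== undefined &&
      e.durationMs > thresholdMs,
  );
}
```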

Error patterns

The findErrors method groups errors by message and returns one pattern per distinct message, ranked by frequency:

typescript
type ErrorPattern = {
  error: string;    // Error message
  count: number;    // Occurrence count
  lastSeen: string; // ISO timestamp of last occurrence
  agentId: string;
};
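
The grouping itself is a simple count-by-key. The following sketch shows one way to produce ErrorPattern values from a list of error events. The ErrorRecord input type, the groupErrorsSketch name, and the default limit of 10 are assumptions for illustration; the actual findErrors reads from Cortex and may differ.

```typescript
// Hypothetical sketch of grouping errors by message and ranking by frequency.
type ErrorRecord = { agentId: string; error: string; timestamp: string };

type ErrorPattern = {
  error: string;
  count: number;
  lastSeen: string;
  agentId: string;
};

// The default limit of 10 is an assumption, not a documented value.
function groupErrorsSketch(records: ErrorRecord[], agentId: string, limit: number = 10): ErrorPattern[] {
  const byMessage = new Map<string, ErrorPattern>();

  for (const record of records) {
    if (record.agentId !== agentId) continue;
    const existing = byMessage.get(record.error);
    if (existing) {
      existing.count += 1;
      // ISO 8601 UTC timestamps in the same format sort lexicographically.
      if (record.timestamp > existing.lastSeen) existing.lastSeen = record.timestamp;
    } else {
      byMessage.set(record.error, {
        error: record.error,
        count: 1,
        lastSeen: record.timestamp,
        agentId,
      });
    }
  }

  // Most frequent first, then truncate to the requested limit.
  return Array.from(byMessage.values())
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}
```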

Agent tools

| Tool | Description |
| --- | --- |
| trace_query | Query trace events with optional agent, time range, type, and format filters |
| trace_explain | Explain why an event occurred by walking its causal chain |
| trace_stats | Show aggregated statistics for an agent |
| trace_session_fork | Fork a session into a new session |
| trace_session_rewind | Rewind a session to a timestamp |

Output formats

The trace_query and trace_stats tools support multiple output formats:

  • terminal — formatted for console display
  • json — raw JSON output
  • markdown — formatted for documentation

Hooks wiring

| Hook | Condition | Action |
| --- | --- | --- |
| after_tool_call | captureToolCalls | Emit tool_call event + increment metrics |
| llm_input | captureLLMCalls | Start LLM call timer |
| llm_output | captureLLMCalls | Complete timer, emit llm_call event + token metrics |
| subagent_spawned | captureDelegations | Record delegation, start run timer |
| subagent_ended | captureDelegations | Complete delegation, emit error if failed |
| agent_end | tracing enabled | Emit error event if agent run failed |
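
The llm_input/llm_output pair (and likewise subagent_spawned/subagent_ended) implies a start/complete timer keyed by some identifier. A minimal sketch of that pattern follows. The CallTimer class and the idea of keying by a call ID are assumptions; the hook payloads and the key the plugin actually uses are not described here.

```typescript
// Hypothetical sketch of pairing a "start" hook with its matching "end" hook.
class CallTimer {
  private startedAt = new Map<string, number>();

  // Called from the opening hook (e.g. llm_input).
  start(callId: string, nowMs: number): void {
    this.startedAt.set(callId, nowMs);
  }

  // Called from the closing hook (e.g. llm_output).
  // Returns the duration in ms, or undefined if no matching start was seen.
  complete(callId: string, nowMs: number): number | undefined {
    const started = this.startedAt.get(callId);
    if (started === undefined) return undefined;
    this.startedAt.delete(callId); // avoid leaking entries for finished calls
    return nowMs - started;
  }
}
```

Returning undefined for an unmatched completion lets the caller decide whether to skip the event or emit it without a durationMs.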

Decision Graph integration

The observability plugin instantiates a DecisionGraph that queries Cortex for event triples and assembles them into decision trees. The trace_explain tool uses this to walk causal chains.

Health monitoring

A HealthMonitor periodically checks Cortex connectivity and emits decision/error events when the connection state changes (healthy/unhealthy).
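
Emitting only on state changes, rather than on every check, can be sketched as a small transition tracker. The HealthTransitionTracker class is a hypothetical name, and treating the first check as a silent baseline is an assumption; the real HealthMonitor may report the initial state.

```typescript
// Hypothetical sketch: report a state only when it differs from the last check.
type HealthState = "healthy" | "unhealthy";

class HealthTransitionTracker {
  private last: HealthState | undefined = undefined;

  // Returns the new state on a transition, otherwise undefined.
  // Assumption: the first check only records a baseline and reports nothing.
  record(connectionOk: boolean): HealthState | undefined {
    const current: HealthState = connectionOk ? "healthy" : "unhealthy";
    const previous = this.last;
    this.last = current;
    if (previous === undefined || previous === current) return undefined;
    return current;
  }
}
```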

CLI

bash
mayros observe status                    # Show config, Cortex status, buffered events
mayros observe events [--agent id] [--type tool_call] [--from iso] [--to iso]
mayros observe explain <eventId>         # Causal chain analysis
mayros observe stats [--agent id] [--format json]