Observability & Metrics
The Semantic Observability extension provides structured tracing of agent operations, Prometheus-format metrics export, and an aggregation query engine. All trace events are stored in Cortex as RDF triples for causal analysis.
TraceEmitter
The TraceEmitter buffers trace events in memory and flushes them to the Cortex /api/v1/events endpoint in batch.
Emit methods
| Method | Event type | Key fields |
|---|---|---|
| emitToolCall() | tool_call | toolName, input, output, durationMs |
| emitLLMCall() | llm_call | model, promptTokens, completionTokens, durationMs |
| emitDecision() | decision | description, alternatives, chosen, reasoning |
| emitDelegation() | delegation | parentId, childId, task, runId |
| emitError() | error | error, context |
| emitRaw() | any | Push a pre-built event (used by SessionForkManager) |
Buffer management
- Maximum buffer size: 5,000 events (oldest dropped via FIFO)
- Event TTL: 5 minutes — stale events discarded during flush
- Flush interval: configurable; the default comes from the plugin configuration
- Backoff: on flush failure, interval doubles up to 60 seconds
- Final flush: on stop(), remaining events are flushed; any events that still cannot be flushed are dropped with a warning
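The buffer rules above can be sketched as a small class. This is a minimal illustration, not the actual TraceEmitter source: `TraceBuffer`, `BufferedEvent`, `drain`, and `nextInterval` are hypothetical names, and resetting the interval after a successful flush is an assumption the list does not state.

```typescript
// Hypothetical sketch of the buffer rules; names are illustrative only.
type BufferedEvent = { id: string; createdAtMs: number };

const MAX_BUFFER_SIZE = 5_000;      // maximum buffer size from the list above
const EVENT_TTL_MS = 5 * 60 * 1000; // event TTL: 5 minutes
const MAX_INTERVAL_MS = 60_000;     // backoff cap: 60 seconds

class TraceBuffer {
  private events: BufferedEvent[] = [];
  private readonly baseIntervalMs: number;
  private intervalMs: number;

  constructor(baseIntervalMs: number) {
    this.baseIntervalMs = baseIntervalMs;
    this.intervalMs = baseIntervalMs;
  }

  push(event: BufferedEvent): void {
    this.events.push(event);
    // FIFO eviction: drop the oldest events once the cap is exceeded.
    if (this.events.length > MAX_BUFFER_SIZE) {
      this.events.splice(0, this.events.length - MAX_BUFFER_SIZE);
    }
  }

  // Empties the buffer and returns only events younger than the TTL.
  drain(nowMs: number): BufferedEvent[] {
    const fresh = this.events.filter((e) => nowMs - e.createdAtMs <= EVENT_TTL_MS);
    this.events = [];
    return fresh;
  }

  // Returns the delay before the next flush attempt.
  // Assumption: a successful flush resets the interval to its base value.
  nextInterval(flushSucceeded: boolean): number {
    this.intervalMs = flushSucceeded
      ? this.baseIntervalMs
      : Math.min(this.intervalMs * 2, MAX_INTERVAL_MS);
    return this.intervalMs;
  }

  get size(): number {
    return this.events.length;
  }
}
```

Passing the current time into `drain` keeps the TTL check deterministic, which makes the discard behavior easy to test without real timers.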
RDF serialization
Events are serialized with namespace-prefixed subjects:
subject: {ns}:event:{uuid}
type: tool_call
agentId: orchestrator
timestamp: 2026-03-08T12:00:00.000Z
fields: { toolName: "Read", input: "...", output: "..." }
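As a rough illustration of the layout above, the following sketch builds an event record with a namespace-prefixed subject and flattens it into subject/predicate/object triples. The helper names (`buildTraceEvent`, `toTriples`) and the `{ns}:{key}` predicate naming are assumptions made for this example; the actual Cortex schema may differ.

```typescript
// Hypothetical helpers; the real serializer and predicate names may differ.
type TraceEventType = "tool_call" | "llm_call" | "decision" | "delegation" | "error";

type TraceEvent = {
  subject: string;
  type: TraceEventType;
  agentId: string;
  timestamp: string; // ISO 8601
  fields: Record<string, unknown>;
};

type Triple = { subject: string; predicate: string; object: string };

function buildTraceEvent(
  ns: string,
  uuid: string,
  type: TraceEventType,
  agentId: string,
  fields: Record<string, unknown>,
  now: Date = new Date(),
): TraceEvent {
  return {
    subject: `${ns}:event:${uuid}`, // namespace-prefixed subject
    type,
    agentId,
    timestamp: now.toISOString(),
    fields,
  };
}

// Flattens one event into triples that share the event's subject.
// Assumption: predicates are named `{ns}:{key}`; non-string values are JSON-encoded.
function toTriples(ns: string, event: TraceEvent): Triple[] {
  const triple = (key: string, value: unknown): Triple => ({
    subject: event.subject,
    predicate: `${ns}:${key}`,
    object: typeof value === "string" ? value : JSON.stringify(value),
  });
  return [
    triple("type", event.type),
    triple("agentId", event.agentId),
    triple("timestamp", event.timestamp),
    ...Object.entries(event.fields).map(([k, v]) => triple(k, v)),
  ];
}
```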
MetricsExporter
The MetricsExporter collects counters and gauges, exporting them in Prometheus text exposition format with no external dependencies.
Registered metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mayros_tool_calls_total | counter | tool_name | Total tool calls by tool |
| mayros_llm_calls_total | counter | model | Total LLM calls by model |
| mayros_llm_tokens_total | counter | direction | Tokens by prompt/completion |
| mayros_skill_queries_total | counter | tool | Skill graph queries |
| mayros_cortex_requests_total | counter | status | Cortex requests by success/error |
| mayros_active_skills | gauge | (none) | Number of active skills |
Prometheus output
# HELP mayros_tool_calls_total Total tool calls by tool name
# TYPE mayros_tool_calls_total counter
mayros_tool_calls_total{tool_name="Read"} 42
mayros_tool_calls_total{tool_name="Write"} 15
# HELP mayros_llm_tokens_total Total LLM tokens by direction
# TYPE mayros_llm_tokens_total counter
mayros_llm_tokens_total{direction="prompt"} 125000
mayros_llm_tokens_total{direction="completion"} 45000
The metrics endpoint is registered at the configured path (e.g., /metrics) when metrics.enabled is true.
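To show how the text format above can be produced without external dependencies, here is a minimal rendering sketch. The `Metric` and `MetricSample` types and the `renderMetric` function are hypothetical and are not the MetricsExporter API; the label-value escaping (backslash, double quote, newline) follows the Prometheus text exposition format.

```typescript
// Hypothetical sketch; not the actual MetricsExporter implementation.
type MetricSample = { labels: Record<string, string>; value: number };
type Metric = {
  name: string;
  help: string;
  type: "counter" | "gauge";
  samples: MetricSample[];
};

// Label values must escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function renderMetric(metric: Metric): string {
  const lines = [
    `# HELP ${metric.name} ${metric.help}`,
    `# TYPE ${metric.name} ${metric.type}`,
  ];
  for (const sample of metric.samples) {
    const pairs = Object.entries(sample.labels).map(
      ([key, value]) => `${key}="${escapeLabelValue(value)}"`,
    );
    // Metrics without labels are written as a bare name followed by the value.
    const labelPart = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
    lines.push(`${metric.name}${labelPart} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}
```

An HTTP handler for the configured path would concatenate the rendered metrics and serve them as plain text.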
ObservabilityQueryEngine
The query engine provides aggregation and analysis over stored trace events.
AgentStats
```typescript
type AgentStats = {
  agentId: string;
  totalEvents: number;
  toolCalls: number;
  llmCalls: number;
  decisions: number;
  delegations: number;
  errors: number;
  avgToolDurationMs: number;
  avgLLMDurationMs: number;
};
```
Query methods
| Method | Description |
|---|---|
| aggregateStats(agentId, timeRange?) | Aggregate event counts and average durations |
| findSlowOps(agentId, thresholdMs) | Find tool/LLM calls exceeding a duration |
| findErrors(agentId, limit?) | Group and rank error patterns by frequency |
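The aggregation these methods describe can be sketched over an in-memory array of events. The real engine queries Cortex, so the functions below (`aggregateStatsFromEvents`, `findSlowOpsInEvents`), the `StoredEvent` shape, and the `{ from, to }` time range are assumptions made for illustration only.

```typescript
// Hypothetical in-memory sketch; the real engine reads events from Cortex.
type EventType = "tool_call" | "llm_call" | "decision" | "delegation" | "error";
type StoredEvent = { agentId: string; type: EventType; timestamp: string; durationMs?: number };
type TimeRange = { from: string; to: string }; // assumed shape: ISO timestamps

type AgentStats = {
  agentId: string; totalEvents: number; toolCalls: number; llmCalls: number;
  decisions: number; delegations: number; errors: number;
  avgToolDurationMs: number; avgLLMDurationMs: number;
};

function inRange(timestamp: string, range?: TimeRange): boolean {
  if (range === undefined) return true;
  const t = Date.parse(timestamp);
  return t >= Date.parse(range.from) && t <= Date.parse(range.to);
}

function average(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

function aggregateStatsFromEvents(
  events: StoredEvent[], agentId: string, timeRange?: TimeRange,
): AgentStats {
  const selected = events.filter(
    (e) => e.agentId === agentId && inRange(e.timestamp, timeRange),
  );
  const count = (type: EventType): number => selected.filter((e) => e.type === type).length;
  // Collect durations for one event type, skipping events without a duration.
  const durations = (type: EventType): number[] => {
    const out: number[] = [];
    for (const e of selected) {
      if (e.type === type && e.durationMs !== undefined) out.push(e.durationMs);
    }
    return out;
  };
  return {
    agentId,
    totalEvents: selected.length,
    toolCalls: count("tool_call"),
    llmCalls: count("llm_call"),
    decisions: count("decision"),
    delegations: count("delegation"),
    errors: count("error"),
    avgToolDurationMs: average(durations("tool_call")),
    avgLLMDurationMs: average(durations("llm_call")),
  };
}

// Tool and LLM calls whose duration is strictly above the threshold.
function findSlowOpsInEvents(
  events: StoredEvent[], agentId: string, thresholdMs: number,
): StoredEvent[] {
  return events.filter(
    (e) =>
      e.agentId === agentId &&
      (e.type === "tool_call" || e.type === "llm_call") &&
      e.durationMs !== undefined &&
      e.durationMs > thresholdMs,
  );
}
```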
Error patterns
The findErrors method groups errors by message and returns:
```typescript
type ErrorPattern = {
  error: string;    // Error message
  count: number;    // Occurrence count
  lastSeen: string; // ISO timestamp of last occurrence
  agentId: string;
};
```
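Grouping by message and ranking by frequency could look like the sketch below. It works on an in-memory array rather than Cortex, and `groupErrorPatterns`, the `ErrorEvent` shape, and the default limit of 10 are assumptions for this example, not documented behavior.

```typescript
// Hypothetical sketch of grouping error events into ranked patterns.
type ErrorPattern = { error: string; count: number; lastSeen: string; agentId: string };
type ErrorEvent = { agentId: string; error: string; timestamp: string };

function groupErrorPatterns(
  events: ErrorEvent[],
  agentId: string,
  limit: number = 10, // assumed default; the source does not state one
): ErrorPattern[] {
  const byMessage = new Map<string, ErrorPattern>();
  for (const event of events) {
    if (event.agentId !== agentId) continue;
    const existing = byMessage.get(event.error);
    if (existing === undefined) {
      byMessage.set(event.error, {
        error: event.error,
        count: 1,
        lastSeen: event.timestamp,
        agentId,
      });
    } else {
      existing.count += 1;
      // Keep the most recent timestamp for this message.
      if (Date.parse(event.timestamp) > Date.parse(existing.lastSeen)) {
        existing.lastSeen = event.timestamp;
      }
    }
  }
  // Most frequent patterns first, truncated to the requested limit.
  return Array.from(byMessage.values())
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}
```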
Agent tools
| Tool | Description |
|---|---|
| trace_query | Query trace events with optional agent, time range, type, and format filters |
| trace_explain | Explain why an event occurred by walking its causal chain |
| trace_stats | Show aggregated statistics for an agent |
| trace_session_fork | Fork a session into a new session |
| trace_session_rewind | Rewind a session to a timestamp |
Output formats
The trace_query and trace_stats tools support multiple output formats:
- terminal — formatted for console display
- json — raw JSON output
- markdown — formatted for documentation
Hooks wiring
| Hook | Condition | Action |
|---|---|---|
| after_tool_call | captureToolCalls | Emit tool_call event + increment metrics |
| llm_input | captureLLMCalls | Start LLM call timer |
| llm_output | captureLLMCalls | Complete timer, emit llm_call event + token metrics |
| subagent_spawned | captureDelegations | Record delegation, start run timer |
| subagent_ended | captureDelegations | Complete delegation, emit error if failed |
| agent_end | tracing enabled | Emit error event if agent run failed |
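The llm_input and llm_output rows describe a start/complete timer pair. One way to picture that pairing is the sketch below; `LlmCallTimer`, the `callId` correlation key, and the `LlmUsage` shape are hypothetical, and the actual hook payloads are not documented here.

```typescript
// Hypothetical sketch of pairing llm_input with llm_output to measure duration.
type LlmUsage = { model: string; promptTokens: number; completionTokens: number };
type LlmCallEvent = LlmUsage & { type: "llm_call"; durationMs: number };

class LlmCallTimer {
  private readonly startedAt = new Map<string, number>();

  // Called from the llm_input hook; callId is an assumed correlation key.
  start(callId: string, nowMs: number): void {
    this.startedAt.set(callId, nowMs);
  }

  // Called from the llm_output hook; returns the event to emit,
  // or undefined when no matching start was recorded.
  complete(callId: string, nowMs: number, usage: LlmUsage): LlmCallEvent | undefined {
    const started = this.startedAt.get(callId);
    if (started === undefined) return undefined;
    this.startedAt.delete(callId);
    return {
      type: "llm_call",
      model: usage.model,
      promptTokens: usage.promptTokens,
      completionTokens: usage.completionTokens,
      durationMs: nowMs - started,
    };
  }
}
```

Deleting the entry on completion keeps the map from growing when calls finish normally; a production version would also need to expire timers for calls that never produce output.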
Decision Graph integration
The observability plugin instantiates a DecisionGraph that queries Cortex for event triples and assembles them into decision trees. The trace_explain tool uses this to walk causal chains.
Health monitoring
A HealthMonitor periodically checks Cortex connectivity and emits decision/error events when the connection state changes (healthy/unhealthy).
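Emitting only on state changes means the monitor has to remember the last observed state. A minimal sketch of that idea follows; `HealthTransitionTracker` is a hypothetical name, and treating the first check as a change is an assumption.

```typescript
// Hypothetical sketch of change detection for periodic health checks.
type HealthState = "healthy" | "unhealthy";

class HealthTransitionTracker {
  private last: HealthState | undefined;

  // Returns the new state when it differs from the previous check,
  // or undefined when nothing changed (so no event would be emitted).
  // Assumption: the very first check is reported as a change.
  observe(cortexReachable: boolean): HealthState | undefined {
    const current: HealthState = cortexReachable ? "healthy" : "unhealthy";
    if (current === this.last) return undefined;
    this.last = current;
    return current;
  }
}
```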
CLI
```bash
mayros observe status                # Show config, Cortex status, buffered events
mayros observe events [--agent id] [--type tool_call] [--from iso] [--to iso]
mayros observe explain <eventId>     # Causal chain analysis
mayros observe stats [--agent id] [--format json]
```
Related
- Decision Graph — causal chain analysis and session trees
- Session Fork — fork and rewind sessions
- Token Economy — budget tracking and cost metrics
- Cortex — AIngle Cortex knowledge graph