Runtime Metrics
TopoExec runtime metrics are exposed through RuntimeRunnerResult::runtime_metrics and through the CLI:
topoexec graph metrics examples/minimal.yaml --steps 1 --format json
The JSON output is a RuntimeRunnerResult object. It includes
metric_schema_version: "1" and a metrics field with samples in this shape:
{
"name": "runtime.channel.publish_count",
"value": 1.0,
"component_id": "",
"lane": "",
"channel_id": "source_transform",
"tags": []
}
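For readers consuming this JSON from C++, the sample shape above can be mirrored by a small struct. This is a minimal sketch, not the library's own type: `MetricSample`, `find_channel_metric`, and `require_channel_metric` are hypothetical names, and treating `tags` as a list of strings is an assumption, since the example only shows an empty array.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical mirror of one entry in the JSON "metrics" array.
struct MetricSample {
    std::string name;
    double value = 0.0;
    std::string component_id;       // "" in the example above
    std::string lane;               // "" in the example above
    std::string channel_id;
    std::vector<std::string> tags;  // assumption: tags are strings
};

// Return the value of the first sample matching a metric name and channel id.
inline std::optional<double> find_channel_metric(
    const std::vector<MetricSample>& samples,
    const std::string& name,
    const std::string& channel_id) {
    for (const auto& s : samples) {
        if (s.name == name && s.channel_id == channel_id) {
            return s.value;
        }
    }
    return std::nullopt;
}

// Variant for test code: the metric is expected to be present.
inline double require_channel_metric(
    const std::vector<MetricSample>& samples,
    const std::string& name,
    const std::string& channel_id) {
    const auto value = find_channel_metric(samples, name, channel_id);
    assert(value.has_value());
    return *value;
}
```

Looking up by name plus a bounded label keeps assertions independent of sample order, which the text above does not guarantee.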
Metric names are part of the public observability contract. Prefer adding new names over changing the meaning of existing names.
Metric Schema v1
topoexec/runtime/metric_schema.hpp exposes the descriptor registry used by
exporter adapters:
- kRuntimeMetricSchemaVersion: currently "1".
- runtime_metric_descriptors(): returns descriptors with name, kind, unit, allowed labels, cardinality rule, stability, and description.
- validate_runtime_metric_samples(...): verifies exported samples against descriptor cardinality rules.
Default runtime metrics may use only bounded labels: lane, component_id,
and channel_id. Graph-defined component ids, lane ids, channel ids, and
CompositeLoop ids are allowed. Correlation, causation, transaction, trace, and
request ids are intentionally not default metric labels; keep them in trace or
log records to avoid label explosion.
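The bounded-label rule can be illustrated with a small allow-list check. This is a sketch of the idea only; `kBoundedLabels` and `unbounded_labels` are hypothetical names, and the real `validate_runtime_metric_samples(...)` may check more than label keys.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// The three bounded label keys named in the text above.
constexpr std::array<std::string_view, 3> kBoundedLabels{
    "lane", "component_id", "channel_id"};

// Return every label key that is not in the bounded set, in input order.
inline std::vector<std::string> unbounded_labels(
    const std::vector<std::string>& keys) {
    std::vector<std::string> rejected;
    for (const auto& key : keys) {
        const bool allowed =
            std::find(kBoundedLabels.begin(), kBoundedLabels.end(),
                      std::string_view(key)) != kBoundedLabels.end();
        if (!allowed) {
            rejected.push_back(key);
        }
    }
    return rejected;
}
```

A key such as trace_id or request_id would be reported here, which matches the guidance to keep those ids in trace or log records.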
MetricRegistry::histogram(name) exports lightweight in-process histogram summaries without external dependencies. Snapshot suffixes are name.count, name.min, name.max, name.avg, name.p50, name.p95, and name.p99; percentile values use deterministic linear interpolation over the observed sample set.
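As an illustration of linear-interpolation percentiles, the sketch below uses the common rank = p * (n - 1) convention. That convention is an assumption: the text only says the interpolation is linear and deterministic, so the runtime's exact rank formula may differ. The function name `percentile` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Percentile by linear interpolation between the two closest ranks.
// Assumption: rank = p * (n - 1), with p in [0, 1].
inline double percentile(std::vector<double> samples, double p) {
    assert(!samples.empty() && p >= 0.0 && p <= 1.0);
    std::sort(samples.begin(), samples.end());
    const double rank = p * static_cast<double>(samples.size() - 1);
    const auto lo = static_cast<std::size_t>(std::floor(rank));
    const auto hi = static_cast<std::size_t>(std::ceil(rank));
    const double frac = rank - static_cast<double>(lo);
    // When rank is an integer, lo == hi and the sample itself is returned.
    return samples[lo] + frac * (samples[hi] - samples[lo]);
}
```

For the samples 1, 2, 3, 4 this convention places p50 at rank 1.5, halfway between 2 and 3, so the result is 2.5.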
OTel Preview Mapping
The optional topoexec_adapters::otel target maps runtime metric samples
through runtime_metric_descriptors() before producing exporter-preview records:
- descriptor counter -> OTel counter-shaped record;
- descriptor gauge -> observable-gauge-shaped record;
- descriptor histogram -> histogram-shaped record;
- descriptor unit, stability, and schema version are preserved as record metadata;
- only descriptor-allowed bounded labels (component_id, lane, channel_id) become metric attributes.
Raw sample tags are not exported as labels. The preview records only
topoexec.ignored_tag_count when tags are present so adapter authors can spot
non-schema tag use without creating high-cardinality time series. The preview
does not require or link an external telemetry SDK.
Prometheus Preview Mapping
The optional topoexec_adapters::prometheus target renders runtime metrics
as dependency-free text exposition:
- descriptor counter -> sanitized metric name with _total and # TYPE ... counter;
- descriptor gauge -> sanitized metric name and # TYPE ... gauge;
- custom MetricRegistry::histogram(name) summaries -> Prometheus summary text from <name>.count, <name>.avg, <name>.p50, <name>.p95, and <name>.p99, plus min/max/avg gauge helpers when those samples are present;
- only descriptor-allowed bounded labels (component_id, lane, channel_id) become text labels.
Raw sample tags are ignored and unexpected descriptor labels are rejected, so
correlation, causation, transaction, trace, or request ids cannot become default
Prometheus labels. The preview does not start an HTTP server or link a
Prometheus library; embedding applications own any scrape endpoint or transport.
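To make the counter mapping concrete, here is a rough sketch of name sanitization and text rendering. It is not the adapter's code: `sanitize_metric_name` and `render_counter` are hypothetical helpers, the exact sanitization rules of the real target are not stated in this document, and the sketch assumes label values contain no quotes, backslashes, or newlines (which the Prometheus text format would require escaping).

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

// Replace characters outside [a-zA-Z0-9_:] with '_' so the name fits the
// Prometheus metric-name character set; prefix '_' if it starts with a digit.
inline std::string sanitize_metric_name(const std::string& name) {
    assert(!name.empty());
    std::string out;
    out.reserve(name.size() + 1);
    for (unsigned char c : name) {
        const bool ok = std::isalnum(c) || c == '_' || c == ':';
        out.push_back(ok ? static_cast<char>(c) : '_');
    }
    if (std::isdigit(static_cast<unsigned char>(out[0]))) {
        out.insert(out.begin(), '_');
    }
    return out;
}

// Render one counter sample with an optional channel_id label.
// Assumption: channel_id needs no escaping.
inline std::string render_counter(const std::string& name, double value,
                                  const std::string& channel_id) {
    const std::string metric = sanitize_metric_name(name) + "_total";
    std::ostringstream os;
    os << "# TYPE " << metric << " counter\n" << metric;
    if (!channel_id.empty()) {
        os << "{channel_id=\"" << channel_id << "\"}";
    }
    os << ' ' << value << '\n';
    return os.str();
}
```

An embedding application would concatenate such blocks and serve them from its own scrape endpoint, since the preview itself starts no HTTP server.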
Stable Names
Scheduler:
- runtime.scheduler.tick_count: fixed-rate ticks executed by the lane. Non-fixed_rate lanes report 0.
- runtime.scheduler.completed_count: completed component invocations for a lane.
- runtime.scheduler.tick_overrun_count: lane tick overruns observed by the scheduler.
- runtime.scheduler.skipped_tick_count: fixed-rate ticks intentionally skipped by an overrun policy. This remains 0 for deterministic stepping unless a wall-clock overrun policy skips ticks.
- runtime.scheduler.max_lateness_ms: maximum observed fixed-rate lateness or simulated overrun amount.
- runtime.scheduler.queue_depth: maximum queued scheduler tasks observed for a lane. This is 0 for the single-thread event loop.
- runtime.scheduler.queue_capacity: configured/effective pending queue capacity for the lane.
- runtime.scheduler.worker_count: configured/effective worker count for the lane; for thread_pool, this is the persistent worker count.
- runtime.scheduler.last_callback_duration_ms: latest scheduler iteration duration observed for the lane.
- runtime.scheduler.blocked_duration_ms: maximum wall-clock wait duration inserted before a fixed-rate tick.
- runtime.scheduler.tick_jitter_ms: positive simulated overrun amount above the fixed-rate period or tick budget.
- runtime.scheduler.active_count: maximum active workers observed for a lane. This is 0 for the single-thread event loop.
- runtime.scheduler.in_flight_count: maximum in-flight scheduler tasks observed for a lane. This is 0 for the single-thread event loop.
- runtime.scheduler.rejected_count: scheduler admission rejections, skipped ready invocations, or dropped ready invocations for a lane.
- runtime.scheduler.priority_high_count: completed invocations with runtime execution.priority: high.
- runtime.scheduler.priority_normal_count: completed invocations with default or explicit runtime execution.priority: normal.
- runtime.scheduler.priority_low_count: completed invocations with runtime execution.priority: low.
- runtime.scheduler.priority_background_count: completed invocations with runtime execution.priority: background.
- runtime.scheduler.low_priority_rejected_count: low/background ready invocations rejected or dropped by lane admission overflow.
- runtime.scheduler.starvation_guard_count: explicit runtime starvation-guard interventions; v1 normally reports 0 because no aging intervention is implemented.
Components:
- runtime.component.execution_count: completed Component::execute_status() invocations for a component.
- runtime.component.error_count: failed status-returning or exception-wrapped component invocations.
- runtime.component.last_duration_ns: most recent component execution duration in nanoseconds.
- runtime.component.max_duration_ns: maximum component execution duration observed in the run.
- runtime.component.budget_overrun_count: component invocations whose duration exceeded execution.budget_ms.
- runtime.component.cancellation_requested_count: component invocations that completed after their cooperative cancellation token was requested.
- runtime.component.cancellation_observed_count: component invocations that called Invocation::cancel_requested(), GraphContext::cancel_requested(), or legacy stop_requested() after cancellation was requested.
- runtime.component.timeout_budget_exceeded_count: component invocations whose cooperative timeout budget was exceeded. This is reported after the invocation returns; it is not hard preemption.
- runtime.component.max_in_flight_count: maximum concurrent in-flight invocations observed for the component. It is 1 or less for non-reentrant components on the current event-loop runtime.
Triggers:
- runtime.trigger.ready_count: invocations admitted by trigger readiness for a component.
- runtime.trigger.suppressed_count: event-driven checks that did not admit an invocation.
- runtime.trigger.coalesced_count: invocations admitted through a coalescing trigger policy.
- runtime.trigger.timeout_drop_count: pending messages dropped by trigger max_latency_ms.
- runtime.trigger.batch_flush_count: batch invocations flushed by size or window.
- runtime.trigger.time_sync_drop_count: oldest out-of-slop samples dropped by time_sync.
- runtime.trigger.late_drop_count: timestamped samples dropped by watermark because they arrived behind the accepted lateness window.
- runtime.trigger.pending_drop_count: pending trigger messages dropped to keep suppressed rate_limit, debounce, coalesce, or condition queues within the derived per-input bound.
- runtime.trigger.condition_suppressed_count: condition checks that did not admit an invocation because the declarative predicate was not satisfied.
- runtime.trigger.rate_limit_suppressed_count: ready checks suppressed by min_interval_ms, including rate_limit trigger policies.
Channels:
- runtime.channel.publish_count: accepted publications for a channel.
- runtime.channel.delivery_count: delivered messages for a channel.
- runtime.channel.drop_count: messages that did not enter a consumable path or expired before delivery.
- runtime.channel.deadline_miss_count: delivered messages older than policy.deadline_ms.
- runtime.channel.stale_drop_count: messages dropped before delivery because policy.lifespan_ms expired.
- runtime.channel.reject_count: publications rejected or would-blocked by capacity/backpressure policy.
- runtime.channel.overwrite_count: stored messages overwritten or dropped-oldest to accept newer work.
- runtime.channel.health_event_count: aggregate channel degradation events from overwrite, reject, stale, or deadline paths.
- runtime.channel.max_depth: maximum observed channel depth.
- runtime.channel.message_age_ms: latest observed age at delivery time for that channel.
- runtime.channel.payload_copy_count: payload copies forced by channel copy policy.
Fan-out and batch commits preflight deterministic reject paths before making payloads visible. A failed preflight does not partially publish earlier payloads, but the rejecting channel still records reject/drop metrics and health events so operators can diagnose the failed commit.
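The all-or-nothing behavior described above can be pictured as a two-phase commit over the target channels. This is a simplified sketch under stated assumptions, not the runtime's implementation: `Channel`, `fan_out_commit`, the capacity-only reject rule, and the integer payload are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical bounded channel with a reject counter.
struct Channel {
    std::size_t capacity = 0;
    std::vector<int> queue;        // payloads visible to consumers
    std::size_t reject_count = 0;  // stands in for runtime.channel.reject_count
};

// Commit one payload to every target channel, or to none of them.
// Throws std::out_of_range if a target id is unknown.
inline bool fan_out_commit(std::map<std::string, Channel>& channels,
                           const std::vector<std::string>& targets,
                           int payload) {
    bool ok = true;
    // Phase 1 (preflight): record a reject on each channel that is full.
    for (const auto& id : targets) {
        Channel& ch = channels.at(id);
        if (ch.queue.size() >= ch.capacity) {
            ++ch.reject_count;
            ok = false;
        }
    }
    if (!ok) {
        return false;  // nothing was made visible on any channel
    }
    // Phase 2: every channel accepted, so publish everywhere.
    for (const auto& id : targets) {
        channels.at(id).queue.push_back(payload);
    }
    return true;
}
```

Only the rejecting channel's counter moves on a failed commit, which is what lets an operator see which channel blocked the batch.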
Health events:
- runtime.health.event_count: bounded health events retained in RuntimeRunnerResult::health_events.
- runtime.health.dropped_count: health events dropped because the sink capacity was full or the sink could not accept the event without waiting.
- runtime.health.coalesced_count: repeated health events merged into an existing retained event with the same coalescing key.
Health event metrics summarize the observer sink only. Channel-specific runtime.channel.health_event_count remains the per-channel degradation counter and is not a substitute for the bounded event list.
Publication router:
- runtime.publication.staged: total staged publications observed by GraphContext::publish().
- runtime.publication.committed: staged publications committed to runtime channels.
- runtime.publication.delayed: publications deferred by delay edges.
- runtime.publication.state: publications deferred by state edges.
- runtime.publication.state_committed: state-edge publications committed at epoch boundaries.
- runtime.publication.async: publications deferred by async edges.
- runtime.publication.failed_commit: failed publication commit batches.
- runtime.publication.composite_discarded: CompositeLoop external outputs discarded by a partial-success policy before they reached runtime channels.
Async admission:
- runtime.async.accepted_count: async completions accepted by edge-level admission.
- runtime.async.rejected_count: async completions rejected by policy.max_inflight admission.
- runtime.async.dropped_count: newest async completions dropped instead of being retained, such as drop_newest.
- runtime.async.overwrite_count: older pending async completions removed by drop_oldest/overwrite admission to retain a newer completion.
- runtime.async.in_flight_count: async completions still pending deferred delivery at the end of the run.
- runtime.async.max_in_flight_count: maximum pending async completions observed during the run.
- runtime.async.completed_count: async completions committed to runtime channels at epoch boundaries.
- runtime.async.cancelled_count: async completions cancelled by shutdown. This is 0 in the current cooperative MVP.
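How the async admission counters relate to each other can be sketched with a small bounded queue. Everything here is hypothetical (`Overflow`, `AsyncAdmission`, `offer`), and one detail is an assumption the list above does not settle: the sketch counts a completion admitted through drop_oldest as accepted in addition to incrementing the overwrite counter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

enum class Overflow { Reject, DropNewest, DropOldest };

// Hypothetical edge-level admission queue bounded by max_inflight.
struct AsyncAdmission {
    std::size_t max_inflight;
    Overflow overflow;
    std::deque<int> pending;  // completions awaiting deferred delivery
    std::size_t accepted = 0;
    std::size_t rejected = 0;
    std::size_t dropped = 0;
    std::size_t overwritten = 0;
    std::size_t max_seen = 0;

    void offer(int completion) {
        assert(max_inflight > 0);
        if (pending.size() >= max_inflight) {
            switch (overflow) {
                case Overflow::Reject:
                    ++rejected;
                    return;
                case Overflow::DropNewest:
                    ++dropped;  // the incoming completion is discarded
                    return;
                case Overflow::DropOldest:
                    pending.pop_front();  // make room for the newer one
                    ++overwritten;
                    break;
            }
        }
        pending.push_back(completion);
        ++accepted;  // assumption: overwrite admissions also count as accepted
        max_seen = std::max(max_seen, pending.size());
    }
};
```

In this model, `pending.size()` at the end of a run plays the role of runtime.async.in_flight_count and `max_seen` the role of runtime.async.max_in_flight_count.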
Composite loops:
- runtime.loop.iterations: loop iterations executed for a CompositeLoop region. component_id carries the loop id.
- runtime.loop.converged: convergence stops for a CompositeLoop region. component_id carries the loop id.
- runtime.loop.budget_overrun: budget stops for a CompositeLoop region. component_id carries the loop id.
- runtime.loop.max_iterations_hit: max-iteration stops for a CompositeLoop region. component_id carries the loop id.
- runtime.loop.error: internal CompositeLoop component failures. component_id carries the loop id.
- runtime.loop.cancellation_requested: cancellation requests observed by a CompositeLoop between iterations. component_id carries the loop id.
- runtime.loop.cancellation_observed: CompositeLoop cancellation stops performed between iterations. component_id carries the loop id.
- runtime.loop.output_discarded: non-converged solver stops whose staged external outputs were discarded. component_id carries the loop id.
- runtime.loop.residual: latest residual reported by a solver-style CompositeLoop iteration. component_id carries the loop id.
State/config snapshots:
- runtime.state.staged_write_count: blackboard writes staged for an epoch-boundary commit.
- runtime.state.committed_write_count: blackboard writes applied at epoch boundaries.
- runtime.state.rejected_write_count: blackboard writes rejected by empty namespace/key/writer, null payload, or single-writer policy.
- runtime.state.snapshot_read_count: blackboard snapshots/read calls.
- runtime.state.current_value_count: current committed blackboard values.
- runtime.config.version: committed config transaction version. Initial graph/component config load starts at version 0.
- runtime.config.last_transaction_id: most recently committed config transaction id.
- runtime.config.staged_update_count: component config snapshots staged for an epoch-boundary update.
- runtime.config.committed_update_count: component config snapshots applied at epoch boundaries.
- runtime.config.immediate_update_count: explicitly immediate component config updates.
- runtime.config.rejected_update_count: rejected config updates.
- runtime.config.rolled_back_update_count: pending component config updates discarded by failed validation/apply rollback.
- runtime.config.snapshot_read_count: graph/component config snapshot reads.
Component lifecycle:
- runtime.lifecycle.reset_count: component reset hooks completed at the start epoch boundary.
- runtime.lifecycle.reset_failure_count: reset hooks rejected or threw before scheduler execution.
- runtime.lifecycle.restore_count: component state snapshots restored before execution.
- runtime.lifecycle.restore_failure_count: restore requests rejected by component id, type, version, or payload validation.
- runtime.lifecycle.snapshot_count: component state snapshots captured after scheduler execution.
- runtime.lifecycle.snapshot_failure_count: snapshot capture failures reported after execution.
- runtime.lifecycle.snapshot_size_bytes: total byte size reported or estimated for captured component snapshots.
Trace:
- runtime.trace.event_count: structured trace events emitted during the run.
Runtime observer:
- runtime.observer.failure_count: best-effort observer callback/status failures recorded without changing runtime ok.
- runtime.observer.dropped_event_count: events dropped by bounded observers such as InMemoryRuntimeObserver.
Custom histograms:
- <name>.count, <name>.min, <name>.max, <name>.avg, <name>.p50, <name>.p95, <name>.p99: summaries emitted by MetricRegistry::histogram(name).
Aggregate Counters
The top-level JSON result also carries aggregate counters for common dashboards. channel_drop_count is the real-drop total; use the breakdown fields below to explain overwrite, rejection, stale expiry, or deadline misses separately.
- channel_publish_count
- channel_delivery_count
- channel_drop_count
- channel_overwrite_count
- channel_reject_count
- channel_stale_drop_count
- channel_deadline_miss_count
- payload_copy_count
- staged_publication_count
- committed_publication_count
- delayed_publication_count
- state_publication_count
- state_commit_count
- async_publication_count
- failed_publication_commit_count
- loop_iteration_count
- loop_converged_count
- loop_budget_overrun_count
- loop_max_iteration_hit_count
- loop_output_discarded_count
- loop_last_residual
- loop_stop_reason
- loop_cancellation_requested_count
- loop_cancellation_observed_count
These fields summarize the sample array for quick CLI and test assertions; the sample array remains the extensible product surface. Correlation and causation metadata are emitted on trace events and invocation/channel records, but they are not default metric labels to avoid cardinality explosion.
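A test that wants to cross-check an aggregate field against the sample array could fold the samples by name. The sketch below assumes the aggregate is a plain sum of the same-named samples across channels, which this document implies but does not spell out; `Sample` and `sum_by_name` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical reduced view of a metric sample.
struct Sample {
    std::string name;
    double value;
    std::string channel_id;
};

// Sum the values of all samples with the given metric name.
inline double sum_by_name(const std::vector<Sample>& samples,
                          const std::string& name) {
    assert(!name.empty());
    double total = 0.0;
    for (const auto& s : samples) {
        if (s.name == name) {
            total += s.value;
        }
    }
    return total;
}
```

Under that assumption, `sum_by_name(samples, "runtime.channel.drop_count")` would be compared against the top-level channel_drop_count field.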
Channel health metrics
See channels.md for the channel/backpressure contract. Deadline misses, stale drops, rejects, overwrites, and aggregate health counters are exported as stable runtime.channel.* metrics. Optional bounded health events are observer records in RuntimeRunnerResult::health_events and trace output, not default metric labels.