
Reliability Patterns in Real-Time Telemetry Pipelines

January 20, 2026

Real-time telemetry pipelines have a deceptively simple job: move data from sensors to consumers quickly and correctly. In practice, "correctly" is the hard part. This post documents patterns I've been studying and applying while working on monitoring infrastructure at Hydro One.

The Core Problem: Distributed Agreement

A telemetry pipeline is a distributed system in miniature. Field devices produce state. A pipeline transports and transforms it. A backend stores and surfaces it. At any moment, these three layers can hold different views of reality. The reliability problem is fundamentally about minimizing the window in which they disagree, and detecting it when they do.

Pattern 1: Structured Input Simulation

The most effective validation technique I've used is injecting structured input events at the source and verifying their exact representation at every downstream layer. This is different from unit testing the pipeline logic. It tests the live system end-to-end, including serialization, network transport, and backend ingestion.

```python
import time

# Tunable constants; values depend on the pipeline's latency budget.
MAX_PROPAGATION_SECONDS = 30.0
POLL_INTERVAL = 0.5


class PropagationTimeoutError(Exception):
    """Raised when an injected event never reaches the backend."""


def simulate_alarm_event(device_id: str, point: str, state: bool):
    """
    Inject a synthetic alarm state change and verify
    propagation through the full pipeline.

    `publish` and `query_backend` are pipeline-specific adapters.
    """
    payload = {
        "device": device_id,
        "point": point,
        "value": int(state),
        "timestamp": time.time_ns(),
    }
    publish(payload)

    # Poll the backend until the state propagates or the deadline passes.
    deadline = time.monotonic() + MAX_PROPAGATION_SECONDS
    while time.monotonic() < deadline:
        backend_state = query_backend(device_id, point)
        if backend_state == state:
            return True
        time.sleep(POLL_INTERVAL)

    raise PropagationTimeoutError(device_id, point)
```

Pattern 2: Cross-Layer Failure Tracing

When a telemetry discrepancy is detected, the failure can be in any layer. A systematic trace works from the source outward: verify the device reported the expected value, then verify the pipeline transported it unchanged, then verify the backend ingested and stored it correctly. Skipping layers wastes time and misses intermittent faults.

  • Layer 1 (Device): Did the firmware generate the correct output for the given input?
  • Layer 2 (Transport): Was the message delivered without corruption or loss?
  • Layer 3 (Backend): Was the value ingested, parsed, and stored with the correct semantics?
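The outward trace above can be sketched as an ordered walk that stops at the first layer whose view diverges. The per-layer check functions here (`check_device` and friends) are hypothetical placeholders for whatever verification each layer exposes:

```python
from typing import Callable

def trace_discrepancy(
    device_id: str,
    point: str,
    checks: list[tuple[str, Callable[[str, str], bool]]],
) -> str:
    """Walk layers from the source outward; return the first failing layer.

    `checks` is an ordered list of (layer_name, check_fn) pairs, where each
    check_fn returns True if that layer's view of (device, point) is correct.
    """
    for layer_name, check in checks:
        if not check(device_id, point):
            return layer_name  # first layer whose view diverges
    return "consistent"  # every layer agrees; the discrepancy was transient
```

Because the walk runs strictly source-outward, a transport failure is never misattributed to the backend: the backend check simply never runs.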

Pattern 3: Idempotent State Reporting

Field devices in unreliable network environments will retransmit. Pipelines that are not idempotent will duplicate state changes, producing phantom alarm events in dashboards. The fix is to key every ingest operation on a deterministic event ID derived from device ID, point, timestamp, and value. Duplicate deliveries get silently deduplicated at the backend.
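A minimal sketch of that keying scheme, with a hashed event ID and an in-memory set standing in for backend-side dedup storage (the class and field names are illustrative, not from any real backend):

```python
import hashlib
import json

def event_id(device: str, point: str, timestamp: int, value: int) -> str:
    """Deterministic ID: identical retransmissions hash to the same key."""
    raw = json.dumps([device, point, timestamp, value], separators=(",", ":"))
    return hashlib.sha256(raw.encode()).hexdigest()

class DedupIngest:
    """Toy ingest layer that silently drops duplicate deliveries."""

    def __init__(self):
        self._seen: set[str] = set()
        self.stored: list[dict] = []

    def ingest(self, msg: dict) -> bool:
        key = event_id(msg["device"], msg["point"],
                       msg["timestamp"], msg["value"])
        if key in self._seen:
            return False  # duplicate delivery; deduplicated, not re-stored
        self._seen.add(key)
        self.stored.append(msg)
        return True
```

Retransmitting the same payload is now a no-op: the second `ingest` call returns `False` and the store holds a single event, so no phantom alarm appears downstream.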

Reliability is not the absence of failure. It is the containment of failure to a known, recoverable scope.

This is still a work in progress. I'm continuing to document patterns as I encounter and validate them. The goal is a set of testable, language-agnostic patterns for anyone building telemetry infrastructure where correctness is non-negotiable.