Real-time telemetry pipelines have a deceptively simple job: move data from sensors to consumers quickly and correctly. In practice, "correctly" is the hard part. This post documents patterns I've been studying and applying while working on monitoring infrastructure at Hydro One.
A telemetry pipeline is a distributed system in miniature. Field devices produce state. A pipeline transports and transforms it. A backend stores and surfaces it. At any moment, these three layers can hold different views of reality. The reliability problem is fundamentally about minimizing the window in which they disagree, and detecting it when they do.
The most effective validation technique I've used is injecting structured input events at the source and verifying their exact representation at every downstream layer. This is different from unit testing the pipeline logic. It tests the live system end-to-end, including serialization, network transport, and backend ingestion.
```python
import time

# Tune these for the pipeline's expected propagation latency.
MAX_PROPAGATION_SECONDS = 10.0
POLL_INTERVAL = 0.25

def simulate_alarm_event(device_id: str, point: str, state: bool) -> bool:
    """
    Inject a synthetic alarm state change and verify
    propagation through the full pipeline.
    """
    payload = {
        "device": device_id,
        "point": point,
        "value": int(state),
        "timestamp": time.time_ns(),
    }
    publish(payload)
    # Poll the backend until the state propagates or the deadline passes.
    deadline = time.monotonic() + MAX_PROPAGATION_SECONDS
    while time.monotonic() < deadline:
        backend_state = query_backend(device_id, point)
        if backend_state == state:
            return True
        time.sleep(POLL_INTERVAL)
    raise PropagationTimeoutError(device_id, point)
```

When a telemetry discrepancy is detected, the failure can be in any layer. A systematic trace works from the source outward: verify the device reported the expected value, then verify the pipeline transported it unchanged, then verify the backend ingested and stored it correctly. Skipping layers wastes time and misses intermittent faults.
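That source-outward trace can be sketched as a small helper. The layer readers here are hypothetical stand-ins for whatever inspection interface each layer actually exposes; the point is the ordering, not the specific calls:

```python
def trace_discrepancy(expected, layer_readers):
    """
    Walk the layers from source outward and return the first
    layer whose observed value diverges from the expected one.
    layer_readers: ordered (layer_name, read_fn) pairs.
    """
    for name, read in layer_readers:
        observed = read()
        if observed != expected:
            return name, observed  # first divergent layer
    return None, expected  # all layers agree

# Usage with stubbed readers, ordered device -> pipeline -> backend:
layers = [
    ("device", lambda: 1),
    ("pipeline", lambda: 1),
    ("backend", lambda: 0),  # backend holds a stale value
]
divergent, value = trace_discrepancy(1, layers)
# divergent == "backend": the fault is at or before backend ingestion
```

Because the walk stops at the first divergence, an intermittent fault in the pipeline can't be misattributed to the backend, which is exactly what skipping layers risks.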
Field devices in unreliable network environments will retransmit. Pipelines that are not idempotent will duplicate state changes, producing phantom alarm events in dashboards. The fix is to key every ingest operation on a deterministic event ID derived from device ID, point, timestamp, and value. Duplicate deliveries get silently deduplicated at the backend.
Reliability is not the absence of failure. It is the containment of failure to a known, recoverable scope.
This is still a work in progress. I'm continuing to document patterns as I encounter and validate them. The goal is a set of testable, language-agnostic patterns for anyone building telemetry infrastructure where correctness is non-negotiable.