Found a Bug Running 100 Simulated Routers
I spun up a 100-router simulation to see what would break. Something did.
The Setup
The simulation uses a mock RouterOS API server that speaks the real binary wire protocol. Each instance returns realistic, slowly drifting metrics — CPU load follows a sine wave with random noise and occasional spikes, interface counters increment at plausible rates, wireless client counts fluctuate. From the poller's perspective, these are real devices.
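A drifting metric like the simulated CPU load can be generated with a few lines of Python. This is a minimal sketch of the idea, not the actual simulator's code — the base load, period, and spike parameters here are made up for illustration:

```python
import math
import random

def cpu_load_sample(t: float, base: float = 30.0, period: float = 3600.0,
                    spike_chance: float = 0.02) -> float:
    """One simulated CPU-load reading (percent) at time t seconds.

    Sine-wave drift plus gaussian noise, with an occasional spike,
    clamped to the valid 0-100 range.
    """
    drift = 15.0 * math.sin(2 * math.pi * t / period)
    noise = random.gauss(0, 2.0)
    spike = random.uniform(30, 60) if random.random() < spike_chance else 0.0
    return max(0.0, min(100.0, base + drift + noise + spike))

# Sample one reading per 60-second poll cycle across an hour.
readings = [cpu_load_sample(t) for t in range(0, 3600, 60)]
```

Because the signal is anchored to a slow sine wave rather than a pure random walk, consecutive polls look plausibly correlated, which is what makes the data "realistic" to a dashboard.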
101 mock devices across three tenants, all being polled every 60 seconds. That's about 500 NATS messages per cycle covering device status, health metrics, interface statistics, wireless data, and firmware checks. The kind of sustained load you'd see in a real MSP deployment.
What Happened
Everything worked fine for hours. The dashboard showed live data, metrics were flowing into TimescaleDB, events were streaming. Then around the 10-hour mark, the API started returning empty responses. Health checks failed. The poller kept running but the web interface was dead.
Container stats told the story: NATS JetStream was at 125MB out of its 128MB memory limit. It was essentially out of memory.
The Root Cause
JetStream retains messages in the stream until they expire or hit a configured limit. When consumers — the API's metrics subscriber, firmware subscriber, SSE manager, and so on — read and process a message, that advances the consumer's cursor. It does not delete the message from the stream.
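The distinction between "consumed" and "deleted" is easy to model. This toy class is not the NATS API — it just illustrates the retention semantics described above: each consumer is a cursor into a shared log, and acknowledging a message moves the cursor without removing anything:

```python
class ToyStream:
    """Toy model of JetStream retention: consuming advances a per-consumer
    cursor; it never deletes messages from the stream itself."""

    def __init__(self):
        self.messages = []   # retained until expiry or a configured limit
        self.cursors = {}    # consumer name -> index of next unread message

    def publish(self, msg):
        self.messages.append(msg)

    def consume(self, consumer):
        i = self.cursors.get(consumer, 0)
        if i >= len(self.messages):
            return None                  # caught up
        self.cursors[consumer] = i + 1   # ack advances the cursor...
        return self.messages[i]          # ...but the message stays put

stream = ToyStream()
for n in range(1000):
    stream.publish(f"device.status.{n}")

# Drain everything with one consumer.
while stream.consume("metrics-subscriber") is not None:
    pass

# Every message processed — and all 1000 still occupy stream memory.
```

Swap "list in RAM" for "JetStream memory storage" and this is exactly the failure mode: fully caught-up consumers, fully retained stream.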
So every device status event, every health metric, every firmware check from the last 24 hours was still sitting in NATS memory. All of it already consumed, processed, and safely written to Postgres. None of it was needed anymore.
This was effectively a 24-hour replay buffer that nothing was replaying.
The Math
101 devices, 5 messages each per poll cycle, once per minute. That's roughly 727,000 messages per day at 400-600 bytes each. North of 300MB before the 24-hour expiry window even starts trimming. The 128MB container memory limit — which I set — never stood a chance.
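The back-of-envelope math works out like this (the 400-600 byte range is the per-message size quoted above):

```python
devices = 101
msgs_per_cycle = 5        # status, health, interfaces, wireless, firmware
cycles_per_day = 24 * 60  # one poll per minute

msgs_per_day = devices * msgs_per_cycle * cycles_per_day  # 727,200

low_mb = msgs_per_day * 400 / 1024**2    # ~277 MB at the small end
high_mb = msgs_per_day * 600 / 1024**2   # ~416 MB at the large end

container_limit_mb = 128                 # exceeded well before 24h expiry
```

Even the most optimistic end of the range overshoots the 128MB limit by more than 2x, so the stream was guaranteed to fill long before the age-based expiry ever trimmed anything.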
With 10 devices in development, this was invisible. The daily volume was maybe 30-40MB — comfortably inside the limit even with a full 24-hour window retained. You'd never notice. Scale to 100 and the math changes completely.
The Fix
Added a 64MB byte cap to the DEVICE_EVENTS stream with a discard-oldest policy. When the stream fills up, the oldest messages get dropped. Since every message has already been consumed and persisted to the database by that point, nothing is lost.
The cap was applied live to the running system. NATS immediately trimmed from 133MB to 64MB by discarding old messages. The API came back up. Two lines in the stream configuration.
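The two lines correspond to something like the following in the JetStream stream configuration. The field names (`max_bytes`, `discard`) are standard JetStream stream-config fields; the surrounding fragment is a sketch, not the project's actual config file:

```json
{
  "name": "DEVICE_EVENTS",
  "max_bytes": 67108864,
  "discard": "old"
}
```

With `discard` set to `old`, hitting the 64MB byte cap drops the oldest messages first, rather than rejecting new publishes — which is what makes applying it to a live, over-full stream safe.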
The Tradeoff
The replay window is now shorter. If a consumer goes down for a long time and comes back, it might miss messages that were already discarded. In practice this is acceptable — the consumer will catch current state on the next poll cycle, and the historical data is already persisted in TimescaleDB where it belongs.
A message broker shouldn't be doing the job of a time-series database. If durable replay ever becomes important — for audit trails or compliance — that's a storage problem, not a messaging problem.
What This Actually Reveals
Infrastructure defaults are not your defaults. JetStream's retention behavior is well-documented and correct. But the default — keep everything until the max age expires — assumes you've thought about how much data that is. I hadn't. Not at scale.
This is the kind of bug that doesn't show up in development, doesn't show up in code review, and doesn't show up in unit tests. It shows up when you run 100 devices for 10 hours and watch what happens. That's why simulation testing matters more than most people think it does.
The system handled the load just fine functionally. Every message was processed correctly. Every metric was stored. The architecture was right. The operational configuration was wrong. Those are different problems, and they require different kinds of testing to find.
The Bottom Line
This is why I don't trust anything until I try to break it.
The Other Dude is open source MikroTik fleet management — read the docs or view on GitHub.