500 Devices Broke the API
The API container fell over this morning.
500 simulated MikroTik devices. A 512MB container. Debug logging turned on. It got OOM-killed. I had to restart it manually.
This is not a dramatic story. It's a config problem that becomes obvious in hindsight, and it's exactly the kind of thing that bites you when you move from dev scale to real scale.
What Was Happening
The mock fleet had been scaled up to around 500 devices over the previous few days. The mock server generates realistic RouterOS responses — interfaces, traffic counters, wireless registration tables, the works. Every two minutes, the poller hits all 500 devices and pushes the results to the API.
Each poll cycle, the API processes:
- Interface metrics for every port on every device
- Wireless registration tables from every AP
- Wireless link discovery and state tracking
- Device interface inventory updates
That's thousands of database inserts and upserts per cycle. For a system designed to manage hundreds of routers, this is normal load. This is what the thing is supposed to handle.
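The post doesn't show the actual schema or ORM code, but the write pattern it describes is a batch of keyed upserts per poll cycle. Here's a minimal sketch of that shape using stdlib sqlite3 for illustration — table and column names are hypothetical, and the real system presumably talks to a server database through SQLAlchemy:

```python
import sqlite3

# Hypothetical table -- the real schema isn't shown in the post.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE interface_metrics (
        device_id  INTEGER NOT NULL,
        interface  TEXT    NOT NULL,
        rx_bytes   INTEGER NOT NULL,
        tx_bytes   INTEGER NOT NULL,
        PRIMARY KEY (device_id, interface)
    )
""")

def upsert_metrics(rows):
    # One upsert per interface per device. At 500 devices with a
    # handful of ports each, this runs thousands of times per cycle.
    conn.executemany("""
        INSERT INTO interface_metrics (device_id, interface, rx_bytes, tx_bytes)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (device_id, interface)
        DO UPDATE SET rx_bytes = excluded.rx_bytes,
                      tx_bytes = excluded.tx_bytes
    """, rows)
    conn.commit()

upsert_metrics([(1, "ether1", 100, 50), (1, "ether2", 10, 5)])
upsert_metrics([(1, "ether1", 200, 90)])  # next cycle updates in place
```

The point of the upsert is that row count stays bounded by fleet size while counters keep moving — the cost per cycle is in write volume and logging, not table growth.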
What Went Wrong
Debug logging was on. The dev environment had LOG_LEVEL=debug, which tells SQLAlchemy to echo every SQL statement to stdout. The application logger was also printing each query with ANSI formatting. So every single INSERT and UPSERT was being string-formatted, colorized, and written to stdout — twice. With 500 devices generating thousands of queries per cycle, that's an enormous amount of string allocation and I/O churn just for log output nobody was reading.
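The exact wiring isn't shown in the post, but a LOG_LEVEL env var typically ends up driving both the stdlib logger and SQLAlchemy's `echo` flag. A sketch of that plumbing (function name and mapping are assumptions):

```python
import logging
import os

def configure_logging() -> bool:
    """Map the LOG_LEVEL env var onto stdlib logging, and return the
    value you'd pass to SQLAlchemy's create_engine(..., echo=...).

    Hypothetical wiring -- the post doesn't show the actual code.
    """
    level_name = os.environ.get("LOG_LEVEL", "info").upper()
    logging.basicConfig(level=getattr(logging, level_name, logging.INFO))
    # echo=True makes SQLAlchemy print every SQL statement to stdout;
    # only turn it on at debug level.
    return level_name == "DEBUG"

# e.g. engine = create_engine(DB_URL, echo=configure_logging())
```

With wiring like this, the failure mode is exactly what the post describes: every statement is rendered once by SQLAlchemy's echo and once more by the application logger.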
Single Gunicorn worker. The dev config ran one worker process. All poll data processing was serialized through a single Python process — no distribution of load, no way to spread memory pressure across processes. One worker means one process accumulating everything.
Container was too small. 512MB was fine when the system was handling 50 or 100 devices. At 500 devices with debug logging, it wasn't even close. The container would climb to 70%+ memory usage during normal operation and eventually hit the wall.
None of these are bugs. They're development-scale defaults that don't survive production-scale load. The kind of thing that works fine until it doesn't.
What the Investigation Showed
docker stats told most of the story: 362MB out of 512MB used, CPU pegged at 112%. The container was spending more time formatting log strings than processing actual data. The logs themselves were wall-to-wall SQL — every INSERT, every UPSERT, every COMMIT and ROLLBACK, printed twice with full parameter lists.
The container's restart policy was on-failure, so after the OOM kill it came back up, loaded the same config, and started climbing toward the same ceiling. Rinse and repeat until someone noticed.
The Fix
Three changes in docker-compose.override.yml. Nothing clever.
LOG_LEVEL: debug → info. This had the biggest impact. Stopped SQLAlchemy from echoing every query. Stopped the double-logging. Removed the single largest source of memory churn. If you're not actively debugging SQL, you don't need to see every INSERT scroll past.
GUNICORN_WORKERS: 1 → 2. Spreads request processing across two worker processes. Each process handles a portion of the incoming poll data, reducing per-process memory accumulation. Not a radical change, but it matters when you're processing thousands of writes per cycle.
Memory limit: 512MB → 1GB. Gives the API actual headroom for this workload. 512MB was a dev-era guess. 1GB reflects what the system actually needs when managing hundreds of devices.
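Put together, the override amounts to something like this — service and variable names here are illustrative, not copied from the real repo, and your compose layout may set the memory limit differently (e.g. `mem_limit` outside Swarm):

```yaml
# docker-compose.override.yml -- illustrative sketch of the three changes
services:
  api:
    environment:
      LOG_LEVEL: info        # was: debug
      GUNICORN_WORKERS: "2"  # was: 1
    deploy:
      resources:
        limits:
          memory: 1g         # was: 512m
```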
Before and After
Before: 362MB / 512MB — 71% memory usage, climbing toward OOM.
After: 307MB / 1GB — 30% memory usage, stable under the same load.
Same 500 devices. Same poll interval. Same data volume. The system just stopped wasting resources on logging nobody was reading and got enough room to breathe.
The Takeaway
Development-scale configs don't survive production-scale load. This is not surprising. But it's easy to forget when the thing has been running fine for weeks at a smaller scale and you gradually crank it up.
Debug logging is expensive. Not just disk space — string formatting, memory allocation, I/O buffering. At scale, your logging layer can consume more resources than your actual application logic. Turn it off unless you're actively using it.
Container sizing matters. The number you picked when you had 50 devices is not the number you need at 500. Review your resource limits when your workload changes. docker stats is right there.
If you're self-hosting this with more than a couple hundred devices, don't run the default dev config. Bump your memory limits. Set the log level to info. Give the API more than one worker. The defaults are tuned for a developer laptop, not a production deployment.
Better it dies in a test environment than at 2am managing real infrastructure.