
500 Devices Broke the API

The API container fell over this morning.

500 simulated MikroTik devices. A 512MB container. Debug logging turned on. It got OOM-killed. I had to restart it manually.

This is not a dramatic story. It's a config problem that becomes obvious in hindsight, and it's exactly the kind of thing that bites you when you move from dev scale to real scale.

What Was Happening

The mock fleet had been scaled up to around 500 devices over the previous few days. The mock server generates realistic RouterOS responses — interfaces, traffic counters, wireless registration tables, the works. Every two minutes, the poller hits all 500 devices and pushes the results to the API.

Each poll cycle, the API processes interface states, traffic counters, and wireless registration entries from all 500 devices.

That's thousands of database inserts and upserts per cycle. For a system designed to manage hundreds of routers, this is normal load. This is what the thing is supposed to handle.

What Went Wrong

Debug logging was on. The dev environment had LOG_LEVEL=debug, which tells SQLAlchemy to echo every SQL statement to stdout. The application logger was also printing each query with ANSI formatting. So every single INSERT and UPSERT was being string-formatted, colorized, and written to stdout — twice. With 500 devices generating thousands of queries per cycle, that's an enormous amount of string allocation and I/O churn just for log output nobody was reading.
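SQLAlchemy's echo behavior is really just the standard library logger `sqlalchemy.engine` set to INFO, which is why a blanket LOG_LEVEL=debug turns every statement into log output. A minimal stdlib-only sketch of the wiring described above — `configure_sql_logging` and the `LOG_LEVEL` env var mapping are illustrative, not this project's actual code:

```python
import logging
import os

def configure_sql_logging() -> None:
    """Map the app's LOG_LEVEL onto SQLAlchemy's engine logger.

    The 'sqlalchemy.engine' logger at INFO emits every SQL statement
    with its parameters; at DEBUG it also emits result rows. Keeping
    it at WARNING silences per-query output entirely.
    """
    level = os.environ.get("LOG_LEVEL", "info").lower()
    sql_logger = logging.getLogger("sqlalchemy.engine")
    if level == "debug":
        sql_logger.setLevel(logging.INFO)    # echo every statement
    else:
        sql_logger.setLevel(logging.WARNING)  # queries stay quiet
```

The point of routing it through one function: the SQL echo is opt-in per environment instead of riding along with every debug flag in the app.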

Single Gunicorn worker. The dev config ran one worker process. All poll data processing was serialized through a single Python process — no distribution of load, no way to spread memory pressure across processes. One worker means one process accumulating everything.

Container was too small. 512MB was fine when the system was handling 50 or 100 devices. At 500 devices with debug logging, it wasn't even close. The container would climb to 70%+ memory usage during normal operation and eventually hit the wall.

None of these are bugs. They're development-scale defaults that don't survive production-scale load. The kind of thing that works fine until it doesn't.

What the Investigation Showed

docker stats told most of the story: 362MB out of 512MB used, CPU pegged at 112%. The container was spending more time formatting log strings than processing actual data. The logs themselves were wall-to-wall SQL — every INSERT, every UPSERT, every COMMIT and ROLLBACK, printed twice with full parameter lists.
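If you want the same snapshot on your own deployment, docker stats can print once and exit instead of running as a live dashboard:

```shell
# One-shot snapshot of per-container memory and CPU;
# --no-stream prints a single reading instead of refreshing live.
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"
```

A memory column sitting near its limit during normal operation is the early warning this container was giving for days.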

The container's restart policy was on-failure, so after the OOM kill it came back up, loaded the same config, and started climbing toward the same ceiling. Rinse and repeat until someone noticed.

The Fix

Three changes in docker-compose.override.yml. Nothing clever.

LOG_LEVEL: debug → info. This change had the biggest impact. It stopped SQLAlchemy from echoing every query, eliminated the double-logging, and removed the single largest source of memory churn. If you're not actively debugging SQL, you don't need to see every INSERT scroll past.

GUNICORN_WORKERS: 1 → 2. Spreads request processing across two worker processes. Each process handles a portion of the incoming poll data, reducing per-process memory accumulation. Not a radical change, but it matters when you're processing thousands of writes per cycle.

Memory limit: 512MB → 1GB. Gives the API actual headroom for this workload. 512MB was a dev-era guess. 1GB reflects what the system actually needs when managing hundreds of devices.
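Taken together, the three changes fit in a short override file. A sketch of what that might look like — the service name, env var names, and exact keys are assumptions based on the description above, not the project's actual file, and the memory-limit key varies by Compose version (`mem_limit` for plain docker-compose, `deploy.resources` under Swarm/newer Compose):

```yaml
# docker-compose.override.yml — illustrative; service and key names assumed
services:
  api:
    environment:
      LOG_LEVEL: info        # was: debug — stops SQL echo and double-logging
      GUNICORN_WORKERS: "2"  # was: 1 — spreads poll data across two processes
    deploy:
      resources:
        limits:
          memory: 1g         # was: 512m — headroom for ~500 devices
```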

Before and After

Before: 362MB / 512MB — 71% memory usage, climbing toward OOM.

After: 307MB / 1GB — 30% memory usage, stable under the same load.

Same 500 devices. Same poll interval. Same data volume. The system just stopped wasting resources on logging nobody was reading and got enough room to breathe.

The Takeaway

Development-scale configs don't survive production-scale load. This is not surprising. But it's easy to forget when the thing has been running fine for weeks at a smaller scale and you gradually crank it up.

Debug logging is expensive. Not just disk space — string formatting, memory allocation, I/O buffering. At scale, your logging layer can consume more resources than your actual application logic. Turn it off unless you're actively using it.

Container sizing matters. The number you picked when you had 50 devices is not the number you need at 500. Review your resource limits when your workload changes. docker stats is right there.

If you're self-hosting this with more than a couple hundred devices, don't run the default dev config. Bump your memory limits. Set the log level to info. Give the API more than one worker. The defaults are tuned for a developer laptop, not a production deployment.

Better it dies in a test environment than at 2am managing real infrastructure.