Why every deploy needs a rollback window
Your health check is lying to you. It says the new revision is up. The container booted, it answered HTTP 200 on /health, the load balancer shifted traffic over. Everything looks green. Then your Sentry inbox explodes because every checkout is throwing 500s and the bug is in the new code. The rollback button is one click away, and you click it three minutes too late.
Health checks are necessary, but they are not sufficient. They prove the process is alive. They cannot prove the new code is correct under real traffic. The gap between "boots" and "actually works in production" is where most outages live.
The cheapest reliability upgrade you can buy
A rollback window is the simplest possible fix. After your new revision takes traffic, you wait a short period, watch a few real metrics, and if something obvious goes wrong you automatically switch back to the previous version. That is it. No traffic splitting, no fancy service mesh, no extra infrastructure. Just a brief observation period with a finger on the rollback button, except the finger is a script.
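The whole idea fits in a dozen lines. Here is a minimal sketch in Python, assuming two hypothetical callables, `fetch_error_rate` and `rollback`, that you would wire up to your own metrics source and deploy tooling:

```python
import time

WINDOW_SECONDS = 300          # 5 minute observation window
POLL_SECONDS = 30             # how often to look at the metrics
ERROR_RATE_THRESHOLD = 0.05   # 5 percent of requests returning 5xx

def watch_deploy(fetch_error_rate, rollback):
    """Observe a fresh deploy and roll back on an obvious regression.

    `fetch_error_rate` and `rollback` are hypothetical hooks into your
    own metrics source and deploy tooling.
    """
    deadline = time.monotonic() + WINDOW_SECONDS
    while time.monotonic() < deadline:
        if fetch_error_rate() > ERROR_RATE_THRESHOLD:
            rollback()
            return "ROLLED_BACK"
        time.sleep(POLL_SECONDS)
    return "SUCCEEDED"
```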
Most teams with any kind of deploy pipeline already have something that looks vaguely like this: post in Slack when a deploy ships, then hover over the metrics dashboard for ten minutes. The problem is that this only works at human scale. It does not work for indie projects that deploy at 2am, it does not work for unattended CI deploys, and it does not work when the on-call person is grabbing lunch. Automating it takes the human out of the loop and makes every deploy safer.
What metrics actually matter
You do not need a hundred metrics. You need three.
Error rate. The percentage of requests returning 5xx, both from the application and from the load balancer itself. This is the single most important signal because it is the most direct measure of "did my new code break something". If error rate jumps from 0.1 percent to 12 percent during the window, roll back.
Latency. The 95th percentile of response time. A new revision that is functionally correct but ten times slower is also a regression worth catching. Latency is noisier than error rate, so the threshold should be loose (say 3x the previous baseline) and you should look at sustained breaches, not single spikes.
Healthy host count. If the new revision keeps crashing and restarting under real traffic, you will see the healthy host count flap. The ECS deployment circuit breaker catches this case for you, but having it in the rollback signal too is cheap insurance.
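For concreteness, here is roughly what pulling those three signals out of CloudWatch looks like for an ALB-fronted service. The metric names are standard AWS/ApplicationELB metrics; the dimension values are placeholders:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder dimension values -- substitute your own ALB and target group.
LB = {"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}
TG = {"Name": "TargetGroup", "Value": "targetgroup/my-tg/0123456789abcdef"}

def alb_stat(metric, stat, dims, minutes=5):
    """Fetch one AWS/ApplicationELB metric over the observation window."""
    now = datetime.now(timezone.utc)
    kwargs = dict(
        Namespace="AWS/ApplicationELB",
        MetricName=metric,
        Dimensions=dims,
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=60,
    )
    # Percentiles go through ExtendedStatistics, plain stats through Statistics.
    kwargs["ExtendedStatistics" if stat.startswith("p") else "Statistics"] = [stat]
    return cloudwatch.get_metric_statistics(**kwargs)["Datapoints"]

app_5xx = alb_stat("HTTPCode_Target_5XX_Count", "Sum", [LB])  # 5xx from the app
lb_5xx = alb_stat("HTTPCode_ELB_5XX_Count", "Sum", [LB])      # 5xx from the ALB itself
requests = alb_stat("RequestCount", "Sum", [LB])
p95_latency = alb_stat("TargetResponseTime", "p95", [LB])     # seconds
healthy_hosts = alb_stat("HealthyHostCount", "Average", [LB, TG])
```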
The hard part is not the metric, it is the threshold. Set it too tight and you roll back on noise. Set it too loose and you miss the real regression. Two rules of thumb that work in practice: always require a minimum request count (so a quiet environment does not trigger a rollback on a single error), and compare against the previous deploy rather than against an absolute number.
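A sketch of both rules combined. The names, multiplier, and floor are illustrative choices; `baseline_error_rate` is whatever the previous deploy observed:

```python
MIN_REQUESTS = 20      # below this, the sample is too small to trust
RELATIVE_FACTOR = 3    # illustrative: how much worse than baseline is "bad"
ABSOLUTE_FLOOR = 0.05  # illustrative: near-zero baselines should not be hair-trigger

def should_roll_back(requests: int, errors: int, baseline_error_rate: float) -> bool:
    """Apply both rules of thumb: minimum sample size, relative threshold."""
    if requests < MIN_REQUESTS:
        return False  # quiet environment: one error must not trigger a rollback
    error_rate = errors / requests
    # Compare against the previous deploy rather than an absolute number,
    # with a floor so ordinary noise does not look like a regression.
    return error_rate > max(RELATIVE_FACTOR * baseline_error_rate, ABSOLUTE_FLOOR)
```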
How long should the window be?
Five minutes is the right starting point for most apps. It is long enough to catch the obvious regressions: bad config, broken database migration, expired credential, infinite loop in a hot path. It is short enough that nobody is sitting around waiting for their deploy to finish.
There are two cases where you might want a longer window. If your app has scheduled jobs that run on a cron, you may want to wait long enough to see at least one complete cycle. And if your traffic is bursty (most of it lands at noon and at 6pm), a five-minute window starting at 3am may not see any traffic at all. For those cases, fifteen minutes is reasonable and half an hour is the upper bound.
Do you really need traffic splitting?
The full canary playbook involves splitting traffic between the old and new revision: 10 percent to new for five minutes, then 50 percent for another five, then 100 percent. It is great when you have it, and there are tools that make it work. For most teams it is also more complexity than it is worth.
An ECS rolling deploy with a deployment circuit breaker already shifts traffic gradually as new tasks become healthy. The new revision serves a small slice for the first few seconds, then a larger slice as more tasks come up, then full traffic when the rollout completes. You get most of the benefit of a canary without the operational overhead of running two deployments side by side.
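If you are on ECS and have not enabled the circuit breaker yet, it is one service update away. A sketch with boto3; the cluster and service names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder names -- substitute your own cluster and service.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    deploymentConfiguration={
        # Roll tasks gradually: keep the old set serving while new tasks start.
        "maximumPercent": 200,
        "minimumHealthyPercent": 100,
        # Stop the rollout and revert if new tasks keep failing to stabilize.
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
    },
)
```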
What you do not get from a rolling deploy is the post-deploy observation window. The deploy is "done" the moment all tasks are healthy. That is exactly the gap we are filling. A short observation window with auto rollback gives you 90 percent of the canary value at 10 percent of the complexity.
What we ship in Eigon
Every Eigon environment can opt into canary monitoring. It is one toggle in the dashboard. When it is on, every successful deploy enters a configurable monitoring window before being marked SUCCEEDED. A background watcher polls the ALB error rate every 30 seconds. If the error rate exceeds the configured threshold and the window has seen more than the minimum request count, the deployment is automatically rolled back to the previous task definition. Otherwise the deploy is marked SUCCEEDED at the end of the window.
Defaults: a 5 minute window, a 5 percent error rate threshold, and 20 minimum requests. You can adjust all three from the dashboard. The defaults are tuned to be safe for low-traffic side projects without being so loose that they miss real regressions.
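As a config object, the three knobs look something like this; the field names are illustrative, the defaults are the real ones:

```python
from dataclasses import dataclass

@dataclass
class CanaryConfig:
    # Field names are illustrative; the default values match the ones above.
    window_minutes: int = 5             # observation window after each deploy
    error_rate_threshold: float = 0.05  # roll back above 5 percent 5xx
    min_requests: int = 20              # never roll back on a quieter window
```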
The rollback itself is a single ECS service update that switches the task definition back to the prior ARN we recorded right before the deploy started. ForceNewDeployment is set so the change goes out immediately. End to end, a rollback completes in about a minute on a typical environment.
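In boto3 terms, that rollback is roughly the following. The function name and arguments are illustrative; the API call is the standard ECS `update_service`:

```python
import boto3

ecs = boto3.client("ecs")

def roll_back(cluster: str, service: str, previous_task_def_arn: str) -> None:
    """Point the service back at the task definition recorded pre-deploy."""
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=previous_task_def_arn,
        forceNewDeployment=True,  # start replacement tasks immediately
    )
```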
The honest tradeoff
A rollback window adds a few minutes of "the deploy is not officially done yet" to the end of every successful deploy. If you deploy every commit and you deploy a lot, that adds up. The right way to think about it is the same way you think about backups: you almost never need it, and the one time you do, it pays for every other run combined.
For projects that ship to real users, the math is overwhelmingly in favour of having the window. A single bad deploy without a rollback can cost you a customer. A thousand good deploys with a rollback window cost you a few minutes each.
Turn it on for your environment
Canary deploys with auto rollback ship in Eigon v0.7. One toggle in the dashboard, sensible defaults out of the box.
Read the docs