At 1746 UTC we kicked off a production deployment that turned out to contain bad code, which caused HTTP 500 errors across the Statuspage.io platform. The issue affected status pages, the Manage portal, and API calls.
At 1800 UTC the misbehaving version of the code reached live servers, and all status pages began returning 500 errors shortly thereafter. We detected the issue at around 1803 UTC and made the call to roll back at 1805 UTC.
Unfortunately, our CI-automated rollback process turned out to be in a broken state. It did not hard fail until 1812 UTC, at which point we kicked off a manual rollback that brought the site back online at 1814 UTC.
Since this incident, we have tightened our rollback policy and process so that we respond to situations like this sooner and the rollback itself completes faster. We also now run a regularly scheduled test of our rollback process to confirm it still works the way we expect. Finally, we recently shipped additional quality checks in our production rollout process to prevent broken deployments from reaching live servers in the first place.
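
To illustrate the kind of safeguard this points toward, here is a minimal sketch, not our actual tooling, of an automated rollback that is given a hard deadline and falls back to a manual rollback command if it fails or runs too long. The function name `rollback_with_deadline`, the two command arguments, and the deadline value are all hypothetical and chosen only for illustration.

```python
import subprocess


def rollback_with_deadline(automated_cmd, manual_cmd, deadline_seconds):
    """Run the automated rollback, falling back to the manual one.

    All names here are hypothetical. `automated_cmd` and `manual_cmd`
    are argument lists suitable for subprocess.run.
    Returns "automated" or "manual" depending on which path succeeded.
    """
    try:
        # Bound the automated path so a hung rollback cannot stall
        # the response; subprocess.run kills the child on timeout.
        subprocess.run(automated_cmd, check=True, timeout=deadline_seconds)
        return "automated"
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # The automated rollback hung, exited non-zero, or could not
        # start: switch to the manual rollback right away.
        subprocess.run(manual_cmd, check=True)
        return "manual"
```

With a deadline of a minute or two, a hung automated rollback would hand off to the manual path within that window instead of waiting for the automated process to hard fail on its own.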