Statuspage platform outage (manage and hosted pages and API throwing 500s)

Incident Report for Atlassian Statuspage

Postmortem

Summary

On August 15, 2018, between 10:43 and 11:02 PDT, an incident caused our application to run in a degraded state from 10:43 to 10:55 PDT and then to be completely unavailable from 10:55 to 11:02 PDT. Functionality was fully restored at 11:02 PDT. While degraded, the application continued to serve customer status pages that had already been cached; uncached pages, the management portal, and the authenticated API responded with 500 errors.

Background

In all of our services, once a change is merged into the mainline branch of our version control system, that version of the code is deployed to a fresh set of compute nodes. After the new deployment has been validated as healthy (by first running a basic healthcheck and then a set of semantic checks against it), the DNS routing for that service is updated so that the new deployment becomes active and starts receiving production traffic. The old deployment nodes are then preserved for another hour, so that if any issues are discovered in the new version, we can quickly roll back to the old version of the service.
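As a rough illustration of the flow described above, the following Python sketch models a deployment that is activated only after its checks pass, with the old deployment kept for one hour as a rollback target. The names `Deployment`, `Service`, and `ROLLBACK_WINDOW_SECONDS` are hypothetical, and the DNS switch is modeled as a simple reassignment; this is not our actual deployment tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

ROLLBACK_WINDOW_SECONDS = 60 * 60  # old nodes are preserved for one hour


@dataclass
class Deployment:
    version: str
    retire_at: Optional[float] = None  # None: no retirement scheduled


class Service:
    """Hypothetical model of one service and its active/previous deployments."""

    def __init__(self, active: Deployment) -> None:
        self.active = active
        self.previous: Optional[Deployment] = None

    def deploy(self, candidate: Deployment,
               checks: List[Callable[[Deployment], bool]], now: float) -> bool:
        # Run the basic healthcheck and semantic checks in order;
        # the candidate is activated only if every check passes.
        if not all(check(candidate) for check in checks):
            return False
        # Stand-in for the DNS update: the candidate starts receiving traffic,
        # and the old deployment is scheduled for retirement.
        self.previous = self.active
        self.previous.retire_at = now + ROLLBACK_WINDOW_SECONDS
        self.active = candidate
        return True

    def can_roll_back(self, now: float) -> bool:
        return (self.previous is not None
                and self.previous.retire_at is not None
                and now < self.previous.retire_at)

    def roll_back(self, now: float) -> None:
        if not self.can_roll_back(now):
            raise RuntimeError("old deployment has already been retired")
        restored = self.previous
        restored.retire_at = None
        self.active, self.previous = restored, None
```

In this model a quick rollback is only possible while `can_roll_back` returns true, which is why the length of the retention window matters.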

Cause

The underlying cause for this outage was an erroneous configuration change that was applied to our database multiplexing service.

The validation code responsible for checking the health of the multiplexing service did not exercise the parts of our configuration that had been broken. As a result, the new deployment of the database multiplexing service passed validation and was activated.

At that point, several fresh deployments of other services that depended on the database multiplexing service started failing because they could not get a connection to it. We incorrectly diagnosed these failures as independent breakages of the other services and reacted by reverting those changes.

Meanwhile, the existing deployments of other services were still using the old version of the database multiplexing service, and as a result, the application was still healthy, albeit running on "old" code. An hour later, after the old version of the database multiplexing service was automatically retired, the live application began attempting to use the broken database multiplexing service, which resulted in the partial degradation and eventually complete outage of the Statuspage platform.
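The delayed failure can be pictured with a small timeline model. The sketch below is a simplification under stated assumptions: time is counted in minutes from the activation of the new multiplexer deployment, the old deployment is retired at minute 60, and the names `MultiplexerDeployment`, `live_app_is_healthy`, and `fresh_deploy_is_healthy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultiplexerDeployment:
    name: str
    config_ok: bool
    retired_at: Optional[float] = None  # minute of retirement, None if still kept

    def accepts_connections(self, minute: float) -> bool:
        retired = self.retired_at is not None and minute >= self.retired_at
        return self.config_ok and not retired


def live_app_is_healthy(minute: float,
                        old: MultiplexerDeployment,
                        new: MultiplexerDeployment) -> bool:
    # Existing service deployments keep using the old multiplexer
    # until it is retired, and only then move to the new one.
    if old.retired_at is None or minute < old.retired_at:
        return old.accepts_connections(minute)
    return new.accepts_connections(minute)


def fresh_deploy_is_healthy(minute: float, new: MultiplexerDeployment) -> bool:
    # Fresh deployments of dependent services connect to the new multiplexer.
    return new.accepts_connections(minute)


old = MultiplexerDeployment("old", config_ok=True, retired_at=60)
new = MultiplexerDeployment("new", config_ok=False)
```

With these values, fresh deployments fail from the start while the live application stays healthy until minute 60, which matches the sequence of events described above.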

What we are changing going forward

This incident highlighted several gaps in our deployment process, and we have planned the following improvements to ensure that this class of failure does not reoccur:

We will expand the service validation code of our database multiplexing service so that each configuration used by a dependent service is exercised before a deployment is validated as healthy.
We will add a batch of service-level tests for our database multiplexing service to our CI pipeline to validate the correctness of configuration changes to it.
We will extend the lifetime of old deployments of infrastructure-critical services, such as the database multiplexing service, to increase the chances of being able to roll back successfully.
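To make the first improvement more concrete, here is a minimal sketch of a validation step that exercises every configuration a dependent service relies on, rather than a single generic check. The function `validate_multiplexer`, the shape of `pool_configs` and `dependents`, and the injected `try_connect` callable are all assumptions made for illustration; they do not describe our real validation code.

```python
from typing import Callable, Dict, List


def validate_multiplexer(pool_configs: Dict[str, dict],
                         dependents: Dict[str, List[str]],
                         try_connect: Callable[[dict], bool]) -> List[str]:
    """Return a list of failures; an empty list means the deployment is healthy.

    pool_configs: configuration per connection pool in the candidate deployment.
    dependents:   for each dependent service, the pools it connects through.
    try_connect:  opens a test connection using one pool configuration.
    """
    failures: List[str] = []
    for service, pools in sorted(dependents.items()):
        for pool in pools:
            config = pool_configs.get(pool)
            if config is None:
                failures.append(f"{service}: pool '{pool}' is not configured")
            elif not try_connect(config):
                failures.append(f"{service}: pool '{pool}' failed a test connection")
    return failures
```

A deployment would be activated only when the returned list is empty. The same function could also run as a service-level test in CI against a candidate configuration, with `try_connect` pointed at a test database.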

We apologize for the disruption in service as a result of this incident and thank you for trusting us with your incident communication. If you have any questions relating to this incident, please do not hesitate to contact us at hi@statuspage.io.

Posted Aug 30, 2018 - 13:50 PDT

Resolved

We've confirmed our systems are operating as expected again.

Once we complete our internal incident review process, we will publish a more detailed postmortem of what went wrong, along with steps we're taking to prevent this from happening again in the future.

Posted Aug 15, 2018 - 11:46 PDT

Monitoring

We've rolled back the database configuration change. All systems should be operational again now; however, we will continue to monitor.

Posted Aug 15, 2018 - 11:03 PDT

Identified

We've identified the problem as a change made to our database configuration, which is causing hosted pages, the management portal, and the API to respond with 500 errors. In-progress notification sends are also currently paused as a result. Next update in 5 minutes.

Posted Aug 15, 2018 - 10:56 PDT

Investigating

Some attempts to load pages may be resulting in errors. We are actively investigating the issue.

Posted Aug 15, 2018 - 10:53 PDT

This incident affected: Hosted Pages (HTTP Pages, HTTPS Pages, Status Embed Widget, Public API, Shortlinks), Management (Web Portal, Authenticated API), Notifications (Email, SMS, Webhook, Twitter), and System Metrics (Pingdom Integration, Librato Integration, New Relic Integration, Datadog Integration, Custom Integration).