On the 15th of August between 1043 PDT and 1102 PDT an incident occurred which caused our application to run in a degraded state between 1043 and 1055 PDT, and eventually resulted in the application being completely unavailable between 1055 and 1102 PDT. Functionality was fully restored at 1102 PDT. In its degraded state, the application was serving customer status pages that had already been cached, however uncached pages, the management portal, and the authenticated API were responding with 500 errors.
In all of our services, once a change is merged into the mainline branch of our version control system, that version of the code is deployed to a fresh set of compute nodes. After the new deployment has been validated as healthy (by first running a basic healthcheck and then a set of semantic checks against it), the DNS routing for that service is updated so that the new deployment becomes active and starts receiving production traffic. The old deployment nodes are then preserved for another hour, so that if any issues are discovered in the new version, we can quickly roll back to the old version of the service.
The underlying cause for this outage was an erroneous configuration change that was applied to our database multiplexing service.
In this case, the service validation code that was responsible for validating the health of the multiplexing service performed a check that did not exercise the parts of our configuration that had been broken, which led to the new deployment of the database multiplexing service being activated.
At that point, several fresh deployments of other services which depended on the database multiplexing service started failing, as they were unable to get a connection to the database multiplexing service; we incorrectly diagnosed this as independent breakages of the other services and reacted by reverting those changes.
Meanwhile, the existing deployments of other services were still using the old version of the database multiplexing service, and as a result, the application was still healthy, albeit running on "old" code. An hour later, after the old version of the database multiplexing service was automatically retired, the live application began attempting to use the broken database multiplexing service, which resulted in the partial degradation and eventually complete outage of the Statuspage platform.
This incident highlighted several gaps in our deployment process, and we have planned the following improvements to ensure that this class of failure does not reoccur:
We apologize for the disruption in service as a result of this incident and thank you for trusting us with your incident communication. If you have any questions relating to this incident, please do not hesitate to contact us at hi@statuspage.io.