On June 20th between 1306 PDT and 1334 PDT access to Hosted Pages and Public API was degraded due to a large influx of traffic erroneously bypassing our caching layer. During this time the Management Portal and Authenticated API services were slow to respond but available.
In order to immediately restore service quality to the remainder of Statuspage customers, we restricted traffic on the impacted domain, then started working on a permanent solution. We ensured the customer was not undergoing an incident prior to restricting traffic, and communicated with them immediately about the temporary restriction. At 1455 PDT a fix was deployed to our production infrastructure. The incident was resolved after verifying that the service was handling the increased traffic correctly.
The root cause of this incident was inadvertently exposing an endpoint that was bypassing our caching mechanisms. Isolation of the cause was impeded by several issues. One hinderance was detection of the issue being delayed because our alerting mechanisms didn't fire until ten minutes after the increase in traffic occurred. Another impediment was a lack of application instrumentation required for diagnosis.
As such, we have scheduled a series of changes in order to make improvements in these areas and ensure that the service is more resilient going forward:
We apologize for the disruption in service as a result of this incident and thank you for trusting us with your incident communication. If you have any questions relating to this incident, please do not hesitate to contact us at hi@statuspage.io.