Unresponsive status pages due to influx of non-human traffic

Incident Report for Atlassian Statuspage

Postmortem

Summary

On June 20th, between 13:06 PDT and 13:34 PDT, access to Hosted Pages and the Public API was degraded by a large influx of traffic that erroneously bypassed our caching layer. During this time, the Management Portal and Authenticated API services were slow to respond but remained available.

To immediately restore service quality for the rest of our Statuspage customers, we restricted traffic on the impacted domain and then started working on a permanent solution. Before restricting traffic, we confirmed that the customer was not undergoing an incident, and we communicated with them immediately about the temporary restriction. At 14:55 PDT a fix was deployed to our production infrastructure. The incident was resolved after we verified that the service was handling the increased traffic correctly.

What we are changing going forward

The root cause of this incident was an inadvertently exposed endpoint that bypassed our caching mechanisms. Isolating the cause was impeded by several issues. One hindrance was that detection was delayed: our alerting mechanisms didn't fire until ten minutes after the increase in traffic began. Another was a lack of the application instrumentation required for diagnosis.

As such, we have scheduled a series of changes to improve these areas and make the service more resilient going forward:

We are in the process of conducting a full audit of our public-facing endpoints to ensure that the caching characteristics for all of them are in line with our traffic expectations.
We are adding tooling to our monitoring to ensure that any degradation in customer experience is surfaced as early as possible.
We will work on isolating our web-facing services further, so that increased traffic on one of our services cannot adversely impact the availability of other services.
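
As a rough illustration of what an endpoint caching audit can look like, the sketch below flags endpoints whose response headers would not allow a shared cache to store them. It is not a description of our internal tooling: the function names, the sample paths, and the rule that an endpoint must declare a positive max-age or s-maxage are all assumptions made for this example.

```python
from typing import Dict, List, Mapping


def parse_cache_control(value: str) -> Dict[str, str]:
    """Split a Cache-Control header into lowercase directive -> value pairs."""
    directives: Dict[str, str] = {}
    for part in value.split(","):
        name, _, arg = part.strip().lower().partition("=")
        if name:
            directives[name] = arg.strip('"')
    return directives


def is_cacheable(headers: Mapping[str, str]) -> bool:
    """Heuristic (an assumption for this sketch): an endpoint counts as
    cacheable by a shared cache only if it declares a positive lifetime and
    does not opt out with no-store, no-cache, or private."""
    # Header names are case-insensitive, so normalise them before the lookup.
    normalised = {k.lower(): v for k, v in headers.items()}
    directives = parse_cache_control(normalised.get("cache-control", ""))
    if directives.keys() & {"no-store", "no-cache", "private"}:
        return False
    # s-maxage takes precedence over max-age for shared caches.
    for key in ("s-maxage", "max-age"):
        if key in directives:
            return directives[key].isdigit() and int(directives[key]) > 0
    return False


def audit(endpoints: Mapping[str, Mapping[str, str]]) -> List[str]:
    """Return the paths whose headers would bypass a shared cache, sorted."""
    return sorted(p for p, h in endpoints.items() if not is_cacheable(h))


# Hypothetical sample data: path -> response headers captured from that path.
sample = {
    "/example/cached": {"Cache-Control": "public, max-age=30"},
    "/example/uncached": {"Cache-Control": "no-store"},
    "/example/missing": {},
    "/example/shared-off": {"cache-control": "max-age=60, s-maxage=0"},
}

for path in audit(sample):
    print("bypasses cache:", path)
```

A check of this kind is deliberately conservative: an endpoint with no explicit lifetime is reported even though some caches would apply a default, so that every exception has to be reviewed by a person.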

We apologize for the disruption in service caused by this incident and thank you for trusting us with your incident communication. If you have any questions relating to this incident, please do not hesitate to contact us at hi@statuspage.io.

Posted Jul 05, 2018 - 12:34 PDT

Resolved

This incident has been resolved.

Posted Jun 20, 2018 - 16:01 PDT

Update

We have confirmed that all other services are back to normal and are taking steps to permanently resolve the problem for the affected customer.

Posted Jun 20, 2018 - 14:20 PDT

Identified

We have identified the cause as errant traffic from a single customer and have taken action to mitigate it. The issue appears to affect only status pages; the Management Portal is working as normal.

Posted Jun 20, 2018 - 13:38 PDT

Investigating

The site is currently experiencing higher-than-normal load, which may be causing pages to be slow or unresponsive. We're investigating the cause and will provide an update as soon as possible.

Posted Jun 20, 2018 - 13:18 PDT

This incident affected: Hosted Pages (HTTP Pages, HTTPS Pages, Status Embed Widget, Public API, Shortlinks), Authentication (Admin User+Pass, Admin Google Auth, Admin SAML 2.0, Page Access Users, Page Google Auth, Page SAML 2.0, Page IP Restriction), and Management (Web Portal, Authenticated API, DNS Validation, SSL Provisioning, Billing).