Greatly degraded notifications delivery, metrics updates, and authentication
Incident Report for Atlassian Statuspage

At approximately 8:00AM PST, our outbound traffic was greatly reduced by an overloaded NAT server. Our outbound bandwidth in the us-west-2 region of AWS was saturated and we were unable to contact services outside our internal network.

  • By 8:12AM PST, our on-call staff had been notified of the problem.
  • By 8:20AM PST, we had identified the inability to establish an outside connection as the primary cause of the loss of functionality.
  • By 8:50AM PST, we had escalated to the Atlassian global incident management team.
  • By 9:20AM PST, the incident was escalated to the Atlassian network engineering team.
  • By 10:34AM PST, a network engineer had identified the cause and begun to mitigate it. Service started to return to normal.
  • By 11:22AM PST, full functionality had resumed.

During the affected time, notifications and metric updates were severely delayed. Users making use of Google authentication would have been unable to access the management dashboard. Users logged into the management dashboard would have been unable to search for subscribers and incidents.

Since restoring service, we have implemented a large number of changes to improve how we allocate capacity and manage consumption of our network. Additionally, we are homing in on the several factors that led to this incident, and we are working with the teams involved to ensure it does not recur.

Going forward, we are implementing the ability to fail over to another AWS region. This would have allowed us to fully restore functionality in minutes rather than hours.
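
As a rough illustration of what region failover means in practice, the sketch below picks the first region, in priority order, whose outbound-connectivity probe passes. It is a minimal sketch under stated assumptions, not a description of the actual implementation: the `choose_region` helper, the secondary region name `us-east-1`, and the probe results are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def choose_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region (in priority order) whose health probe passes."""
    for region in regions:
        if is_healthy(region):
            return region
    return None  # no region can currently reach external services


# Hypothetical probe results: the primary region has lost outbound connectivity.
probe_results = {"us-west-2": False, "us-east-1": True}

active = choose_region(
    ["us-west-2", "us-east-1"],
    lambda region: probe_results.get(region, False),
)
print(active)  # us-east-1
```

In a real deployment the probe would be an actual connectivity check against external services, and the selection would typically drive DNS or load-balancer routing rather than a print statement.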

Posted 4 months ago. Aug 29, 2017 - 14:57 PDT

Resolved
This incident has been resolved. We will publish a postmortem with additional details shortly.
Posted 4 months ago. Aug 28, 2017 - 13:41 PDT
Update
We're continuing to monitor this and our network remains stable. Notifications, metrics, and authentication are functioning as normal.
Posted 4 months ago. Aug 28, 2017 - 12:47 PDT
Update
Our network remains stable and we're continuing to monitor this.
Posted 4 months ago. Aug 28, 2017 - 11:49 PDT
Monitoring
Our network has become stable and our backlog of outbound notifications and metrics refreshes has caught up. Authentication issues have also been resolved and we're continuing to monitor these issues.
Posted 4 months ago. Aug 28, 2017 - 11:15 PDT
Update
We are still working to bring our network back to full health, and our backlog of outbound notifications has begun to catch up. We are still experiencing sporadic authentication issues and delays to notifications and metrics, and are continuing to work on a long-term fix.
Posted 4 months ago. Aug 28, 2017 - 10:23 PDT
Identified
We have identified the potential root cause saturating our outbound network requests, and are working on a fix. Third-party authentication is still unavailable. Processing of outbound notifications and ingestion of inbound metrics are still delayed, but will be backfilled once the issue is fully resolved.
Posted 4 months ago. Aug 28, 2017 - 09:19 PDT
Investigating
We are investigating an issue causing delays to notification delivery, updates to public metrics, and authentication with third-party providers.
Posted 4 months ago. Aug 28, 2017 - 08:33 PDT
This incident affected: Authentication (Admin User+Pass, Admin Google Auth, Admin SAML 2.0, Page Access Users, Page Google Auth, Page SAML 2.0, Page IP Restriction), Notifications (Email, SMS, Webhook, Twitter), and Public Metrics (Pingdom, Librato, New Relic, Datadog, Custom).