Greatly degraded notifications delivery, metrics updates, and authentication

Incident Report for Atlassian Statuspage

Postmortem

At approximately 8:00AM PST, our outbound traffic was greatly reduced by an overloaded NAT server. Our outbound bandwidth in the us-west-2 region of AWS was saturated and we were unable to contact services outside our internal network.

By 8:12AM PST, our on-call staff had been notified of the problem.
By 8:20AM PST, we had identified the inability to establish an outside connection as the primary cause of the loss of functionality.
By 8:50AM PST, we had escalated to the Atlassian global incident management team.
By 9:20AM PST, the incident was escalated to the Atlassian network engineering team.
By 10:34AM PST, a network engineer had identified the cause and begun to mitigate it. Service started to return to normal.
By 11:22AM PST, full functionality had resumed.

During the affected time, notifications and metric updates were severely delayed. Users making use of Google authentication would have been unable to access the management dashboard. Users logged into the management dashboard would have been unable to search for subscribers and incidents.

Since restoring service, we have implemented a large number of changes to improve how we allocate capacity and manage consumption of our network. Additionally, we are homing in on the factors that led to this incident, and we are working with the teams involved to ensure this does not occur in the future.

Going forward, we are implementing the ability to fail over to another AWS region. This would have allowed us to fully restore functionality in minutes rather than hours.
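
As a rough illustration of what region failover means in practice, the sketch below picks the first region, in priority order, whose outbound connectivity probe succeeds. It is not our actual implementation: the function name, the probe callback, and the secondary region `us-east-1` are all hypothetical and used only for the example.

```python
from typing import Callable, Optional, Sequence


def choose_active_region(
    regions: Sequence[str],
    can_reach_outside: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in priority order, whose outbound probe passes.

    Returns None if no region can reach services outside the internal network.
    """
    for region in regions:
        if can_reach_outside(region):
            return region
    return None


# Hypothetical probe results: the primary region has lost outbound
# connectivity, while an assumed secondary region is healthy.
probe_results = {"us-west-2": False, "us-east-1": True}

active = choose_active_region(
    ["us-west-2", "us-east-1"],
    lambda region: probe_results[region],
)
print(active)
```

In a real deployment the probe would be an actual outbound health check, and switching regions would also involve moving traffic and data, which this sketch leaves out.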

Posted Aug 29, 2017 - 14:57 PDT

Resolved

This incident has been resolved. We will publish a postmortem with additional details shortly.

Posted Aug 28, 2017 - 13:41 PDT

Update

We're continuing to monitor this and our network remains stable. Notifications, metrics, and authentication are functioning as normal.

Posted Aug 28, 2017 - 12:47 PDT

Update

Our network remains stable and we're continuing to monitor this.

Posted Aug 28, 2017 - 11:49 PDT

Monitoring

Our network has become stable and our backlog of outbound notifications and metrics refreshing has caught up. Authentication issues have also been resolved and we're continuing to monitor these issues.

Posted Aug 28, 2017 - 11:15 PDT

Update

We are still working to bring our network back to full health, but our backlog of outbound notifications has begun to catch up. We are still experiencing sporadic authentication issues and delays to notifications and metrics, and are continuing to work on a long-term fix.

Posted Aug 28, 2017 - 10:23 PDT

Identified

We have identified the potential root cause of our saturated outbound network traffic and are working on a fix. Third-party authentication is still unavailable. Processing of outbound notifications and inbound metrics ingestion are still delayed, but will be backfilled once the issue is fully resolved.

Posted Aug 28, 2017 - 09:19 PDT

Investigating

We are investigating an issue leading to delayed delivery of notifications, delayed updates to public metrics, and failures in authentication with third-party providers.

Posted Aug 28, 2017 - 08:33 PDT

This incident affected: Authentication (Admin User+Pass, Admin Google Auth, Admin SAML 2.0, Page Access Users, Page Google Auth, Page SAML 2.0, Page IP Restriction), Notifications (Email, SMS, Webhook, Twitter), and System Metrics (Pingdom Integration, Librato Integration, New Relic Integration, Datadog Integration, Custom Integration).