Infrastructure Issues - Billing and signup impacted

Incident Report for Atlassian Statuspage

Postmortem

SUMMARY

Between December 7, 2021, at 15:54 UTC and December 8, 2021, at 01:55 UTC, Atlassian Cloud services using AWS services in the US-EAST-1 region experienced a failure. This affected customers using Atlassian Access, Bitbucket Cloud, Compass, Confluence Cloud, the Jira family of products, and Trello. Products were unable to operate as expected, resulting in partial or complete degradation of services. The event was triggered by an AWS networking outage in US-EAST-1 that affected multiple AWS services and made AWS APIs and the AWS management console inaccessible. The incident was first reported by Atlassian Access, whose monitoring detected faults accessing DynamoDB services in the region. Recovery of affected Atlassian services occurred on a service-by-service basis from 2021-12-07 21:50 UTC, when the underlying AWS services also began to recover. Full recovery of Atlassian Cloud services was announced at 2021-12-08 01:55 UTC.

IMPACT

The overall impact occurred between December 7, 2021, at 15:54 UTC and December 8, 2021, at 01:55 UTC. The incident caused partial to complete service disruption of Atlassian Cloud services in the US-EAST-1 region. Product-specific impacts are listed below.

The primary impact for customers of Jira Software, Jira Service Management, and Jira Work Management hosted in the US-EAST-1 region was the inability to scale up, which caused slow response times for web requests and delays in background job processing, including webhooks in the AP region. Customers accessing Jira experienced significant latency, and some experienced service unavailability during the incident.

Jira Align experienced an email outage for US customers because the AWS outage affected many AWS services, including Simple Email Service. As a result, a small percentage of Jira Align emails were not sent.

Bitbucket Pipelines was unavailable and pipeline steps failed to execute.

For Jira Automation, the execution of tenants' rules was delayed because CloudWatch was affected.

Confluence experienced a minor impact because of affected upstream services for user management, search, notifications, and media. At the same time, Confluence saw elevated error rates caused by the inability to scale up, and GraphQL had higher latencies.

Trello email-to-board and dashcards features experienced degraded performance.

Atlassian Access reported that product transfers from one organization failed intermittently. Admins were not able to update features such as IP Allowlist, Audit Logs, Data Residency, Custom Domain Email Notification, and Mobile Application Management, although users were still able to access and view these features. During the incident, emails to admins were delayed, and creating and deleting API tokens was degraded.

Statuspage was largely unaffected. However, notification workers could not scale up and communications to customers were delayed, though they could be replayed later. The incident also impacted users trying to sign in to manage portals and private pages.

Compass experienced a minor impact on its ability to write to its primary database store. No core features were affected.

Atlassian's customers could have experienced stale data in production in US-EAST-1 for approximately 30 seconds, against an expected 5 seconds at p99, because of delayed token resolution.
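To make the p99 comparison above concrete, the following is a minimal sketch of how a staleness target might be checked against measured delays. The 5-second target comes from this report; the function names, the nearest-rank percentile method, and the sample values are assumptions for illustration and do not describe Atlassian's actual monitoring.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def breaches_target(staleness_seconds, target_seconds=5.0):
    """True when the p99 staleness exceeds the target (5s in this report)."""
    return percentile(staleness_seconds, 99) > target_seconds


# Hypothetical samples: most reads are fresh within 1s, a few lag by 30s.
normal = [1.0] * 100
degraded = [1.0] * 98 + [30.0] * 2
print(breaches_target(normal), breaches_target(degraded))
```

With these hypothetical samples, only the degraded set exceeds the target, because more than 1% of its reads lag by 30 seconds.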

The provisioning of new cloud tenants was also impacted until the recovery of the services.

ROOT CAUSE

The issue was caused by a problem with several network devices within AWS’s internal network. These devices were receiving more traffic than they were able to process, which led to elevated latency and packet loss. This affected multiple AWS services on which Atlassian's platform relies, causing service degradation and disruption to the products mentioned above. For more information about the root cause, see Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region.

No Atlassian-driven events in the lead-up have been identified as causing or contributing to this incident.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We are taking immediate steps to improve the Atlassian platform's resiliency and availability to reduce the impact of such an event in the future. While Atlassian's Cloud services do run in several regions (US EAST and WEST, AP, EU CENTRAL and WEST, among others) and data is replicated across several regions to increase the resilience against outages of this magnitude, we have identified and are taking actions that include improvements to our region failover process. This will minimize the impact of future outages on Atlassian’s Cloud services and provide better support for our customers.

We are prioritizing the following actions to avoid repeating this type of incident:

Enhance and strengthen our plans for cross-region resiliency and disaster recovery, including continuing to practice region failover in production and investigating and implementing better resilience strategies for services, such as Active/Active or Active/Passive.
Improve and adopt multi-region architecture for services that require it.
Exercise wargaming scenarios that simulate this outage to assess the customer view of the incident. This will allow us to create further action items to improve our region failover process.
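
As an illustration of the two resilience strategies named above, the following minimal sketch contrasts Active/Passive and Active/Active region selection. It is a simplified model under stated assumptions, not a description of Atlassian's failover process: the `Region` type, the `healthy` flag (standing in for a health probe), and the `us-west-2` region name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool  # hypothetical result of a health probe


def pick_active_passive(regions):
    """Active/Passive: send all traffic to the first healthy region
    in priority order; later regions are standbys."""
    for region in regions:
        if region.healthy:
            return region
    raise RuntimeError("no healthy region available")


def pick_active_active(regions):
    """Active/Active: spread traffic across every healthy region."""
    healthy = [region for region in regions if region.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return healthy


# Hypothetical state during the outage: the primary region is unhealthy.
regions = [Region("us-east-1", healthy=False), Region("us-west-2", healthy=True)]
print(pick_active_passive(regions).name)
print([region.name for region in pick_active_active(regions)])
```

In both strategies the unhealthy primary is skipped. The difference is that Active/Passive switches to a standby only on failure, while Active/Active already serves traffic from every healthy region before the failure occurs.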

We apologize to customers whose services were impacted during this incident.

Thanks,

Atlassian Customer Support

Posted Dec 16, 2021 - 08:57 PST

Resolved

This incident has been resolved.

Posted Dec 07, 2021 - 17:10 PST

Monitoring

Sign-in to the manage portal and certain private pages will resume usual authentication through Atlassian Access.

Posted Dec 07, 2021 - 15:33 PST

Update

Sign-in to the manage portal and certain private pages will take place through a link sent via email until the authentication issues have been resolved.

Posted Dec 07, 2021 - 14:24 PST

Update

Notification services have recovered and are operational.

Posted Dec 07, 2021 - 13:55 PST

Update

We're investigating issues affecting notifications. More information will be made available as soon as we can determine the cause and work toward a fix.

Posted Dec 07, 2021 - 13:50 PST

Update

We are continuing to work on a fix for this issue.

Posted Dec 07, 2021 - 13:25 PST

Update

We are continuing to work on a fix for this issue.

Posted Dec 07, 2021 - 12:41 PST

Update

We're investigating issues affecting authentication and sign-in.

Posted Dec 07, 2021 - 11:49 PST

Identified

We're investigating issues affecting billing and signup, which may impact signing into the manage portal and private pages. More information will be made available as soon as we can determine the cause and work toward a fix.

Posted Dec 07, 2021 - 10:59 PST

This incident affected: Notifications (Email, SMS, Webhook, Twitter), Hosted Pages (HTTPS Pages), Authentication (Admin Google Auth, Admin SAML 2.0), Management (Billing), and Signup.