From December 7, 2021, at 15:54 UTC to December 8, 2021, at 01:55 UTC, Atlassian Cloud services using AWS services in the US-EAST-1 region experienced a failure. This affected customers using Atlassian Access, Bitbucket Cloud, Compass, Confluence Cloud, the Jira family of products, and Trello. Products were unable to operate as expected, resulting in partial or complete degradation of services. The event was triggered by an AWS networking outage in US-EAST-1 that affected multiple AWS services and made AWS APIs and the AWS Management Console inaccessible. The incident was first reported by Atlassian Access, whose monitoring detected faults accessing DynamoDB in the region. Affected Atlassian services recovered on a service-by-service basis from December 7, 2021, at 21:50 UTC, as the underlying AWS services began to recover. Full recovery of Atlassian Cloud services was announced on December 8, 2021, at 01:55 UTC.
The overall impact occurred from December 7, 2021, at 15:54 UTC to December 8, 2021, at 01:55 UTC. The incident caused partial to complete disruption of Atlassian Cloud services in the US-EAST-1 region. Product-specific impacts are listed below.
The primary impact for customers of Jira Software, Jira Service Management, and Jira Work Management hosted in the US-EAST-1 region was the inability to scale up, which caused slow response times for web requests and delays in background job processing, including webhooks in the AP region. Customers accessing Jira experienced significant latency, and some experienced service unavailability while the incident was ongoing.
Jira Align experienced an email outage for US customers because the AWS event affected many AWS services, including Amazon Simple Email Service; as a result, a small percentage of Jira Align emails were not sent.
Bitbucket Pipelines was unavailable, and pipeline steps failed to execute.
For Jira Automation, execution of tenants' rules was delayed because CloudWatch was affected.
Confluence experienced minor impact from upstream services affecting user management, search, notifications, and media. Confluence was also affected by elevated error rates related to the inability to scale up, and GraphQL requests had higher latencies.
Trello email-to-board and dashcards features experienced degraded performance.
For Atlassian Access, product transfers out of an organization failed intermittently. Admins were unable to update features such as IP Allowlist, Audit Logs, Data Residency, Custom Domain Email Notification, and Mobile Application Management, although users could still access and view these features. Emails to admins were delayed during the incident, and creating and deleting API tokens was degraded.
Statuspage was largely unaffected. However, notification workers could not scale up, so communications to customers were delayed, though they could be replayed later. The incident also impacted users trying to sign in to manage portals and private pages.
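The replay behavior described above can be illustrated with a buffered queue that holds undelivered notifications until workers recover. This is a minimal sketch under assumed names (`ReplayableQueue` is hypothetical), not Statuspage's actual implementation:

```python
class ReplayableQueue:
    """Hypothetical sketch: buffer notifications so that messages which
    could not be delivered during an outage can be replayed later."""

    def __init__(self):
        self._log = []        # every notification ever enqueued, in order
        self._delivered = 0   # index of the next undelivered notification

    def enqueue(self, message):
        self._log.append(message)

    def deliver(self, send_fn):
        """Attempt delivery in order; on failure, stop and keep the
        remaining messages buffered so a later call can replay them."""
        while self._delivered < len(self._log):
            try:
                send_fn(self._log[self._delivered])
            except ConnectionError:
                return  # delivery stalled; nothing is lost
            self._delivered += 1
```

Because the delivery offset only advances on success, a failed delivery run leaves the backlog intact, and the next `deliver` call replays it from where it stopped.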
Compass experienced a minor impact on its ability to write to its primary database store. No core features were affected.
Because of delayed token resolution, Atlassian's customers in the US-EAST-1 production environment could have seen stale data for roughly 30 seconds at p99, against an expected 5 seconds.
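Here "p99" is the 99th-percentile staleness: the value that 99% of reads stay at or below. A nearest-rank computation over a set of staleness samples looks like this (the samples are illustrative, not Atlassian's telemetry):

```python
import math

def percentile(samples, p):
    """Nearest-rank p-th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# Hypothetical staleness samples (seconds): 98 reads at the expected 5 s
# and 2 slow reads at ~30 s are enough to pull the p99 onto the slow tail.
staleness = [5.0] * 98 + [30.0] * 2
p99 = percentile(staleness, 99)
```

The point of the percentile framing is that even a small fraction of slow token resolutions shows up directly in the p99, which is why the reported figure jumped from 5 s to ~30 s.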
The provisioning of new cloud tenants was also impacted until the underlying services recovered.
The issue was caused by a problem with several network devices within AWS's internal network. These devices were receiving more traffic than they could process, which led to elevated latency and packet loss. This degraded multiple AWS services that Atlassian's platform relies on, causing the service degradation and disruption to the products described above. For more information about the root cause, see AWS's Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region.
No Atlassian-driven events in the lead-up to the incident have been identified as causing or contributing to it.
We know that outages impact your productivity, and we are taking immediate steps to improve the Atlassian platform's resiliency and availability to reduce the impact of events like this in the future. Atlassian's Cloud services run in several regions (US East and West, AP, EU Central and West, among others), and data is replicated across regions to increase resilience against outages of this magnitude. Even so, we have identified and are taking actions that include improvements to our region failover process. These will minimize the impact of future outages on Atlassian's Cloud services and provide better support for our customers.
We are prioritizing the following actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident.
Atlassian Customer Support