Management Portal is unavailable

Incident Report for Atlassian Statuspage

Postmortem

SUMMARY

On December 1, 2020, between 14:51 and 15:42 UTC, Statuspage customers were unable to log in to the product. The event was triggered by an instrumentation change that exhausted thread resources on one of the serving instances.

The incident was detected within 7 minutes by our internal monitoring system and mitigated by terminating the faulty instance and subsequently reverting the change in question. The total time to resolution was 51 minutes.

TECHNICAL REASONS

Our stack uses GraphQL requests for backend calls. We recently added tracing to our middleware to measure the performance of these requests. The middleware change, which included the new tracing library, created a new thread per request and exhausted thread resources on the affected instance. As a result, GraphQL requests were not processed and users began to receive HTTP 500 errors.

ROOT CAUSE

The tracing library was set up to create a thread for each GraphQL request being served, which caused one instance to run out of resources and become unable to create new threads. Our healthchecks were not set up to identify this particular type of failure, so they failed to remove the faulty instance from rotation automatically.
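
The failure mode above can be contrasted with a bounded alternative. The following is a minimal sketch in Python, not the code that ran in production: it hands tracing work to a shared, fixed-size worker pool so the number of tracing threads cannot grow with request volume. The names `MAX_TRACE_WORKERS`, `record_trace`, and `handle_graphql_request` are hypothetical and chosen only for illustration.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# Assumed cap on tracing threads per instance; not a real Statuspage setting.
MAX_TRACE_WORKERS = 4

# One shared, bounded pool instead of a new thread per request.
_trace_pool = ThreadPoolExecutor(
    max_workers=MAX_TRACE_WORKERS, thread_name_prefix="trace"
)


def record_trace(request_id: str, duration_ms: float) -> str:
    """Hypothetical stand-in for sending a span to a tracing backend."""
    worker = threading.current_thread().name
    return f"{request_id}:{duration_ms:.1f}ms on {worker}"


def handle_graphql_request(request_id: str) -> Future:
    """Serve a request, then queue its trace on the bounded pool."""
    duration_ms = 1.0  # placeholder for the measured request duration
    # submit() queues the task when all workers are busy; the pool never
    # starts more than MAX_TRACE_WORKERS threads.
    return _trace_pool.submit(record_trace, request_id, duration_ms)
```

With this shape, a burst of requests lengthens the pool's queue rather than the instance's thread count. The queue in `ThreadPoolExecutor` is unbounded, so a fuller version would probably also drop or sample traces under sustained load.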

REMEDIAL ACTIONS PLAN & NEXT STEPS

Reliability and uptime for our services remain a top priority. To prevent this class of problem, we are going to do the following:

Enhanced Healthchecks - We will be enhancing our healthchecks across all services to ensure that instances with depleted thread resources are removed from our load balancer pools and reset accordingly. This prevents us from serving traffic on instances with depleted resources.
Instrumentation - We will be incorporating an alerting mechanism to page our engineers when the rate of thread usage changes in our services.
Retries on Frontend - To protect the customer experience in the management portal, we plan to incorporate conditional retries so that the experience remains tolerable if our GraphQL service has intermittent problems.
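
Two of the actions above, thread-aware healthchecks and conditional retries, can be sketched briefly. The Python below is an illustration under stated assumptions, not the planned implementation: `THREAD_LIMIT`, `RETRYABLE_STATUSES`, `healthcheck`, and `fetch_with_retries` are hypothetical names, and the thresholds and status codes are placeholders. The retry helper is written in Python for consistency even though the real retries would live in the frontend.

```python
import threading
import time
from typing import Callable, Optional, Tuple

THREAD_LIMIT = 200  # assumed per-instance ceiling, for illustration only
RETRYABLE_STATUSES = {500, 502, 503, 504}  # assumed set of transient errors

Response = Tuple[int, str]


def healthcheck(active_threads: Optional[int] = None,
                limit: int = THREAD_LIMIT) -> Response:
    """Return (status, body); a 503 signals the load balancer to drain us."""
    if active_threads is None:
        active_threads = threading.active_count()
    if active_threads >= limit:
        return 503, f"unhealthy: {active_threads}/{limit} threads"
    return 200, "ok"


def fetch_with_retries(send: Callable[[], Response],
                       max_attempts: int = 3,
                       base_delay: float = 0.2,
                       sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() again only when the status looks transient."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        # Stop on success, on a non-retryable error, or on the last attempt.
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            return status, body
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("unreachable")
```

The retry is conditional in the sense that only statuses in `RETRYABLE_STATUSES` trigger another attempt. A real client would likely also limit retries to idempotent queries and add jitter to the delay.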

Posted Dec 16, 2020 - 13:53 PST

Resolved

This incident has been resolved.

Posted Dec 01, 2020 - 08:52 PST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Dec 01, 2020 - 07:47 PST

Identified

We have identified the issue, and are working to deploy a fix.

Posted Dec 01, 2020 - 07:37 PST

Investigating

We are currently investigating an issue with the Statuspage management portal being unavailable when attempting to log in to manage.statuspage.io. Status pages themselves are unaffected.

We will provide more updates as they become available.

Posted Dec 01, 2020 - 06:52 PST

This incident affected: Management (Web Portal).