On December 1, 2020, between 14:51 - 15:42 UTC, Statuspage customers were unable to login to the product. The event was triggered due to an instrumentation change that resulted in the exhaustion of resources on one of the serving instances.
The incident was detected within 7 minutes by our internal monitoring system and mitigated by terminating the faulty instance and subsequently reverting the change in question. The total time to resolution was 51 minutes.
Our stack utilizes GraphQL requests for backend calls. We recently implemented tracing in our middleware to measure the performance of GraphQL requests. The issue was caused by a middleware change (that included the new tracing library) which exhausted thread resources because a new thread was created per request. As a result, GraphQL requests were not getting processed and users began to receive HTTP 500 errors.
The tracing library was set up to create a thread for each GraphQL request being served, which caused one instance to run out of resources and unable to create threads. Our healthchecks were not set up to identify this particular type of failure and failed to remove the faulty instance out of rotation automatically.
Reliability and uptime for our services remain a top priority. As part of this effort to prevent this class of problem, we are going to do the following -