Email and SMS Notification Delays
Incident Report for Atlassian Statuspage
Postmortem

Summary

On Feb 21st, between 3:00am PST and 9:11am PST, our notifications pipeline suffered a performance degradation. This led to delays for both email and SMS notifications.

The issue was caused by an unexpected influx of jobs submitted to our notification processing queue. We identified these jobs as having been generated by a malicious traffic pattern against an endpoint that had not been properly rate limited.

Although the impact was mitigated by 6:00am PST, the performance of our notifications pipeline was not fully restored until 9:11am PST, when a code update discarded these jobs from our background worker queues.

Background

We have different types of background worker queues that process our various asynchronous workloads. One such queue is dedicated to the notifications pipeline. Because notification delivery is time-sensitive, this queue has a distinct set of priority settings that ensure notification jobs are processed as fast as possible.

Cause

After some investigation, we found that the jobs being enqueued were generated by a cleanup job that had been erroneously prioritized. In addition, these cleanup jobs were enqueued as a result of requests to one of our endpoints that was not subject to any rate limits. This allowed malicious actors to generate these jobs unimpeded.
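To illustrate why an erroneously prioritized job class can delay time-sensitive work, here is a minimal sketch of a strict-priority queue. It is not our actual worker implementation: the priority values, the job names, and the assumption that workers always take the most urgent job first are hypothetical and chosen only for illustration.

```python
import heapq
from itertools import count

# Hypothetical priority levels; a lower number is served first.
NOTIFICATION = 1
CLEANUP_INTENDED = 5       # where a cleanup job would normally sit
CLEANUP_MISCONFIGURED = 0  # erroneously ahead of notifications


def drain_order(jobs):
    """Return job names in the order a strict-priority worker would run them.

    jobs is an iterable of (priority, name) pairs.
    """
    seq = count()  # tie-breaker keeps FIFO order within one priority level
    heap = [(priority, next(seq), name) for priority, name in jobs]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


def position_of_notification(cleanup_priority, flood_size):
    """Count how many jobs run before a single notification job."""
    jobs = [(cleanup_priority, f"cleanup-{i}") for i in range(flood_size)]
    jobs.append((NOTIFICATION, "notify-email"))
    return drain_order(jobs).index("notify-email")
```

Under these assumptions, a flood of 1,000 cleanup jobs at the intended priority leaves the notification first in line, while the same flood at the misconfigured priority pushes it behind every cleanup job.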

What we are changing going forward

While malicious traffic is a natural part of running any web service, being able to withstand it without causing service degradation for our customers is crucial. We’re constantly working to improve the resiliency characteristics of Statuspage. For this particular incident, these are some of the steps we’re taking:

• Audit our background jobs and ensure that no additional jobs would unduly take priority over our notification jobs
• Add rate limiting to the previously identified endpoint
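
As an illustration of the second step, the following is a minimal sketch of per-client rate limiting using a token bucket. It is one common technique, not necessarily the one we are deploying. The class names, the client identifier, and the rate and capacity values are all hypothetical.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """Keep one bucket per client key (for example an IP address; hypothetical)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()
```

In a sketch like this, the endpoint would call `allow(client_id)` before enqueuing any background job and reject the request when it returns `False`, so a single client cannot generate jobs faster than the configured rate.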

We apologize for the disruption in service as a result of this incident and thank you for trusting us with your incident communication. If you have any questions relating to this incident, please do not hesitate to contact us at hi@statuspage.io.

Posted Mar 14, 2019 - 14:56 PDT

Resolved
Email and SMS deliveries are continuing to happen in a timely fashion, and this issue is resolved. The window where delays were most prominent was between 3AM and 6AM Pacific time. Our engineering team will be sharing a more detailed postmortem in the days to come.
Posted Feb 21, 2019 - 10:44 PST
Monitoring
Email and SMS notifications are once again being delivered in a timely manner, and we are monitoring services closely.
Posted Feb 21, 2019 - 09:30 PST
Identified
We have identified an issue with our email and SMS notifications. While notifications are eventually being delivered, they are being delayed. Our engineering team is taking action to resolve the issue.
Posted Feb 21, 2019 - 08:45 PST
This incident affected: Notifications (Email, SMS).