Delays in metrics and notifications

Incident Report for Atlassian Statuspage

Postmortem

On July 11th, a production deploy inadvertently stopped the processing of background jobs. All of our customers rely on StatusPage for timely communication to their users, and in this scenario we fell short of providing that. I'm very sorry for the pain this may have caused you and your users. Below are details of the incident and how we're working to ensure it won't happen again.

What happened

One of our big ongoing projects at the moment is upgrading to a more recent version of Rails. When we upgraded one of our Rails dependencies, we missed a call to an API that had been removed in the new version of the library. That call was in a script that was responsible for activating our worker tier after a deploy is validated.

As a result, when the worker tier for our new deploy came up, it was stuck in its inactive state and did not process jobs. Among other things, the jobs flowing through our worker tier are responsible for sending notifications, refreshing metrics, and invalidating the cache on public status pages.

Another issue we noticed during the incident was that a change in a third-party tool used by our deploy script caused it to mistakenly perform a full deploy of the prior release instead of rolling back. While rollbacks take a maximum of five minutes, our full deploys take around fifteen.

Timeline

19:47 UTC - New release is deployed.
19:48 UTC - One of our engineers notices all the workers are inactive and starts our incident response procedure.
19:55 UTC - The issue is identified and a rollback is initiated.
20:09 UTC - Rollback completes, service is restored and working through the backlog of jobs.

The total time services were impacted was 22 minutes.
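
The 22-minute figure can be checked directly against the timeline. The following is a small sketch of that arithmetic; the variable and function names are illustrative only, and the year is taken from the posting dates on this report.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, July 11th, 2017).
deployed = datetime(2017, 7, 11, 19, 47)
detected = datetime(2017, 7, 11, 19, 48)
rollback_started = datetime(2017, 7, 11, 19, 55)
restored = datetime(2017, 7, 11, 20, 9)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


impact_minutes = minutes_between(deployed, restored)             # 22
time_to_detect = minutes_between(deployed, detected)             # 1
time_to_identify = minutes_between(detected, rollback_started)   # 7
rollback_minutes = minutes_between(rollback_started, restored)   # 14
```

The 14 minutes between initiating the rollback and restoring service is in line with a full deploy (around fifteen minutes) rather than a true rollback (five minutes at most).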

Product Impact

Three major symptoms were observed during the time when the worker tier was inoperable:

Email, SMS, and webhook notifications were delayed, but none were lost; all were delivered once the rollback was completed.
Public metrics would have displayed a gap during the time that the worker tier was inoperable, but would eventually have filled in (except for New Relic).
Incident updates would not have propagated to the public-facing status pages, because the cache-busting jobs were not being processed.

Remediation

The following are steps we're taking to avoid similar issues in the future:

Move to a different API for our deploy scripts, in order to ensure that it properly performs rollbacks
Exercise rollbacks on a biweekly schedule to ensure they work as intended
Ensure that health checks account for the integrity of our worker activation script
Bypass our worker tier for public status page cache invalidations
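
As an illustration of the health check item above, here is a minimal sketch of a post-activation check that fails a deploy when the workers never become active. It is not our actual deploy tooling: the function name, its parameters, and the `count_active_workers` callback are all hypothetical, and a real check would query whatever system reports worker state.

```python
import time
from typing import Callable


def workers_became_active(
    count_active_workers: Callable[[], int],  # hypothetical probe of the worker tier
    expected: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until at least `expected` workers are active or the timeout passes.

    Returns True if the workers came up in time, False otherwise, so the
    caller can fail the deploy instead of leaving the tier inactive.
    """
    deadline = clock() + timeout_s
    while True:
        if count_active_workers() >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A deploy script could run such a check right after the activation step and abort if it returns False. The point of checking the outcome (active workers) rather than the script's exit status is that a broken activation call is caught even when nothing reports an error.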

Next Steps

Thank you again for being supportive customers of StatusPage. Our team is here to talk face-to-face as necessary, so please feel free to reach out with any questions or comments.

Best regards,

Andrei Bocan

Posted Jul 28, 2017 - 12:54 PDT

Resolved

Following a recent deployment, processing of background jobs stopped because of a newly introduced bug. This resulted in email, SMS, and webhook notifications not getting sent. Also during this time, metrics were not refreshing and incident updates could have taken up to 15 minutes to appear. This incident has been resolved.

Posted Jul 11, 2017 - 13:38 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 11, 2017 - 13:21 PDT

Identified

We're experiencing issues with our background job queue, and incident updates will be delayed. We've identified the issue and are in the process of deploying a fix.

Posted Jul 11, 2017 - 12:58 PDT