On July 11th, a production deploy inadvertently caused processing of background jobs to stop. All of our customers rely on StatusPage for timely communication to their users, and in this scenario we fell short of providing this for you. I'm very sorry for the pain this may have caused for you and your users. Below are details of the incident and how we're working toward ensuring this won't happen again in the future.
One of our big ongoing projects at the moment is upgrading to a more recent version of Rails. When we upgraded one of our Rails dependencies, we missed a call to an API that had been removed in the new version of the library. That call was in a script that was responsible for activating our worker tier after a deploy is validated.
What this meant was that upon coming up, the worker tier for our new deploy was stuck in its inactive state and not processing jobs. Among other things, the jobs flowing through our worker tier are responsible for sending notifications, refreshing metrics, and invalidating the cache on public status pages.
Another issue that we noticed during the incident was that a change in a third-party tool used by our deploy script caused it to mistakenly perform a full deploy of the prior release instead of rolling back. While rollbacks take a maximum of five minutes, our full deploys clock in at around fifteen.
19:47 UTC - New release is deployed.
19:48 UTC - One of our engineers notices all the workers are inactive and starts our incident response procedure.
19:55 UTC - The issue is identified and a rollback is initiated.
20:09 UTC - Rollback completes, service is restored and working through the backlog of jobs.
The total time services were impacted was 22 minutes.
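For readers who want to check the arithmetic, the durations can be derived directly from the timestamps in the timeline above. This is only an illustrative sketch: the dictionary keys and the helper name `minutes_between` are made up for this example and are not part of any StatusPage tooling.

```python
from datetime import datetime

FMT = "%H:%M"

# Timestamps (UTC) copied from the timeline above; the key names are illustrative.
timeline = {
    "deployed": "19:47",
    "noticed": "19:48",
    "rollback_started": "19:55",
    "restored": "20:09",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


total_impact = minutes_between(timeline["deployed"], timeline["restored"])
time_to_detect = minutes_between(timeline["deployed"], timeline["noticed"])
rollback_duration = minutes_between(timeline["rollback_started"], timeline["restored"])

print(f"Total impact: {total_impact} min")
print(f"Time to detect: {time_to_detect} min")
print(f"Rollback duration: {rollback_duration} min")
```

The rollback duration computed here is close to the roughly fifteen minutes quoted for a full deploy, which matches the observation that the tooling performed a full deploy of the prior release rather than a true rollback.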
Three major symptoms were observed during the time when the worker tier was inoperable:
The following are steps we're taking to ensure we can avoid similar issues in the future:
Move to a different API for our deploy scripts, in order to ensure that it properly performs rollbacks
Exercise rollbacks on a biweekly schedule to ensure they work as intended
Ensure that health checks account for the integrity of our worker activation script
Bypass our worker tier for public status page cache invalidations
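To make the health check item above more concrete, here is a minimal sketch of a post-deploy check that treats an inactive worker tier as a failure. It is an assumption about what such a check could look like, not a description of our actual implementation: `WorkerStatus`, `check_worker_tier`, and the field names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerStatus:
    """Hypothetical snapshot of one worker, as reported after a deploy."""
    name: str
    active: bool
    jobs_processed_since_deploy: int


def check_worker_tier(workers: List[WorkerStatus]) -> List[str]:
    """Return a list of problems; an empty list means the tier looks healthy."""
    if not workers:
        return ["no workers reported"]

    problems = []

    # A worker left inactive means the activation step did not take effect.
    inactive = [w.name for w in workers if not w.active]
    if inactive:
        problems.append("inactive workers: " + ", ".join(sorted(inactive)))

    # Even if workers claim to be active, no processed jobs is suspicious.
    if all(w.jobs_processed_since_deploy == 0 for w in workers):
        problems.append("no jobs processed since deploy")

    return problems


# Example: the state described in this incident, where every worker is inactive.
stuck = [WorkerStatus("worker-1", False, 0), WorkerStatus("worker-2", False, 0)]
for problem in check_worker_tier(stuck):
    print("FAIL:", problem)
```

A deploy pipeline could run a check along these lines after the activation script and refuse to mark the release as healthy while the list of problems is non-empty.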
Thank you again for being supportive customers of StatusPage. Our team is here to talk face-to-face as necessary, so please feel free to reach out with any questions or comments.
Best regards,
Andrei Bocan