Site unavailable due to errant database migration

Incident Report for Atlassian Statuspage

Postmortem

Summary

On 08/12/2019, between 08:09 am and 08:16 am, there was a hard outage across multiple components of Statuspage. The outage was triggered by an errant database migration.

Background

As part of an application change to authentication, we needed to drop some legacy attributes from the primary database, which required an ad-hoc database migration. We expected this migration to be transparent. Unfortunately, it led to an outage of about 7 minutes across our services.

Cause

Database migrations that involve actions such as "dropping columns" take a table-level lock. While this is not the first time we have run database migrations of this nature, on this particular day increased activity from some of our worker jobs was causing intermittent "locks", which queued the database migration. This led to an increased time to process the actual migration script and blocked subsequent calls to those specific database tables. This inadvertent block resulted in a 7 minute outage across our services.
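The queuing effect described above can be illustrated with a minimal sketch. It assumes a simplified first-in, first-out lock queue with two hypothetical lock modes, "shared" and "exclusive"; the function name, the lock modes, and the request names are illustrative only and do not describe the actual database or its lock manager.

```python
from collections import deque


def grant_locks(holders, waiting):
    """Grant queued lock requests in FIFO order until one conflicts.

    holders: list of (name, mode) pairs currently holding the table lock.
    waiting: list of (name, mode) pairs queued for the table, oldest first.
    Returns (granted_names, still_waiting_names).
    Simplified model: an exclusive request conflicts with every holder,
    and a shared request conflicts with any exclusive holder.
    """
    holders = list(holders)  # copy so the caller's list is not mutated
    queue = deque(waiting)
    granted = []
    while queue:
        name, mode = queue[0]
        conflict = any(
            mode == "exclusive" or held_mode == "exclusive"
            for _, held_mode in holders
        )
        if conflict:
            # Strict FIFO: everything behind the blocked request waits too.
            break
        holders.append(queue.popleft())
        granted.append(name)
    return granted, [n for n, _ in queue]


# Hypothetical scenario: a worker job holds a shared lock, the migration
# queues for an exclusive lock, and web requests arrive behind it.
queued = [
    ("migration", "exclusive"),
    ("web_request_1", "shared"),
    ("web_request_2", "shared"),
]
blocked_by_worker = grant_locks([("worker_job", "shared")], queued)
after_worker_done = grant_locks([], queued)
```

In this model, while the worker job holds its lock nothing in the queue is granted, including the web requests that would otherwise be compatible with the worker. Once the worker finishes, the migration is granted and the web requests keep waiting until it completes.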

What are we changing going forward?

Reliability and uptime for our services remain our top priority. We will be hardening our ad-hoc database migration process so that it takes increased database activity into consideration and avoids impacting subsequent operations on our services.
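One common way to harden a migration of this kind is to bound how long it may wait for the table lock and to retry later instead of queuing indefinitely. The sketch below is an assumption about what such hardening could look like, not a description of the actual process: it stands in for the database table lock with a Python `threading.Lock`, and the function name, parameters, and default values are hypothetical.

```python
import threading
import time


def run_migration_with_lock_timeout(table_lock, migrate, lock_timeout_s=0.05,
                                    max_attempts=5, backoff_s=0.01):
    """Run migrate() only if the lock is obtained within lock_timeout_s.

    If the lock is busy, back off and try again rather than sitting in the
    lock queue where the migration would block later requests.
    Returns the attempt number on which the migration ran.
    Raises TimeoutError if every attempt fails to obtain the lock.
    """
    for attempt in range(1, max_attempts + 1):
        if table_lock.acquire(timeout=lock_timeout_s):
            try:
                migrate()
                return attempt
            finally:
                table_lock.release()
        # Lock was busy: wait a little longer after each failed attempt.
        time.sleep(backoff_s * attempt)
    raise TimeoutError(
        f"could not obtain the table lock after {max_attempts} attempts"
    )


# Hypothetical usage: the lock is free, so the migration runs first time.
applied = []
attempts_used = run_migration_with_lock_timeout(
    threading.Lock(), lambda: applied.append("drop legacy attributes")
)
```

The design choice here is that a failed migration attempt is cheap and can be rescheduled for a quieter period, whereas a migration that waits in the queue holds up every request behind it.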

We apologize for the disruption in our service as a result of this incident and thank you for trusting us with your incident communication. Please use this form to contact us in case you have any further questions regarding this outage.

Posted Sep 06, 2019 - 10:55 PDT

Resolved

This incident has been resolved.

Posted Aug 12, 2019 - 09:50 PDT

Monitoring

A routine data migration was found to have locked the primary database, causing timeouts for all inbound requests.

Everything has since recovered and is operating normally.

Posted Aug 12, 2019 - 09:26 PDT

Identified

We're investigating site unresponsiveness across all web entities.

Posted Aug 12, 2019 - 09:14 PDT

This incident affected: Hosted Pages (HTTP Pages, HTTPS Pages, Status Embed Widget, Public API, Shortlinks), Authentication (Admin User+Pass, Admin Google Auth, Admin SAML 2.0, Page Access Users, Page Google Auth, Page SAML 2.0, Page IP Restriction), Management (Web Portal, Authenticated API, DNS Validation, SSL Provisioning, Billing), Third Party Components (Statuspage, External), Automation (Pingdom Email, New Relic Email, Generic Email, PagerDuty Webhook, Jira Software Integration), Notifications (Email, SMS, Webhook, Twitter), System Metrics (Pingdom Integration, Librato Integration, New Relic Integration, Datadog Integration, Custom Integration), Support Sites (Ticketing, Knowledge Base, API Documentation), and Chat Integrations (Slack).