Site unavailable due to errant database migration
Incident Report for Atlassian Statuspage
Postmortem

Summary

On 08/12/2019, between 08:09am and 08:16am, there was a hard outage across multiple components of Statuspage. This event was triggered by an errant database migration.

Background

As part of an application change to authentication, we needed to drop some legacy attributes from our primary database, which required an ad-hoc database migration. We expected this migration to be a transparent change. Unfortunately, it led to a ~7 minute outage across our services.

Cause

Database migrations that involve actions such as "dropping columns" introduce a table-level lock. While this is not the first time we have worked on database migrations of this nature, on this particular day we had increased activity from some of our worker jobs, which were causing intermittent "locks" and thereby queuing the database migration. This led to an increased time to process the actual migration script, and blocked subsequent calls to those specific database tables. This inadvertent block resulted in a 7 minute outage across our services.
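The queuing effect described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not a description of our actual database: it assumes lock requests are granted strictly in arrival order (FIFO), that shared locks can coexist, and that the migration needs an exclusive table-level lock. The names Request and simulate, and all timings, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    exclusive: bool   # True for the migration's table-level lock
    arrival: int      # time the request is issued (arbitrary units)
    duration: int     # how long the lock is held once granted


def simulate(requests):
    """Grant locks in arrival order (FIFO) and return each request's wait time."""
    waits = {}
    active = []       # (end_time, exclusive) for every lock granted so far
    last_grant = 0    # FIFO: nobody is granted before an earlier arrival
    for req in sorted(requests, key=lambda r: r.arrival):
        start = max(req.arrival, last_grant)
        for end, exclusive in active:
            # Shared locks coexist; any pairing with an exclusive lock must wait.
            if req.exclusive or exclusive:
                start = max(start, end)
        active.append((start + req.duration, req.exclusive))
        last_grant = start
        waits[req.name] = start - req.arrival
    return waits


# Hypothetical timeline: a long-running worker job, the migration, a web request.
worker = Request("worker job", exclusive=False, arrival=0, duration=5)
migration = Request("migration", exclusive=True, arrival=1, duration=1)
web = Request("web request", exclusive=False, arrival=2, duration=1)

print(simulate([worker, web]))             # {'worker job': 0, 'web request': 0}
print(simulate([worker, migration, web]))  # {'worker job': 0, 'migration': 4, 'web request': 4}
```

In this model the web request is never delayed by the worker job alone. Once the migration is queued behind the worker job, the web request has to wait behind the migration, even though the migration itself only holds the lock briefly.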

What are we changing going forward?

Reliability and uptime for our services remain our top priority. We will be hardening our ad-hoc database migration process so that it takes increased database activity into consideration and avoids impacting subsequent operations on our services.
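One common way to harden a migration of this kind is to bound how long it may wait for its lock and to retry later if the table is busy. The sketch below illustrates that general pattern only; it is not the specific process we will adopt. The names run_with_lock_timeout, attempt_migration, and LockTimeout are hypothetical, and the sketch assumes the migration callable gives up cleanly, holding no lock, when its wait limit is exceeded.

```python
import time


class LockTimeout(Exception):
    """Raised by the hypothetical migration callable when the lock wait expires."""


def run_with_lock_timeout(attempt_migration, lock_timeout_s=2.0, max_attempts=5,
                          backoff_s=1.0, sleep=time.sleep):
    """Try a migration with a bounded lock wait, backing off between attempts.

    attempt_migration(lock_timeout_s) is assumed to either complete the
    migration or raise LockTimeout without leaving a lock request queued.
    Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            attempt_migration(lock_timeout_s)
            return attempt
        except LockTimeout:
            if attempt == max_attempts:
                raise
            # Let queued application traffic drain before trying again.
            sleep(backoff_s * 2 ** (attempt - 1))
```

Because each attempt gives up after a short, fixed wait, the migration never sits at the head of the lock queue for long, so other calls to the same tables are delayed by at most the timeout rather than for the full duration of the busy period.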

We apologize for the disruption in our service as a result of this incident and thank you for trusting us with your incident communication. Please use this form to contact us in case you have any further questions regarding this outage.

Posted Sep 06, 2019 - 10:55 PDT

Resolved
This incident has been resolved.
Posted Aug 12, 2019 - 09:50 PDT
Monitoring
A routine data migration was found to have locked the primary database, causing request timeouts for all inbound requests.

Everything has since recovered and is operating normally.
Posted Aug 12, 2019 - 09:26 PDT
Identified
We're investigating site unresponsiveness across all web entities.
Posted Aug 12, 2019 - 09:14 PDT
This incident affected: Notifications (Email, SMS, Webhook, Twitter), Automation (Pingdom Email, New Relic Email, Generic Email, PagerDuty Webhook, JiraOps), Management (Web Portal, Authenticated API, DNS Validation, SSL Provisioning, Billing), Hosted Pages (HTTP Pages, HTTPS Pages, Status Embed Widget, Public API, Shortlinks), System Metrics (Pingdom Integration, Librato Integration, New Relic Integration, Datadog Integration, Custom Integration), Support Sites (Ticketing, Knowledge Base, API Documentation), Third Party Components (Statuspage, External), Chat Integrations (HipChat, Slack), and Authentication (Admin User+Pass, Admin Google Auth, Admin SAML 2.0, Page Access Users, Page Google Auth, Page SAML 2.0, Page IP Restriction).