At approximately 8:00 AM PST, our outbound traffic was greatly reduced by an overloaded NAT server. Our outbound bandwidth in the us-west-2 region of AWS was saturated, and we were unable to contact services outside our internal network.
During the incident, notifications and metric updates were severely delayed. Users who sign in with Google authentication would have been unable to access the management dashboard. Users already logged into the management dashboard would have been unable to search for subscribers and incidents.
Since restoring service, we have implemented a large number of changes to improve how we allocate capacity and manage consumption of our network. Additionally, we are homing in on the factors that led to this incident, and we are working with the teams involved to ensure it does not recur.
Going forward, we are implementing the ability to fail over to another AWS region. This would have allowed us to fully restore functionality in minutes rather than hours.