- 13:20 UTC: Our Virtual NOC alerts us to the failure of the monitoring URLs for the Engine Yard Cloud Dashboard. At this point EY Cloud is returning errors and so is inaccessible to customers.
- 13:30 UTC: Investigation identifies the source of the issue as a failed AWS instance. A restart of the instance is initiated at AWS, but the restart hangs in the “stopping” state.
- 13:35 UTC: Amazon publish a Service Status message to inform customers that a connectivity issue is affecting some instances in a single Availability Zone in the US-East-1 Region.
- 13:35 - 14:45 UTC: Further attempts to restore the failed instance are unsuccessful due to the AWS issues, as are attempts to snapshot or reassign the instance’s EBS volume.
- 14:45 UTC: Another platform instance running in the Cloud Dashboard environment is repurposed to take the place of the failed instance.
- 15:00 UTC: Reconfiguration of the Cloud Dashboard application completes successfully and service is restored.
- 15:00 UTC and onwards: With the Cloud Dashboard restored, Engine Yard Support engineers work with customers to replace affected instances with alternative instances in unaffected AWS AZs.
- 20:35 UTC: Amazon announce that the issue is resolved and service is restored. Certain EC2 instances, RDS instances, and EBS volumes remain in a failed state and cannot be restored by Amazon automatically, so EY Support continue to work with customers to restore these.
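The recovery sequence in the timeline above — restart the instance, failing that snapshot its EBS volume, failing that repurpose a standby — amounts to an ordered fallback chain. A minimal sketch, purely illustrative and not Engine Yard's actual tooling (the step names and stub outcomes are assumptions modelling this incident):

```python
# Hypothetical fallback chain mirroring the timeline: each recovery
# step is attempted in order until one succeeds.
def recover(steps):
    """steps: list of (name, callable) pairs; each callable returns
    True on success. Returns the name of the step that restored
    service, or raises if every step fails."""
    for name, attempt in steps:
        if attempt():
            return name
    raise RuntimeError("all recovery steps failed")

# Simulating this incident: the restart and the snapshot both fail
# because of the AWS issue; repurposing a standby instance succeeds.
result = recover([
    ("restart_instance", lambda: False),
    ("snapshot_ebs_volume", lambda: False),
    ("repurpose_standby", lambda: True),
])
print(result)  # repurpose_standby
```

The value of ordering the chain this way is that the least destructive option (restoring the original instance with its data intact) is always tried before the more disruptive ones.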
Incident Root Causes:
- From AWS: At 4:33 AM PDT one of ten data centers in one of the six Availability Zones in the US-EAST-1 Region saw a failure of utility power. Our backup generators came online immediately but began failing at around 6:00 AM PDT. This impacted 7.5% of EC2 instances and EBS volumes in the Availability Zone. Power was fully restored to the impacted data center at 7:45 AM PDT. By 10:45 AM PDT, all but 1% of instances had been recovered, and by 12:30 PM PDT only 0.5% of instances remained impaired. Since the beginning of the impact, we have been working to recover the remaining instances and volumes. A small number of remaining instances and volumes are hosted on hardware which was adversely affected by the loss of power. We continue to work to recover all affected instances and volumes and will be communicating to the remaining impacted customers via the Personal Health Dashboard. For immediate recovery, we recommend replacing any remaining affected instances or volumes if possible.
- From Engine Yard: The root cause of the issue was quickly identified as the failure of an AWS instance responsible for running a core component of the EY Cloud Platform. The inability of any other platform instances to communicate with this instance resulted in errors in the platform. The initial aim was to restore the existing instance or, failing that, to snapshot its EBS volume to retain data, but the AWS issues prevented both actions, leaving repurposing another existing instance as the only option. Once this was completed, service was restored.
For the period that the Engine Yard Cloud Dashboard was offline, no customers were able to view or manage their environments through either the Dashboard or the API, and so were unable to make environment or application changes. Running instances outside the subset of failed instances in the single AZ of US-East-1 were unaffected, so the majority of customer applications were not impacted. For those with instances in the affected AZ, application impact depended on the role of the impacted instances: failed database and application master instances resulted in application downtime, whilst failed slave instances most likely did not. EY Support staff worked with customers to restore failed instances where practically possible within the limitations of the AWS issues.
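The role-based impact rule described above can be expressed as a simple predicate. This is a hypothetical sketch for illustration — the role names, AZ value, and instance representation are assumptions, not Engine Yard's data model, and the postmortem does not name the affected AZ:

```python
# Placeholder AZ; the report does not identify the affected zone.
AFFECTED_AZ = "us-east-1a"

def application_impacted(instances):
    """An application is considered down if a database master or the
    application master sits in the affected AZ; failed slave
    instances alone most likely do not cause downtime."""
    critical_roles = {"db_master", "app_master"}
    return any(
        inst["az"] == AFFECTED_AZ and inst["role"] in critical_roles
        for inst in instances
    )

env = [
    {"role": "app_master", "az": "us-east-1a"},
    {"role": "db_slave", "az": "us-east-1b"},
]
print(application_impacted(env))  # True: app master is in the affected AZ
```

In practice this is why replacing database and application master instances was the priority during the 15:00 UTC onwards recovery work, while slave replacement could proceed at lower urgency.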
Incident Corrective Actions:
Engine Yard will be working to strengthen the platform environments to ensure the highest resilience across all components and minimise the disruption from any future infrastructure failures.