We're experiencing another major database cluster connectivity issue
Incident Report for Opbeat

Postmortem of outage of September 3rd

On September 3rd, our main database instance experienced a network outage. The web application at opbeat.com was offline twice, first for 8 minutes and then for 23 minutes. This included the intake, the component in our infrastructure that accepts data from modules in client stacks.

We're sorry this happened. Please see the last section for what we're doing to mitigate this in the future.

Timeline

Opbeat is entirely hosted on AWS. Times are UTC.

  • 6:53 PM first page: We immediately start discussing what's going on. It quickly becomes apparent that our master database is unavailable. We start preparing to fail over to a replica.

  • 7:01 PM master becomes available again (8 mins): Connectivity to the master comes back. We decide to stick with the current master and hope it's a one-time blip.

  • 8:33 PM master becomes unavailable: At this point we decide to fail over to a replica. A few things slow us down: some processes do not restart correctly and we have to kill them manually.

  • 8:56 PM failover complete and services are back up (23 mins): The failover completes and the load balancers register the web servers as up.

Impact

During the master database outages, both opbeat.com and our intake were unavailable. That means data sent to us was not being accepted or stored, and it was impossible to access the information already sent to Opbeat.

Remediation

Our main database runs PostgreSQL, which is an extraordinarily solid piece of software that we'll continue to use. Unfortunately, machines fail. While we'll never be able to completely secure ourselves against this kind of incident, there are steps we can take to ensure that we fail over more quickly. We also have ideas, which we're now investigating, for how the intake can stay available despite this kind of outage. We've reached out to AWS to understand what caused the connectivity issue in the first place, but they have been unable to find the cause.
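To give a rough idea of the kind of approach we're investigating for the intake, the sketch below buffers incoming events to a local spool file when the primary database write fails and replays them once connectivity returns. This is a minimal illustration, not our actual code: the spool path and the `write_to_db` helper are placeholders.

    import json
    from pathlib import Path

    SPOOL = Path("/var/spool/intake-buffer.jsonl")  # hypothetical spool location

    def accept_event(event, write_to_db):
        """Try to store an intake event; fall back to a durable spool file.

        `write_to_db` stands in for whatever normally persists the event to
        the primary database. If it raises (e.g. the master is unreachable),
        the event is appended to the spool instead of being dropped.
        """
        try:
            write_to_db(event)
        except Exception:
            with SPOOL.open("a") as spool:
                spool.write(json.dumps(event) + "\n")

    def replay_spool(write_to_db):
        """Drain the spool file once the database is reachable again."""
        if not SPOOL.exists():
            return
        remaining = []
        for line in SPOOL.read_text().splitlines():
            try:
                write_to_db(json.loads(line))
            except Exception:
                remaining.append(line)  # keep anything that still fails
        SPOOL.write_text("\n".join(remaining) + ("\n" if remaining else ""))

In production the buffer would more likely be a durable queue than a local file, but the idea is the same: keep accepting and persisting data while the primary database is down, and replay it once it comes back.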

Posted Sep 11, 2015 - 15:57 UTC

Resolved
This incident has been resolved. Postmortem incoming, after we've digested the information.
Posted Sep 04, 2015 - 00:06 UTC
Monitoring
We failed over to our backup database cluster and are monitoring the situation.
Posted Sep 03, 2015 - 20:55 UTC
Investigating
Currently investigating.
Posted Sep 03, 2015 - 20:40 UTC