On September 3rd, our main database instance experienced a network outage. The web application at opbeat.com was offline twice, first for 8 minutes and then for 23 minutes. So was the intake, the component in our infrastructure that accepts data from modules in client stacks.
We're sorry this happened. Please see the last paragraph for what we're doing to mitigate this in the future.
Opbeat is entirely hosted on AWS. Times are UTC.
7:01 PM master becomes available again (8 mins): Connectivity to the master comes back. We decide to stick with the current master and hope it's a one-time blip.
8:33 PM master becomes unavailable: At this point we decide to fail over to a replica. A few things slow us down: some processes do not get restarted correctly and we have to kill them manually.
8:56 PM failover complete and services are back up (23 mins): The failover completes and the load balancers register the web servers as up.
During the master database outages, opbeat.com and our intake were unavailable. That means data sent to us was not accepted or stored, and it was impossible to access the information already sent to Opbeat.
Our main database runs PostgreSQL, an extraordinarily solid piece of software that we'll continue to use. Unfortunately, machines fail. So while we'll never be able to completely protect ourselves against this kind of incident, there are steps we can take to ensure that we fail over more quickly. We also have ideas for how the intake can stay available despite this kind of outage, and we're now investigating them. We've reached out to AWS to understand what caused the connectivity issue in the first place, but they have been unable to find the cause.
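One common way for an intake to stay available during a database outage is to accept data into a buffer and write it out once the database is reachable again. The sketch below only illustrates that general pattern and is not a description of our implementation. `BufferedIntake`, `store`, and `max_buffered` are hypothetical names, and the sketch assumes the storage layer raises `ConnectionError` while the database is down.

```python
from collections import deque


class BufferedIntake:
    """Accepts events and holds them in memory while the database is unreachable."""

    def __init__(self, store, max_buffered=10_000):
        # `store` is a hypothetical callable that writes one event to the database
        # and raises ConnectionError when the database cannot be reached.
        self._store = store
        # Bounded buffer: once full, the oldest buffered events are dropped.
        self._pending = deque(maxlen=max_buffered)

    def accept(self, event):
        """Always accept the event, then try to write out everything buffered."""
        self._pending.append(event)
        self.flush()
        return True

    def flush(self):
        """Write buffered events in order; stop at the first connection failure."""
        while self._pending:
            try:
                self._store(self._pending[0])
            except ConnectionError:
                return False  # database still down; events stay buffered
            self._pending.popleft()
        return True

    @property
    def pending(self):
        """Number of events waiting to be written."""
        return len(self._pending)
```

An in-memory buffer like this is lost if the intake process restarts, so a real design would more likely spool to disk or a separate queue. The point of the sketch is only that accepting data can be decoupled from writing it to the main database.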