Database node failures

Incident Report for Pronto

Postmortem

A postmortem

Posted Oct 14, 2021 - 07:35 MDT

Resolved

Just after 8:00pm MDT on Sep 26th, multiple nodes in the Pronto database cluster failed simultaneously. The cluster is designed to automatically withstand the loss of individual nodes in succession, but not the simultaneous loss of multiple nodes, as happened in this case. The Operations team at Pronto immediately engaged multiple avenues of support at both our hosting provider and our database vendor. In the meantime, they also prepared to perform a full database restore (a procedure that is tested regularly, including a test last week).

After some time, our hosting provider alerted us to an underlying service failure on their part that had caused the node failures. They worked to restore services, but this took several hours. Once our hosting provider’s fix was in place, Pronto services began to come back online at about 1:15am MDT on Sep 27th and were working normally by 1:30am.

We are extremely sorry for the disruption to Pronto services. We will learn from this incident and work to make our services more resilient to underlying failures like this in the future.

Posted Sep 26, 2021 - 20:00 MDT