During scheduled network maintenance at 04:00 MST, we applied routing changes to standardize the availability zones we use. Due to a developer error, the subnets in one AZ did not receive the proper route to our user index database resource. This caused some user indexing jobs to fail. These jobs were configured with infinite retries, and as they continued to build up, fail, and retry, they overwhelmed the queueing database. This caused cascading failures throughout the system as other jobs in the queue were delayed or failed. We mitigated the impact by scaling up that database. We then fixed the routing issue and removed the infinite retry setting from the job. This resolved the issue.
Moving forward, we will ensure that infrastructure code reviews examine route changes and job retry policies more closely. We are also investigating auto-scaling solutions for our queueing database.
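To illustrate the kind of retry policy such a review would look for, here is a minimal sketch of a bounded retry with exponential backoff and jitter. It is not Pronto's actual job code: the names `run_with_retries`, `backoff_delay`, and `RetriesExhausted`, and the default limits, are assumptions chosen for this example.

```python
import random
import time


class RetriesExhausted(Exception):
    """Raised when a job has failed on every allowed attempt (hypothetical name)."""


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Return a jittered delay in seconds for a 1-based attempt number.

    The upper bound doubles with each attempt and is capped at `cap`.
    """
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)


def run_with_retries(job, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call job() until it succeeds or max_attempts total attempts have failed."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts:
                sleep(backoff_delay(attempt, base, cap))
    # Give up instead of putting the job back on the queue indefinitely.
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

With a hard attempt limit, a job that cannot reach its database surfaces as a failure after a bounded number of tries instead of accumulating in the queue, and the jittered delay spreads retries out so they do not all arrive at once.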
We apologize for the degraded performance and appreciate your patience as we continue to improve Pronto for all of our customers.