AWS outage
Incident Report for Pronto
All Pronto services are now back to normal. Push notifications are being delivered in real time again, and other async jobs such as URL previews are speedy once more. Canvas integration has also been re-enabled. Canvas course syncing will need some time to catch up, but should be up to date for all customers within the next 6 hours. Thank you for your patience today. We will spend some time analyzing this event to see what changes we can make to be more resilient to a similar failure in the future.
Posted Dec 07, 2021 - 18:36 MST
AWS has implemented their root cause mitigation plan and core Pronto services are once again working well. We are still experiencing some minor latency with push notifications as scaling on that service has not yet been restored by AWS engineers. Canvas integration is also still disabled for the same reason. We are hopeful that these issues will both be resolved quickly.
Posted Dec 07, 2021 - 17:07 MST
As AWS starts to see significant recovery, we are also seeing some Pronto services scaling up again. Push notifications are still delayed, but response times are improving on the core Pronto services. Canvas integration is still disabled. We will continue to provide updates as services recover.
Posted Dec 07, 2021 - 15:29 MST
We just saw a major increase in traffic from an integration platform, perhaps because the platform itself was recovering. This overloaded our small cluster. To mitigate this, we have temporarily disabled the Canvas integration until we are once again able to scale the Pronto services. This mitigation appears to have worked and Pronto core services are back up, albeit with slower response times than normal.
Posted Dec 07, 2021 - 14:24 MST
We are continuing to work on a fix for this issue.
Posted Dec 07, 2021 - 14:17 MST
As expected, traffic increases finally pushed Pronto over the edge, and we are now experiencing a system-wide outage because the AWS outage prevents us from scaling. We will continue to do everything in our power to bring Pronto back up. We sincerely apologize for the disruption we know this is causing you.
Posted Dec 07, 2021 - 14:05 MST
AWS says they are starting to see some signs of recovery, but do not have an ETA for full recovery at this time. We have tried various ways to scale Pronto servers, but because AWS internal APIs are failing, this has not been successful. As a result, Pronto is currently running on less than half the capacity we would normally have at this time of day. Push notifications continue to be delayed, and general response times are increasing. If AWS has not recovered their services in the next hour, we expect to start seeing much higher latency and an increase in error rates on Pronto core services. We will continue to explore alternatives in the meantime and will keep you up to date. Thanks for your patience.
Posted Dec 07, 2021 - 12:15 MST
AWS has identified the root cause and is working towards recovery. Pronto core services are still running smoothly for now (except for delays in push notifications and other async jobs as noted in the last update), but because of the outage we are unable to automatically or manually scale up our servers as we normally would. As traffic increases in the next couple of hours, this could result in slower response times across Pronto services. We are investigating alternative ways to scale up our servers in the meantime and will continue to keep you updated.
Posted Dec 07, 2021 - 11:15 MST
There seem to be problems in the us-east-1 AWS region resulting in some services being slow or having increased error rates. Core Pronto services are not currently impacted, but push notifications and other async jobs such as URL previews may be delayed. We are monitoring the situation and will post updates as we learn more. AWS status is available here:
Posted Dec 07, 2021 - 10:48 MST
This incident affected: Pronto Platform and Pronto Web App.