A quick summary
Hi Noodlers. First, I want to apologize for the downtime we had today, and for the shorter outages on Friday and Monday mornings. We know that many of your students are trying to work on or finish projects in time for deadlines, and this sort of unexpected downtime really interrupts that and throws a wrench into your lesson plans. My sincere thanks to the many of you who contacted me or Amy via the helpdesk, email, or phone. You were all very understanding, knowing that this is also stressful for us!
With that in mind, I want to give you a quick explanation of what has been happening, what we’ve already done, and what we’re still doing to prevent it from happening again.
The Friday/Monday morning issue
We had about an hour of downtime on both Friday and Monday mornings, starting around 6:30 AM each day. That issue was unrelated to what happened today; it was just a coincidence that the two problems hit back to back (ugh!). Our team analyzed it carefully with the Amazon cloud server team on Monday afternoon, and the necessary server configuration changes went into place Monday evening. Done and fixed!
Today’s downtime
Today’s downtime was far more extensive: it started before noon and lasted more than six hours. While we’ve taken the site offline for scheduled maintenance for that long in the past (usually over a vacation), I can’t remember the last time an unexpected issue took us offline for that long. This one was a challenge to track down. We could see that an Office 365 API call our server makes when it starts up was suddenly and inexplicably failing. Because the server depends on that call to start, once NoodleTools went down, we weren’t able to bring it back up. Our team talked to Microsoft support, but they weren’t aware of any issue on their end. Finally, earlier this evening, their API simply stopped returning errors, and we were able to get things back online without changing a thing on our side.
The takeaway from today: the Office 365 API has never failed like this before (we’ve been running the same code for several years), so we have no reason to expect it to happen again anytime soon. However, we are implementing a change on our side tonight that will let our server handle the same scenario gracefully if it ever does happen again. The new code will detect that the API call is failing and simply turn off Office 365 logins temporarily until the API comes back online, while the rest of NoodleTools keeps running.
Wrapping it up
That might be more than you wanted to know, but I wanted to be transparent about what was going on and give you peace of mind knowing that we’re on top of all of this and have a plan!
Again, my apologies to you and your students. Here’s to an uneventful Wednesday!