Simple router IP address change brought Microsoft offline – Digital Journal



According to Microsoft, a change made to the Microsoft Wide Area Network (WAN) made Microsoft services inaccessible to users around the world. The network outage brought down the Azure cloud platform along with business services including Teams and Outlook.

The outage caused significant concern among commercial users. It also highlights an ongoing challenge for network operations teams: preventing future outages and being prepared to minimize downtime when they do occur.

To understand the implications, Digital Journal spoke with Josh Stephens, who has worked in network engineering for over 30 years, first as an engineer in the US Air Force and most recently as CTO of BackBox.

Stephens begins by expressing his disbelief over the incident: “It’s unbelievable that even the simplest configuration change or even a typo can sometimes cause a domino effect and bring down a network or disrupt a supposedly fault-tolerant commercial service. Even tech giants like Microsoft are not immune.”

Stephens continues: “In many cases, the outage may not occur immediately after the configuration change is made, and therefore it may be difficult to correlate the change during root cause analysis.”

Delving into the core issue, Stephens notes: “While many news reports have focused on the fact that a configuration change caused such widespread outage, the real headline is that it took them four hours to restore service.”

Stephens adds: “While this sounds exorbitant, without further technical details about the cause of the outage and, more specifically, the extenuating circumstances that prolonged the time it took to restore service, rather than pass judgement, I’ll just say honestly, I’ve been there.”

As for lessons to be learned, Stephens has ideas for how network teams at other organizations can act now to avoid a similar disaster.

His first recommendation is: “Accelerate the speed of resolution of difficult technical problems to ensure that there is solid documentation and up-to-date network maps.”

Second, Stephens advises: “Continuous, automated configuration auditing and remediation to ensure that all network devices are up-to-date and compliant with operational policies and industry standards.”
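As a rough illustration of what automated configuration auditing can look like, the Python sketch below checks a device configuration against a simple policy of required and forbidden lines. The `Policy` class, the `audit_config` function and the sample rules are hypothetical and chosen for illustration only; they do not describe BackBox's product or Microsoft's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical policy: lines that must and must not appear in a config."""
    required: list[str] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)


def audit_config(config_text: str, policy: Policy) -> list[str]:
    """Return a list of human-readable findings; an empty list means compliant."""
    # Normalize the config into a set of stripped, non-empty lines.
    lines = {line.strip() for line in config_text.splitlines() if line.strip()}
    findings = []
    for rule in policy.required:
        if rule not in lines:
            findings.append(f"missing required line: {rule}")
    for rule in policy.forbidden:
        if rule in lines:
            findings.append(f"forbidden line present: {rule}")
    return findings
```

Run on a schedule across every device, a check like this would flag configuration drift before it contributes to an outage. Remediation, meaning pushing a corrected configuration back to the device, is deliberately left out of the sketch.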

Stephens also adds, as his third consideration, that automated network configuration backups “allow you to instantly restore backups and have weekly or frequent automated OS updates and patches.”

Stephens’ fourth and final recommendation is: “At a minimum, your automation platform should create daily backups, before and after changes, and store a long history of backups within a fault-tolerant data store and automatically scalable. In addition, it should be able to reliably perform updates at scale while employing at least mildly complex workflows.” While these approaches are useful, Stephens cautions, “No single tool or approach can guarantee business continuity, but there are ways to be better prepared if the worst happens.”
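Backups taken before and after a change are most useful when they can be compared. The minimal Python sketch below uses the standard library's `difflib` to list the lines that differ between two snapshots, which could help tie an outage to a specific change during root cause analysis. The `changed_lines` function and the sample addresses are hypothetical and are not taken from the Microsoft incident.

```python
import difflib


def changed_lines(before: str, after: str) -> list[str]:
    """Return only the removed ('- ') and added ('+ ') lines between two backups."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical pre-change and post-change snapshots of one router.
before = "hostname edge1\nip route 0.0.0.0 0.0.0.0 10.0.0.1"
after = "hostname edge1\nip route 0.0.0.0 0.0.0.0 10.0.0.2"

for line in changed_lines(before, after):
    print(line)
```

Storing each snapshot with a timestamp alongside the diff would let a team look back through the change history when an outage appears hours after the change that caused it.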
