Engineers mess up, causing Microsoft Azure downtime

Microsoft engineers make a roll-out mistake, costing Azure some downtime.

Chris Smith

News Reporter and eSports/Audio/Mobile Devices Editor

Published Dec 18, 2014 1:48 AM CST
Updated Nov 3, 2020 12:11 PM CST

1 minute & 15 seconds read time

Due to gaps in its engineers' deployment policies, Microsoft's Azure cloud service was taken offline for a period in November 2014. The details come from a thorough mea culpa analysis published by Microsoft itself.


Jason Zander, a member of the Azure team, recently published a final root cause analysis (RCA). According to it, engineers intended to push software changes that would improve performance and reduce processor load on the service's front-end systems. Instead, the change caused an outage that left customers unable to connect to Azure's storage, virtual machine, website, Active Directory or management portal functions.

The code did improve performance during testing, but the full roll-out ran into two main problems. Microsoft usually deploys updates like this in waves, gradually expanding the updated infrastructure bit by bit rather than changing everything at once. This time, an engineer judged the update to be low-risk after a small testing phase and pushed it to everyone in one hit. After this blunder and the resulting outage, Microsoft is strictly enforcing staged deployments from now on.
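To make the idea of deploying in waves concrete, here is a minimal Python sketch. It is not Microsoft's tooling: the function `staged_rollout`, the `deploy` and `is_healthy` callbacks, and the wave names are all hypothetical, chosen only to show a health check acting as a gate between waves.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    waves: Iterable[List[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy wave by wave and halt after the first unhealthy wave.

    Returns every target that received the update before the rollout
    either completed or was halted.
    """
    updated: List[str] = []
    for wave in waves:
        for target in wave:
            deploy(target)  # hypothetical deployment hook
            updated.append(target)
        # Gate: check the whole wave before touching the next one.
        if not all(is_healthy(target) for target in wave):
            break
    return updated


if __name__ == "__main__":
    # Hypothetical waves: a small canary first, then progressively more.
    waves = [["canary"], ["region-a", "region-b"], ["region-c"]]
    touched = staged_rollout(
        waves,
        deploy=lambda target: None,
        is_healthy=lambda target: target != "region-b",
    )
    print(touched)
```

In this example the simulated fault in `region-b` halts the rollout, so `region-c` never receives the change. Pushing to everyone in one hit is equivalent to a single wave, which leaves no gate to catch the problem.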

The second main mistake was explained by iTnews as leading "to the software change being wrongly enabled on Azure Blob (binary large object) storage front-ends when it had only been tested against table storage front-ends. This exposed a bug that caused some Blob storage front-ends being stuck in infinite loops, and ceasing to respond to requests."

It seems that Microsoft has learned from its mistakes, and here's hoping the engineer still has a job to feed his family and lives to work another day. Alongside the two errors that rendered its online service useless for many, Microsoft has also blamed poor communication during the outage, stating that tweets from the @Azure Twitter account and its live blogs didn't keep customers well enough informed with quick updates.