Crash and Burn
The Skype Network Outage
by: Jerry Liao
The local I.T. industry was so surprised by the sudden replacement of Antonio Javier as Managing Director of Microsoft Philippines that not much attention was given to the Skype outage, a problem that left most Skype users around the world in the dark. Millions of users were unable to make phone calls or send instant messages via the popular Internet-based service.
What exactly happened? Skype had this to say:
“On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update. The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.
Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days.
The issue has now been identified explicitly within Skype. We can confirm categorically that no malicious activities were attributed or that our users’ security was not, at any point, at risk. This disruption was unprecedented in terms of its impact and scope. We would like to point out that very few technologies or communications networks today are guaranteed to operate without interruptions.
We are very proud that over the four years of its operation, Skype has provided a technically resilient communications tool to millions of people worldwide. Skype has now identified and already introduced a number of improvements to its software to ensure that our users will not be similarly affected in the unlikely possibility of this combination of events recurring.
The Skype community of users has been incredibly supportive and we are very grateful for all their good wishes.”
Initially, it was feared that the service had been disabled by hackers. Later, it was said to be the result of a software bug, exposed by a massive restart among users who had downloaded a routine Windows patch from Microsoft. To avoid further misunderstanding, Skype answered some questions to clarify what really happened:
1. Are we blaming Microsoft for what happened?
We (Skype) don’t blame anyone but ourselves. The Microsoft Update patches were merely a catalyst — a trigger — for a series of events that led to the disruption of Skype, not the root cause of it. And Microsoft has been very helpful and supportive throughout. The high number of post-update reboots affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources at the time, prompted a chain reaction that had a critical impact. The self-healing mechanisms of the P2P network upon which Skype’s software runs have worked well in the past. Simply put, every single time Skype has needed to recover from reboots that naturally accompany a routine Windows Update, there hasn’t been a problem.
2. What was different about this set of Microsoft update patches?
In short – there was nothing different about this set of Microsoft patches. During a joint call soon after problems were detected, Skype and Microsoft engineers went through the list of patches that had been pushed out. We ruled each one out as a possible cause for Skype’s problems. We also walked through the standard Windows Update process to understand it better and to ensure that nothing in the process had changed from the past (and nothing had). The Microsoft team was fantastic to work with, and after going through the potential causes, it appeared clearer than ever to us that our software’s P2P network management algorithm was not tuned to take into account a combination of high load and supernode rebooting.
3. How come previous Microsoft update patches didn’t cause disruption?
That’s because the update patches were not the cause of the disruption. In previous instances where a large number of supernodes in the P2P network were rebooted, other factors of a “perfect storm” had not been present. That is, there had not been such a combination of high usage load during supernode rebooting. As a result, P2P network resources were allocated efficiently and self-healing worked fast enough to overcome the challenge.
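For readers who want to see why neither factor alone was enough, here is a toy sketch in Python. It is not Skype's algorithm, which has not been published; the function name, the numbers, and the rule that overloaded supernodes drop out while the same requests keep retrying are all assumptions made purely for illustration.

```python
def surviving_supernodes(supernodes, rebooting, capacity_per_node, load,
                         max_rounds=100):
    """Toy cascade model (hypothetical, not Skype's real algorithm).

    supernodes        -- total supernodes in the network
    rebooting         -- how many of them are offline for a reboot
    capacity_per_node -- requests one supernode can serve per round
    load              -- requests arriving per round (log-ins, calls)

    Returns the number of supernodes still serving once the network
    either stabilises or collapses.
    """
    alive = supernodes - rebooting
    for _ in range(max_rounds):
        if alive <= 0:
            return 0  # nothing left to serve requests
        excess = load - alive * capacity_per_node
        if excess <= 0:
            return alive  # remaining capacity covers the load: stable
        # Assumption: unserved requests knock out enough supernodes to
        # account for the excess, and the same load retries next round.
        lost = -(-excess // capacity_per_node)  # integer ceiling
        alive -= min(alive, lost)
    return max(alive, 0)


# Reboots under normal load, high load without reboots, then both together.
print(surviving_supernodes(1000, 300, 100, 60000))
print(surviving_supernodes(1000, 0, 100, 80000))
print(surviving_supernodes(1000, 300, 100, 80000))
```

In this made-up setup the network rides out the reboots under normal load and rides out the high load with every supernode up, but the two together leave it short of capacity, and each round of lost supernodes makes the next shortfall bigger until nothing is left.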
The impact of the breakdown proves two things: one, the technology is already widely accepted and widely used, and it's no longer just an alternative; and two, the marketplace expects the same level of responsibility and accountability that it demands from a public utility. Users now rely on Skype in the same way they rely on their phone systems.
Today, the Internet is no longer just a source of information; it is a means of communication. VoIP services, email, newsgroups, IM, and social networking sites are now all part of our lives, and this cannot be denied. Service providers should realize this and do their best to ensure the availability of these services.