100 Minutes Without Gmail: Google Explains Outage

"We've turned our full attention to helping ensure this kind of event doesn't happen again," wrote Ben Treynor, self-described 'VP Engineering and Site Reliability Czar' at Google, in an official blog post last night explaining yesterday's 100-minute Gmail outage, which Treynor conceded "was a Big Deal, and we're treating it as such."

He then described in detail the events that conspired to bring about the outage:

"Here's what happened: This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem — we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.

However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.

The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google's architecture), distributed the traffic across the request routers, and the Gmail web interface came back online."
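The chain reaction Treynor describes can be illustrated with a toy model. The sketch below is not Google's routing logic. It assumes, purely for illustration, that traffic is split evenly across the routers still in service and that any router whose share exceeds its capacity drops out. The function name `simulate_cascade` and all of the numbers are hypothetical.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of a cascading overload (an illustration, not Google's system).

    capacities: per-router capacity, in arbitrary request units (assumed values).
    total_load: total incoming traffic, split evenly across healthy routers.
    Returns (routers_still_serving, rounds_of_failures).
    """
    healthy = list(capacities)
    rounds = 0
    while healthy:
        share = total_load / len(healthy)
        # A router whose share exceeds its capacity stops accepting traffic.
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # everyone can carry their share, so the system is stable
        healthy = survivors
        rounds += 1
    return len(healthy), rounds


routers = [90, 90, 110, 110, 120, 120, 130, 130, 140, 140]

# Slightly too much load: the two weakest routers drop out first,
# and their traffic then overwhelms the rest.
print(simulate_cascade(routers, 1000))               # (0, 3)

# The same load with extra routers brought online is carried comfortably.
print(simulate_cascade(routers + [140] * 10, 1000))  # (20, 0)
```

In this model the total capacity (1,180 units) exceeds the load (1,000 units), yet the system still collapses, because each dropout raises the share carried by every remaining router. Adding routers lowers the per-router share below even the weakest router's capacity, which mirrors the fix described in the post.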

Treynor's post ended with a detailed explanation of Google's plans to prevent a repeat of the same problem:

"What's next ... Some of the actions are straightforward and are already done — for example, increasing request router capacity well beyond peak demand to provide headroom. Some of the actions are more subtle — for example, we have concluded that request routers don't have sufficient failure isolation (i.e. if there's a problem in one datacenter, it shouldn't affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load). We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements."
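The difference between refusing traffic and degrading gracefully can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of Google's implementation: load is split evenly, a router that degrades gracefully accepts every request, and its response time is assumed to grow in proportion to how far it is over capacity. The names `latencies_when_degrading` and `base_latency_ms` are hypothetical.

```python
def latencies_when_degrading(capacities, total_load, base_latency_ms=100.0):
    """Per-router latency when overloaded routers slow down instead of refusing.

    Assumes an even traffic split and a linear slowdown past capacity;
    both are simplifications chosen for illustration.
    """
    share = total_load / len(capacities)
    # No router drops out, so no load is shifted onto its neighbours.
    return [base_latency_ms * max(1.0, share / c) for c in capacities]


# One undersized router (capacity 50) receives a share of 100 units:
# it answers twice as slowly, but the other two are unaffected.
print(latencies_when_degrading([50, 100, 200], 300))  # [200.0, 100.0, 100.0]
```

The point of the design is that the overloaded router's trouble stays local. Users routed through it see slower responses, while the remaining routers keep their original share rather than inheriting extra traffic.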

He concluded: "Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity."
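For a sense of scale, the arithmetic behind a 99.9% figure can be worked out directly. The post does not say over what period availability is measured, so the 365-day and 30-day windows below are assumptions made only for illustration, as are the function names.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(target, window_days):
    """Minutes of downtime permitted by an availability target over a window."""
    return (1.0 - target) * window_days * MINUTES_PER_DAY


def availability(downtime_minutes, window_days):
    """Fraction of the window during which the service was up."""
    return 1.0 - downtime_minutes / (window_days * MINUTES_PER_DAY)


# Assumed windows; the source does not specify one.
print(round(downtime_budget_minutes(0.999, 365), 1))  # 525.6 minutes per year
print(round(downtime_budget_minutes(0.999, 30), 1))   # 43.2 minutes per 30 days

# A single 100-minute outage, with no other downtime in the window:
print(availability(100, 365) > 0.999)  # True
print(availability(100, 30) > 0.999)   # False
```

Measured over a full year, a 100-minute outage fits within a 99.9% target with room to spare. Measured over a single 30-day window, it would not.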

More Stories By Jeremy Geelan

Jeremy Geelan is Sr. Vice-President of SYS-CON Media & Events. He is Conference Chair of the worldwide Cloud Expo series, of the Virtualization Conference series, and of the upcoming UlitzerLIVE! event. He is the founder of Cloud Computing Journal, Web 2.0 Journal, AJAX & RIA Journal and other leading SYS-CON titles. From 2000 to 2006, first as editorial director and then as group publisher of SYS-CON Media, he was responsible for the development of all new titles and i-Technology portals for the firm. Today he has complete responsibility for the content of SYS-CON's entire portfolio of Events. He regularly represents SYS-CON Media & Events at conferences and trade shows, speaking to technology audiences both in North America and overseas. He is executive producer and presenter of "Power Panels with Jeremy Geelan" on SYS-CON.TV.
