Many have noted the data processing issues Google has been having of late, affecting services such as AdSense, YouTube, and Analytics. What is causing them? Evidence points to BigTable, Google’s distributed storage system (which sits atop Colossus, the successor to the Google File System), and MapReduce, Google’s renowned bulk-processing framework, choking at the I/O level.
Single Points of Failure
AppEngine users are no strangers to Google’s setup, which Google has exposed to them for several years now: version 3 of their core services all use the same infrastructure, spanning multiple redundant, geographically isolated datacenters across the world and managed by complex custom software that treats all those resources as one big pool, abstracted away from the applications they support.
While the physical details of the many machines composing their technical infrastructure are both a trade secret (if a poorly kept one) and outside the scope of this article, the storage system – the cause of these data delays – is not: BigTable, as its name implies, recasts the filesystem as one big redundant database, much as RAID lets multiple disks act as one toward the same goal.
Many machines, many disks in each (“presumably,” of course), and many layers of redundancy in the shared-storage architecture that all services, Google’s own or not, run atop.
The data in question is everything: every realtime recording of ad clicks, website visits for Analytics, YouTube hits, Gmail messages, the search index – anything nonvolatile gets thrown into BigTable atop Colossus and smeared across many, many machines for redundancy, while reaping the performance benefits à la RAID striping.
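To make that “filesystem as a redundant database” idea a bit more concrete, here’s a toy sketch – my own simplification, not Google’s actual API – of the data model the BigTable paper describes: a sparse, sorted map from (row key, column key, timestamp) to an uninterpreted blob of bytes, with replication faked here as a handful of in-memory copies standing in for tablet servers on different machines.

```python
import time

# Toy, in-memory stand-in for BigTable's data model (not Google's code):
# a sparse map from (row key, column key, timestamp) -> bytes, with every
# write duplicated across several "replicas" playing the role of tablet
# servers on different machines/disks.
class ToyBigTable:
    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]

    def write(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        for replica in self.replicas:           # redundant writes, RAID-style
            replica.setdefault((row, column), []).append((ts, value))

    def read_latest(self, row, column):
        # Any surviving replica can serve the read.
        for replica in self.replicas:
            versions = replica.get((row, column))
            if versions:
                return max(versions)[1]         # newest timestamp wins
        return None

# Usage: record an ad click the way Analytics/AdSense conceptually might.
table = ToyBigTable()
table.write("com.example/ads/click", "metrics:count", b"1")
print(table.read_latest("com.example/ads/click", "metrics:count"))
```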
And as a side note: Infrastructure VP Ben Treynor stated, regarding the 2008 Gmail outage, that all that data is backed up to a specialized tape backup system. Tape. The thing that puts shit on magnetic strips like it’s the ’80s. Cool, huh?
Needle in a Haystack
But what processes all that stored data? Enter MapReduce, the algorithm-as-a-library that acts as the Big Damn Sorting Machine for all those masses of bytes.
Without getting into the CompSci specifics of how MapReduce operates (think of the map() and reduce() functions, scaled way up), it is essentially an optimized means of processing massive amounts of data in a short span of time.
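To give a flavor of it, here’s a minimal single-machine sketch of the canonical word-count job in Python; the real framework runs these same two phases across thousands of workers, which this obviously doesn’t.

```python
# Single-machine sketch of MapReduce's canonical word-count job.
# Google's real framework shards the map and reduce phases across
# thousands of machines; this just shows the two phases themselves.
from collections import defaultdict
from functools import reduce

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each document emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group intermediate pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: collapse each key's values to a single result.
totals = {word: reduce(lambda a, b: a + b, counts)
          for word, counts in grouped.items()}
print(totals)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```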
It is used to rank pages against a search query in its most realtime application, and to identify the relevant bits of a domain when building indices as an example of its offline usage. A BigTable requires hefty tools, needless to say.
Putting IT All Together
When every service uses the same optimized and hardened algorithm to input, process and output enormous amounts of realtime data recorded within one of the most massive data stores in history, what could possibly go wrong?
As seen in the image above: shit does.
More specifically, look at the potential areas of failure here: cross-site sync issues, algorithm update issues, plain bugs, bandwidth starvation, massive disk failures when infrastructure disks are homogenized to cheap newer models that happen to have caching quirks that don’t play nice with realtime data-crunching software, upgrades gone wrong – you get the picture:
In other words, rather than having a single point of failure, such a large, interconnected, globally distributed enterprise infrastructure (whew!) can fail catastrophically in a variety of ways – here I only analyze the ones that would primarily affect data processing.
BigTable relies on: entire machines, multiple disks, site syncing, network interconnects, and storage interconnects.
MapReduce relies on: entire machines, operating system function, RAM integrity, site syncing, network interconnects, processors and their available cycles, and application integrity/security.
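To see why those dependency lists matter, here’s a back-of-the-envelope sketch – the per-component failure probabilities are invented for illustration, and failures are assumed independent, which they very much are not in real infrastructure – of how the odds of a run finishing cleanly shrink as the dependencies stack up.

```python
# Back-of-the-envelope: probability a data-processing run completes without
# any of its dependencies hiccuping. All failure rates are invented for
# illustration, and independence is assumed (real infrastructure isn't).
bigtable_deps = {"machines": 0.001, "disks": 0.002, "site sync": 0.001,
                 "network interconnects": 0.002, "storage interconnects": 0.001}
mapreduce_deps = {"machines": 0.001, "OS": 0.0005, "RAM": 0.0005,
                  "site sync": 0.001, "network interconnects": 0.002,
                  "CPU contention": 0.003, "application bugs": 0.002}

def success_probability(deps):
    p = 1.0
    for failure_rate in deps.values():
        p *= (1.0 - failure_rate)
    return p

p_run = success_probability(bigtable_deps) * success_probability(mapreduce_deps)
print(f"chance one run sees no failure at all: {p_run:.4f}")
# Over thousands of runs a day, even a ~1-2% per-run failure rate bites.
```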
Putting all this together, and taking note of the differing requirements of the two core services relevant to data processing, much can go wrong. The impact, as you can imagine, is devastating to the ad-powered giant.
Every service is affected by this, and as a result Google’s bottom line will suffer. Chris, right about now, would be hammering into your head that you should sell all your Google stock on the next positive delta since Q2/Q3 returns are going to be hampered – and he’d be right:
Ads are Google’s backbone, and practically everything they do revolves around serving and targeting ads better (with Steve Jobs’s advice to Google reportedly being to consolidate that data under one umbrella: Google+). That ad data is recorded in BigTable and processed with MapReduce, so when these services suffer, Google’s entire enterprise is threatened.
This issue alone could set them back months, since each suffering ad represents lost revenue. As stated above, the data is still there and is apparently being recorded properly, but the MapReduce delays are costing the giant revenue by the second.
As a guess, the data processing takes place in batches that mustn’t be interrupted lest the whole run be nullified: the collapse of a crucial machine or group of machines, or simply overwhelmed resources, could nix an entire batch of output that would normally represent a return for Google.
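If that guess is right, the cost story looks roughly like the sketch below – the revenue figure, failure rate, and retry policy are all hypothetical, purely for illustration – where a mid-batch failure yields nothing that cycle and the whole batch gets re-queued.

```python
# Hypothetical illustration: a batch either completes and yields its revenue,
# or fails partway and must be rerun from scratch, yielding nothing this cycle.
# Every number here is invented for the sake of the example.
import random

random.seed(42)
BATCHES = 1_000
REVENUE_PER_BATCH = 100.0     # invented figure, dollars
FAILURE_RATE = 0.05           # invented per-batch failure probability

completed_revenue = 0.0
reruns = 0
for _ in range(BATCHES):
    if random.random() < FAILURE_RATE:
        reruns += 1           # partial output discarded; whole batch re-queued
    else:
        completed_revenue += REVENUE_PER_BATCH

print(f"revenue realized this cycle: ${completed_revenue:,.0f}")
print(f"batches that must be rerun:  {reruns}")
```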
Bear in mind also that the MapReduce software is really just an algorithm packaged as a library, linked either statically or dynamically by each service, and as such is subject to a software development lifecycle and its pitfalls – possibly a separate one per service.
As a result, Google’s data processing will undoubtedly continue to suffer from the issues stated above.
But, in defense of it all, what do you expect? It’s a wonder the services aren’t down more often, given all the data and processing that power them.
Plus, in a company where relevance is absolutely necessary to drive the ad click/impression sales that make up its primary revenue, with shareholders and investors watching, each service must weigh operating cost against return; everything from the spinning magnets to the data-processing time delta is therefore optimized for speed, efficiency, and reduced operating cost.
On Google’s site, you can watch videos where they discuss the cost of a single query versus the power draw of a single rack of servers with respect to power management and uninterrupted/redundant power for those systems.
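As a rough, purely illustrative bit of math – every figure below is a guess, not anything Google has published – the kind of per-query energy accounting those videos gesture at looks something like this.

```python
# Purely illustrative per-query energy math; every figure here is a guess,
# not a number Google has published.
rack_power_watts = 10_000             # assumed draw for one loaded rack
queries_per_second_per_rack = 5_000   # assumed throughput

joules_per_query = rack_power_watts / queries_per_second_per_rack
print(f"energy per query: ~{joules_per_query:.2f} J")

# At scale, tiny per-query costs multiply fast:
daily_queries = 3_000_000_000         # assumed global query volume
daily_kwh = joules_per_query * daily_queries / 3_600_000
print(f"daily energy just for serving: ~{daily_kwh:,.0f} kWh")
```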
So, all in all, the service level is still a wonder – but the data processing delays do suck for those who depend on it the most, both inside and outside of Google.
What are your thoughts on this? Have you been affected by the outage? Let us know in the comments.
Mark is a "veteran" (and current) system administrator for a local IT firm in his hometown. He is notorious from his Coffee Desk days as the "funny guy" of the editorial staff, writing some pieces for sheer comic relief to the pleasure of many readers (example). Aside from his priceless humor, he has ample insight in the fields of networking and programming given his years of experience with them, often making quips about his own age in the process. Mark is the oldest member of the editors, and by far the most regular. Contributor, that is. :D