Hadoop and Realtime Cloud Computing

Architectures such as MapReduce and Hadoop are good for batch processing of big data, but bad for realtime processing

Big data is creating a massive disruption for the IT industry. Faced with exponentially growing data volumes in every area of business and the web, companies around the world are looking beyond their current databases and data warehouses for new ways to handle this data deluge.

Taking a lead from Google, a number of organizations have been exploring the potential of MapReduce, and its open source clone Hadoop, for big data processing. The MapReduce/Hadoop approach is based around the idea that what's needed is not database processing with SQL queries, but rather dataflow computing with simple parallel programming primitives such as map and reduce.
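To make the "map" and "reduce" primitives concrete, here is a minimal single-process sketch of a word count written in that style. It is only an illustration of the programming model, not Hadoop's or Google's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real framework would run the map and reduce calls as parallel tasks on many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single total."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(records):
    """Run the whole batch: map every record, group by key, reduce each group."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big batch data"]))
```

Note that nothing is available until the entire input has been mapped, grouped, and reduced, which is what makes this a batch model.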

As Google and others have shown, this kind of basic dataflow programming model can be implemented as a coarse-grain set of parallel tasks that can be run across hundreds or thousands of machines, to carry out large-scale batch processing on massive data sets.

Google themselves have been using MapReduce for batch processing for over six years, and others, such as Facebook, eBay and Yahoo, have been using Hadoop for the same kind of batch processing for several years now. So today, parallel dataflow is firmly established as an alternative to databases and data warehouses for offline batch processing of big data. But now the game is changing again...

In recent months, Google has realized that the web is now entering a new era, the realtime era, and that batch processing systems such as MapReduce and Hadoop cannot deliver performance anywhere near the speed required for new realtime services such as Google Instant. Google noted that

  • "MapReduce isn't suited to calculations that need to occur in near real-time"

and that

  • "You can't do anything with it that takes a relatively short amount of time, so we got rid of it"

Other industry leaders, such as Jeff Jonas, Chief Scientist for Analytics at IBM, have made similar remarks in recent weeks. In his recent video "Big Thoughts on Big Data", Jonas notes that with only batch processing tools to handle it, organizations grappling with a relentless avalanche of realtime data will get dumber over time rather than getting smarter.

  • "The idea of waiting for a batch job to run doesn't cut it. Instead, how can an organization make sense of what it knows, as a transaction is happening, so that it can do something about it right then?"
  • "I'm not a big fan of batch processes... I've never seen a batch system grow up and become a realtime streaming system, but you can take a realtime streaming system and make it eat batches all day long"
  • "I like Hadoop but it's meant for batch activities. That's not the kind of back-end you would use for realtime sense-making systems"

So coarse-grain dataflow architectures such as Hadoop are good for batch, but bad for realtime.

To power realtime big data apps we need a completely new type of fine-grain dataflow architecture: one that can, for example, continuously analyze a stream of events at a rate of, say, one million events per second per server, and deliver results with a maximum latency of five seconds between data in and analytics out. At Cloudscale we set out to crack this major technical problem, and to build the world's first "realtime data warehouse". The linearly scalable Cloudscale parallel dataflow architecture not only delivers game-changing realtime performance on commodity hardware, but also, as Jeff Jonas notes above, "can eat batches all day long" like a traditional MapReduce or Hadoop architecture. There isn't really an established name yet for such a system. I guess we could call it a "Redoop" architecture (Realtime Dataflow on Ordinary Processors, or Realtime Hadoop).
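The difference between fine-grain and coarse-grain processing can be sketched in a few lines. The example below is not Cloudscale's architecture, which the article does not describe in detail; it is a hypothetical single-process illustration in which every event updates a running result the moment it arrives, so an up-to-date answer can be read at any time rather than after a batch job finishes. The class name, the feed_batch helper, and the 10-second window are all assumptions chosen for illustration, and events are assumed to arrive in non-decreasing timestamp order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Keep per-key counts over the most recent `window_seconds` of events."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()      # (timestamp, key) pairs in arrival order
        self.counts = Counter()    # current counts inside the window

    def add(self, timestamp, key):
        """Process one event incrementally: update counts, then drop stale events."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        """Remove events that are `window_seconds` or more older than `now`."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def feed_batch(self, batch):
        """A stream processor can also consume a stored batch, one event at a time."""
        for timestamp, key in batch:
            self.add(timestamp, key)

    def snapshot(self):
        """Return the current result without waiting for any job to complete."""
        return dict(self.counts)


if __name__ == "__main__":
    window = SlidingWindowCounter(window_seconds=10.0)
    window.feed_batch([(0.0, "login"), (1.0, "click"), (4.0, "login")])
    print(window.snapshot())
    window.add(10.5, "purchase")
    print(window.snapshot())
```

Each event costs a small, bounded amount of work, which is the property that lets a design of this kind keep latency low. Reaching the event rates quoted above would of course require far more than this sketch, including partitioning the stream across cores and servers.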

More Stories By Bill McColl

Bill McColl left Oxford University to found Cloudscale. At Oxford he was Professor of Computer Science, Head of the Parallel Computing Research Center, and Chairman of the Computer Science Faculty. Along with Les Valiant of Harvard, he developed the BSP approach to parallel programming. He has led research, product, and business teams in a number of areas: massively parallel algorithms and architectures, parallel programming languages and tools, datacenter virtualization, realtime stream processing, big data analytics, and cloud computing. He lives in Palo Alto, CA.
