Google Dumps MapReduce

Batch processing systems like MapReduce and Hadoop are too slow for the new era of "realtime big data"

Over the past five years, MapReduce and Hadoop have been widely used for processing big data from the web, both in-house and in the cloud. However, we are now in an era where news, search, marketing, commerce, and many other key aspects of the web are becoming much more social, more mobile, and more realtime. In response, major web companies are realizing that the "big data analytics" driving many of their services must change radically to fit this realtime era. No company sees this more clearly than Google, the company that originally developed MapReduce, the approach on which Hadoop is based.
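
To make the batch-oriented nature of this model concrete, below is a minimal sketch of the MapReduce programming pattern in Python. It is a toy illustration of the map/shuffle/reduce idea, not Google's or Hadoop's actual implementation: the entire input must be mapped, shuffled, and reduced before any result is available, which is why latency is measured in minutes or hours rather than seconds.

    from collections import defaultdict

    def map_phase(documents):
        # Map step: emit (key, value) pairs, here (word, 1) for a word count.
        for doc in documents:
            for word in doc.split():
                yield (word, 1)

    def shuffle(pairs):
        # Shuffle step: group all emitted values by key; no reduction can
        # start until every mapped pair has been seen.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce step: combine the values for each key into a final result.
        return {key: sum(values) for key, values in groups.items()}

    # The whole dataset is consumed before any answer appears: batch style.
    documents = ["big data moves fast", "big data gets bigger"]
    print(reduce_phase(shuffle(map_phase(documents))))
    # {'big': 2, 'data': 2, 'moves': 1, 'fast': 1, 'gets': 1, 'bigger': 1}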

This week, the company unveiled Google Instant, its new realtime search system. Until recently, the indexing system for Google Search was the company's largest MapReduce application. But with the need to move to realtime search, it has now been replaced. As reported at the time, Google noted that

  • "MapReduce isn't suited to calculations that need to occur in near real-time"

and that

  • "You can't do anything with it that takes a relatively short amount of time, so we got rid of it"

Another article notes

  • "The challenge for Google has been how to support a real-time world when the core of their search technology, the famous MapReduce, is batch oriented. Simple, they got rid of MapReduce... MapReduce still excels as a general query mechanism against masses of data, but real-time search requires a very specialized tool"

We are now at the start of a new era in the big data world. Increasingly, big data apps will need to be realtime. For example, a recent list of "Ten Hadoop-able Problems" contains the following examples of big data problems that can be tackled with MapReduce/Hadoop:

  • Risk Analysis
  • Customer Churn
  • Recommendation Engines
  • Ad Targeting
  • Sales Analysis
  • Network Analysis
  • Fraud Detection
  • Trading Surveillance
  • Search Quality
  • General Data Analytics

In each case, these are big data problems where the ability to deliver analytics results in realtime would increase their value enormously.

At Cloudscale we've developed the first Realtime Data Warehouse, a new architecture aimed at delivering big data analytics in realtime - with latency in seconds instead of hours. The ten areas above are examples of the kinds of problems that can now be analyzed in realtime. There are also, of course, new areas that a Realtime Data Warehouse can tackle that are not possible at all with offline batch processing systems such as MapReduce and Hadoop. These include (see the sketch after this list):

  • Realtime Location Analytics
  • Realtime Game Analytics
  • Realtime Algorithmic Trading
  • Realtime Government Intelligence
  • Realtime Sensor Systems and Grids
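
To contrast with the batch sketch above, here is a minimal illustration of the "always-on" streaming style these problems require. It is a simplified, hypothetical sketch of continuous, incremental computation in general, not of Cloudscale's actual architecture: each event updates a sliding-window aggregate the moment it arrives, so a result is available within seconds rather than after an hours-long batch run.

    import time
    from collections import deque, defaultdict

    class SlidingWindowCounter:
        """Continuously maintains per-key event counts over the last N seconds."""

        def __init__(self, window_seconds=60):
            self.window = window_seconds
            self.events = defaultdict(deque)  # key -> timestamps of recent events

        def observe(self, key, timestamp=None):
            # Incremental update: record the event, evict expired timestamps,
            # and return the current count immediately. No batch job required.
            now = time.time() if timestamp is None else timestamp
            q = self.events[key]
            q.append(now)
            while q and q[0] < now - self.window:
                q.popleft()
            return len(q)

    # Hypothetical realtime fraud-detection use: flag a card the instant it
    # exceeds 3 transactions within a 60-second window.
    counter = SlidingWindowCounter(window_seconds=60)
    for t in (0, 10, 20, 30):
        count = counter.observe("card-1234", timestamp=t)
        if count > 3:
            print(f"t={t}s: alert, {count} transactions in the last minute")

In a production system the same incremental pattern would run over a distributed, partitioned event stream, but the key property is identical: state is updated per event, so the analytics are never more than seconds stale.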

As we move beyond MapReduce and Hadoop into this new era of "realtime big data", where analytics apps are "always-on" and run continuously, we can expect a major wave of software innovation, with many exciting new realtime apps from developers in areas such as marketing intelligence, social commerce, social enterprise, and the mobile web.

About Bill McColl

Bill McColl left Oxford University to found Cloudscale. At Oxford he was Professor of Computer Science, Head of the Parallel Computing Research Center, and Chairman of the Computer Science Faculty. Along with Les Valiant of Harvard, he developed the BSP approach to parallel programming. He has led research, product, and business teams in a number of areas: massively parallel algorithms and architectures, parallel programming languages and tools, datacenter virtualization, realtime stream processing, big data analytics, and cloud computing. He lives in Palo Alto, CA.
