Welcome!

@CloudExpo Authors: Elizabeth White, Yeshim Deniz, Rishi Bhargava, Liz McMillan, Dana Gardner

Related Topics: @CloudExpo, Open Source Cloud, Apache

@CloudExpo: Article

Hadoop and Realtime Cloud Computing

Architectures such as MapReduce and Hadoop are good for batch processing of big data, but bad for realtime processing

Big data is creating a massive disruption for the IT industry. Faced with exponentially growing data volumes in every area of business and the web, companies around the world are looking beyond their current databases and data warehouses for new ways to handle this data deluge.

Taking a lead from Google, a number of organizations have been exploring the potential of MapReduce, and its open source clone Hadoop, for big data processing. The MapReduce/Hadoop approach is based around the idea that what's needed is not database processing with SQL queries, but rather dataflow computing with simple parallel programming primitives such as map and reduce.

As Google and others have shown, this kind of basic dataflow programming model can be implemented as a coarse-grain set of parallel tasks that can be run across hundreds or thousands of machines, to carry out large-scale batch processing on massive data sets.

Google themselves have been using MapReduce for batch processing for over six years, and others, such as Facebook, eBay and Yahoo have been using Hadoop for the same kind of batch processing for several years now. So today, parallel dataflow is firmly established as an alternative to databases and data warehouses for offline batch processing of big data. But now the game is changing again...

In recent months, Google has realized that the web is now entering a new era, the realtime era, and that batch processing systems such as MapReduce and Hadoop cannot deliver performance anywhere near the speed required for new realtime services such as Google Instant. Google noted that

  • "MapReduce isn't suited to calculations that need to occur in near real-time"

and that

  • "You can't do anything with it that takes a relatively short amount of time, so we got rid of it"

Other industry leaders, such as Jeff Jonas, Chief Scientist for Analytics at IBM, have made similar remarks in recent weeks. In his recent video "Big Thoughts on Big Data", Jonas notes that with only batch processing tools to handle it, organizations grappling with a relentless avalanche of realtime data will get dumber over time rather than getting smarter.

  • "The idea of waiting for a batch job to run doesn't cut it. Instead, how can an organization make sense of what it knows, as a transaction is happening, so that it can do something about it right then"
  • "I'm not a big fan of batch processes... I've never seen a batch system grow up an become a realtime streaming system, but you can take a realtime streaming system and make it eat batches all day long"
  • "I like Hadoop but it's meant for batch activities. That's not the kind of back-end you would use for realtime sense-making systems"

So coarse-grain dataflow architectures such as Hadoop are good for batch, but bad for realtime.

To power realtime big data apps we need a completely new type of fine-grain dataflow architecture. An architecture that can, for example, continuously analyze a stream of events at a rate of say one million events per second per server, and deliver results with a maximum latency of five seconds between data in and analytics out. At Cloudscale we set out to crack this major technical problem, and to build the world's first "realtime data warehouse". The linearly scalable Cloudscale parallel dataflow architecture not only delivers game-changing realtime performance on commodity hardware, but also, as Jeff Jonas notes above "can eat batches all day long" like a traditional MapReduce or Hadoop architecture. There isn't really an established name yet for such a system. I guess we could call it a "Redoop" architecture (Realtime Dataflow on Ordinary Processors, or Realtime Hadoop).

More Stories By Bill McColl

Bill McColl left Oxford University to found Cloudscale. At Oxford he was Professor of Computer Science, Head of the Parallel Computing Research Center, and Chairman of the Computer Science Faculty. Along with Les Valiant of Harvard, he developed the BSP approach to parallel programming. He has led research, product, and business teams, in a number of areas: massively parallel algorithms and architectures, parallel programming languages and tools, datacenter virtualization, realtime stream processing, big data analytics, and cloud computing. He lives in Palo Alto, CA.

@CloudExpo Stories
SYS-CON Events announced today that Auditwerx will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Auditwerx specializes in SOC 1, SOC 2, and SOC 3 attestation services throughout the U.S. and Canada. As a division of Carr, Riggs & Ingram (CRI), one of the top 20 largest CPA firms nationally, you can expect the resources, skills, and experience of a much larger firm combined with the accessibility and attent...
SYS-CON Events announced today that Technologic Systems Inc., an embedded systems solutions company, will exhibit at SYS-CON's @ThingsExpo, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Technologic Systems is an embedded systems company with headquarters in Fountain Hills, Arizona. They have been in business for 32 years, helping more than 8,000 OEM customers and building over a hundred COTS products that have never been discontinued. Technologic Systems’ pr...
SYS-CON Events announced today that CA Technologies has been named “Platinum Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business – from apparel to energy – is being rewritten by software. From ...
SYS-CON Events announced today that HTBase will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. HTBase (Gartner 2016 Cool Vendor) delivers a Composable IT infrastructure solution architected for agility and increased efficiency. It turns compute, storage, and fabric into fluid pools of resources that are easily composed and re-composed to meet each application’s needs. With HTBase, companies can quickly prov...
Imagine having the ability to leverage all of your current technology and to be able to compose it into one resource pool. Now imagine, as your business grows, not having to deploy a complete new appliance to scale your infrastructure. Also imagine a true multi-cloud capability that allows live migration without any modification between cloud environments regardless of whether that cloud is your private cloud or your public AWS, Azure or Google instance. Now think of a world that is not locked i...
What if you could build a web application that could support true web-scale traffic without having to ever provision or manage a single server? Sounds magical, and it is! In his session at 20th Cloud Expo, Chris Munns, Senior Developer Advocate for Serverless Applications at Amazon Web Services, will show how to build a serverless website that scales automatically using services like AWS Lambda, Amazon API Gateway, and Amazon S3. We will review several frameworks that can help you build serverle...
SYS-CON Events announced today that Loom Systems will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2015, Loom Systems delivers an advanced AI solution to predict and prevent problems in the digital business. Loom stands alone in the industry as an AI analysis platform requiring no prior math knowledge from operators, leveraging the existing staff to succeed in the digital era. With offices in S...
In his session at Cloud Expo, Alan Winters, an entertainment executive/TV producer turned serial entrepreneur, will present a success story of an entrepreneur who has both suffered through and benefited from offshore development across multiple businesses: The smart choice, or how to select the right offshore development partner Warning signs, or how to minimize chances of making the wrong choice Collaboration, or how to establish the most effective work processes Budget control, or how to max...
MongoDB Atlas leverages VPC peering for AWS, a service that allows multiple VPC networks to interact. This includes VPCs that belong to other AWS account holders. By performing cross account VPC peering, users ensure networks that host and communicate their data are secure. In his session at 20th Cloud Expo, Jay Gordon, a Developer Advocate at MongoDB, will explain how to properly architect your VPC using existing AWS tools and then peer with your MongoDB Atlas cluster. He'll discuss the secur...
Buzzword alert: Microservices and IoT at a DevOps conference? What could possibly go wrong? In this Power Panel at DevOps Summit, moderated by Jason Bloomberg, the leading expert on architecting agility for the enterprise and president of Intellyx, panelists peeled away the buzz and discuss the important architectural principles behind implementing IoT solutions for the enterprise. As remote IoT devices and sensors become increasingly intelligent, they become part of our distributed cloud enviro...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
SYS-CON Events announced today that Infranics will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Since 2000, Infranics has developed SysMaster Suite, which is required for the stable and efficient management of ICT infrastructure. The ICT management solution developed and provided by Infranics continues to add intelligence to the ICT infrastructure through the IMC (Infra Management Cycle) based on mathemat...
@DevOpsSummit has been named the ‘Top DevOps Influencer' by iTrend. iTrend processes millions of conversations, tweets, interactions, news articles, press releases, blog posts - and extract meaning form them and analyzes mobile and desktop software platforms used to communicate, various metadata (such as geo location), and automation tools. In overall placement, @DevOpsSummit ranked as the number one ‘DevOps Influencer' followed by @CloudExpo at third, and @MicroservicesE at 24th.
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 add...
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
In his General Session at 16th Cloud Expo, David Shacochis, host of The Hybrid IT Files podcast and Vice President at CenturyLink, investigated three key trends of the “gigabit economy" though the story of a Fortune 500 communications company in transformation. Narrating how multi-modal hybrid IT, service automation, and agile delivery all intersect, he will cover the role of storytelling and empathy in achieving strategic alignment between the enterprise and its information technology.
Microservices are a very exciting architectural approach that many organizations are looking to as a way to accelerate innovation. Microservices promise to allow teams to move away from monolithic "ball of mud" systems, but the reality is that, in the vast majority of organizations, different projects and technologies will continue to be developed at different speeds. How to handle the dependencies between these disparate systems with different iteration cycles? Consider the "canoncial problem" ...
In his session at 20th Cloud Expo, Scott Davis, CTO of Embotics, will discuss how automation can provide the dynamic management required to cost-effectively deliver microservices and container solutions at scale. He will discuss how flexible automation is the key to effectively bridging and seamlessly coordinating both IT and developer needs for component orchestration across disparate clouds – an increasingly important requirement at today’s multi-cloud enterprise.
The essence of cloud computing is that all consumable IT resources are delivered as services. In his session at 15th Cloud Expo, Yung Chou, Technology Evangelist at Microsoft, demonstrated the concepts and implementations of two important cloud computing deliveries: Infrastructure as a Service (IaaS) and Platform as a Service (PaaS). He discussed from business and technical viewpoints what exactly they are, why we care, how they are different and in what ways, and the strategies for IT to transi...
The Internet of Things is clearly many things: data collection and analytics, wearables, Smart Grids and Smart Cities, the Industrial Internet, and more. Cool platforms like Arduino, Raspberry Pi, Intel's Galileo and Edison, and a diverse world of sensors are making the IoT a great toy box for developers in all these areas. In this Power Panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists discussed what things are the most important, which will have the most profound e...