

@CloudExpo: Blog Feed Post

Are Hadoop's Days Numbered?

GigaOm says that since Hadoop is disk/ETL/batch based, it is a poor fit for real-time processing of frequently changing data

Interesting article at GigaOm: http://bit.ly/OINpfr. I won’t repeat the main points, but basically it says that since Hadoop is disk/ETL/batch based, it is a poor fit for real-time processing of frequently changing data. The author correctly points out that real-time processing (i.e. perceptual real time, meaning response times from sub-second to a few seconds) is becoming a HUGE trend that’s impossible to ignore. He points to Google, which moved away from a Hadoop MapReduce-like approach towards massively distributed in-memory platforms for projects like Percolator and Dremel…

So, What’s New?!

The widespread confusion about Hadoop’s role and its applicability is becoming alarming… Hadoop was never designed to process anything in real time, to process live streaming data, or to process anything that changes rapidly. Hadoop’s core is HDFS, a highly scalable distributed file system that runs on spinning disks and supports efficient batch storage of and access to data. It is an excellent data warehouse technology that scales to petabytes of data on commodity hardware. And Hadoop does an excellent job at this.

Now, the Hadoop ecosystem also has MapReduce (and various satellite projects like Pig, Hive, etc.). Hadoop’s implementation of MapReduce (as well as Pig, Hive, HBase, etc.) “suffers” from exactly the same limitation: it works over HDFS and is therefore architecturally batch and disk oriented. Let me repeat it again: Hadoop MapReduce was never meant to process anything in real time or to work on live streaming data. Period, end of story. It was designed to work over datasets stored in disk-based HDFS, and it does so very well.
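
To make the batch model concrete, here is a minimal sketch of the map, shuffle and reduce phases in plain Python. It is not Hadoop's API; the function names (map_phase, shuffle_phase, reduce_phase, batch_word_count) are hypothetical and chosen only for illustration. The point it tries to show is that the job consumes a complete, already stored dataset and produces an answer only once every phase has finished.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key; needs the full map output."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def batch_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the whole job over the whole dataset, start to finish."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical stand-in for lines read from files stored on disk.
    dataset = ["hadoop stores data", "hadoop processes data in batch"]
    print(batch_word_count(dataset))
```

In this sketch, if one new line arrives, the only way to refresh the result is to run batch_word_count over the full dataset again, which is the limitation the paragraph above describes.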

Are They Really Numbered?

I don’t see anything on the horizon that would displace Hadoop HDFS. There’s a clear business use case & demand for massive disk-based storage on petabyte/exabyte scale – and Hadoop HDFS is a clear industry choice today. Hadoop HDFS is here to stay for a long time…

But as Gartner’s Merv Adrian says, Big Data has two sides to its coin: storage and processing. Hadoop HDFS provides excellent storage technology, but its processing side isn’t as shiny. As I (and many others) have mentioned, Hadoop MapReduce is bound by the limitations of HDFS: batch, offline-oriented, disk-based processing. Some companies will be content with those limitations (and for many that is just fine). Others will follow Facebook, Google and Twitter in moving away from disk-based, offline processing towards real-time in-memory data platforms.

What is very important to understand is that the move to in-memory processing isn’t only about raw speed (although RAM access is up to 10,000,000 times faster than disk). What’s more important is that keeping your working set in memory enables a completely new family of algorithms. Incremental indexing (Google’s Percolator, GridGain’s Data Grid), streaming MapReduce/CEP (GridGain’s Compute Grid, Twitter’s Storm), etc.: it isn’t that Hadoop engineers just didn’t know about these; rather, they are largely enabled by in-memory technology.
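
As a rough illustration of the kind of algorithm an in-memory working set makes practical, here is a small sketch of an incrementally updated sliding-window counter, a very simplified cousin of streaming/CEP processing. It is not how Percolator, GridGain or Storm are implemented; the class name SlidingWindowCounter and its methods are hypothetical, and the sketch assumes events arrive with non-decreasing timestamps.

```python
from collections import Counter, deque
from typing import Deque, Tuple


class SlidingWindowCounter:
    """Keeps per-key counts for events seen in the last `window_seconds`.

    Assumption: timestamps passed to add() never decrease.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, str]] = deque()  # oldest on the left
        self._counts: Counter = Counter()

    def add(self, timestamp: float, key: str) -> None:
        """Record one event and drop everything that fell out of the window."""
        self._events.append((timestamp, key))
        self._counts[key] += 1
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        """Remove events with timestamp <= now - window_seconds."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_key = self._events.popleft()
            self._counts[old_key] -= 1
            if self._counts[old_key] == 0:
                del self._counts[old_key]

    def count(self, key: str) -> int:
        """Current count for `key`, answered from memory without a rescan."""
        return self._counts.get(key, 0)


if __name__ == "__main__":
    window = SlidingWindowCounter(window_seconds=10)
    window.add(0, "login")
    window.add(5, "login")
    window.add(12, "error")  # the event at t=0 is now outside the window
    print(window.count("login"), window.count("error"))
```

Each update touches only the new event and the few events that expire, so the answer is always current. Doing the same against a disk-based batch store would mean rescanning the stored data for every query.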

Naturally, in-memory technologies don’t invalidate the need for Hadoop HDFS, the proverbial data warehouse. In many cases (but not all) HDFS can happily coexist with something like GridGain, which provides native upstream and downstream integration with HDFS, enabling you to do streaming MapReduce/CEP processing on data in HDFS, among many other things.

To sum up my thoughts: I believe and hope that Hadoop HDFS is here to stay, and that we’ll see more and more companies moving away from disk-based processing towards all kinds of in-memory technologies.


More Stories By Thomas Krafft

Over 15 years of experience in marketing and demand creation, with strategies driving over $500 million in revenue for a variety of companies in several high-growth and competitive markets, including consumer software and web services, ecommerce, demand creation through web and search, big data, and now healthcare.

