By Bob Gourley
December 13, 2014 12:00 PM EST
Since the birth of Hadoop in 2005-06, the way we think about storing and processing information has evolved considerably. The term “Big Data” has become synonymous with this evolution. But still, many of our customers continue to ask, “What is Big Data?”, “What are its use cases?”, and “What is its business value?”. The Internet is overloaded with definitions, characteristics, and benefits; however, few discussions synthesize all three of these topics in one place. This paper answers these questions and proposes a total cost calculation framework for CTOs and CIOs who are evaluating solutions for their organization’s use case(s). In the text below, I examine an on-premise Hadoop ecosystem as a general-purpose Big Data solution in relation to alternative commercial purpose-built storage technologies (e.g., Oracle, Teradata, IBM, SAP, Microsoft, EMC, etc.). It may be difficult to determine the exact point at which you should leverage one over the other. It is my contention that when the total cost of using all your data exceeds what you are able to spend using purpose-built technologies, it is time to consider using a general-purpose solution like Hadoop for process offloading.
What is Big Data?
According to Edd Dumbill, a well-respected thought leader and VP of Strategy for Silicon Valley Data Science, a big data and data science consulting company, Big Data is “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” This definition was published in an article entitled “What is Big Data?” in Big Data Now: 2012 Edition by O’Reilly Media, and touches on the three primary characteristics of data:
- Volume: The size of your data. Data has mass, and there is a cost to moving it around the network.
- Velocity: The speed with which the data either arrives or is created, and how quickly it needs to be consumed in order to make use of it.
- Variety: The differences in structure between all the types of data within an enterprise.
Scour the Internet and you will find other, less commonly discussed but relevant characteristics associated with Big Data, including Veracity and Volatility. Veracity refers to the truthfulness of the data or your degree of trust in what it is conveying. Volatility is how often your existing data changes or is updated by the new data you are receiving/creating. There are certainly other characteristics as well. I recently began using the term Viscosity to describe the degree of data fragmentation in client environments, and the level of effort required to reassemble it into a coherent view. In this context, organizations with high viscosity have significant fragmentation and duplication throughout their enterprise.
The term “Big Data” has come to focus on these characteristics and to imply that traditional database architectures, such as online transaction processing (OLTP) and online analytical processing (OLAP) purpose-built technologies, simply will not scale to meet your data capacity needs. However, massively parallel processing (MPP) database architectures are one example of purpose-built technologies that have been developed to support both OLTP and OLAP data structures at enormous scales, up into the petabytes (PB). For example, there is a 50 PB MPP cluster at eBay. Certainly this size conforms to any logical definition of Big Data.
Depending on your use case, it is possible that a purpose-built technology may suit your needs at scale. While MPP systems remain most effective with structured, tabular and transactional data sets, it is possible to store almost everything except massive files in relational structures. However, this may not be the best fit for your use case(s) in terms of appropriateness or cost. There is limited published pricing data for commercial offerings, but MPP systems are notably expensive. When including the cost of software, hardware, and licensing/support, the cost per terabyte (TB) of an MPP system is estimated at tens of thousands of dollars. At these prices, a one PB system can cost tens of millions of dollars. In contrast, the equivalent cost of Hadoop is roughly $2,000 per TB, so a one PB Hadoop cluster costs roughly $2 million. That is a significant initial cost savings; however, use case(s) will always drive the total cost of any solution.
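The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 PB = 1,000 TB) and uses $20,000 per TB as an illustrative stand-in for “tens of thousands of dollars”; only the $2,000 per TB Hadoop figure comes from the text above.

```python
TB_PER_PB = 1_000  # assumption: decimal units, 1 PB = 1,000 TB


def cluster_cost(cost_per_tb: float, petabytes: float) -> float:
    """Return the estimated cost in dollars of a cluster holding `petabytes` of data."""
    return cost_per_tb * petabytes * TB_PER_PB


HADOOP_COST_PER_TB = 2_000  # figure quoted in the text
MPP_COST_PER_TB = 20_000    # assumption: illustrative value for "tens of thousands"

hadoop_total = cluster_cost(HADOOP_COST_PER_TB, 1)
mpp_total = cluster_cost(MPP_COST_PER_TB, 1)

print(f"Hadoop, 1 PB: ${hadoop_total:,.0f}")
print(f"MPP, 1 PB: ${mpp_total:,.0f}")
print(f"Ratio: {mpp_total / hadoop_total:.0f}x")
```

Substituting your own per-TB quotes for the two constants gives a first-order comparison of initial cost only; it says nothing about staffing, migration, or other components of total cost.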
Coincidentally, eBay has also released information on their production 50 PB Hadoop cluster, one of the largest such clusters in the world. The fact that eBay uses both types of systems demonstrates that there is a place for each, and that the difference may come down to price and purpose. Given the relatively lower cost of Hadoop, I submit that it is easier to identify Big Data if we add cost to our definition. Therefore, Big Data is the result when (a) the sum of all your data’s characteristics coupled with (b) the resources required to achieve your use case exceeds (c) the cost you are willing/able to spend using traditional approaches. When that inflection point is reached, it is clearly time to consider other, non-traditional approaches for process offloading. Each unique situation warrants a cost/benefit analysis to determine if a general-purpose solution like Hadoop is right for your use case.
What are the Use Cases for Big Data?
Process offloading refers to the act of moving workloads from one implementation to another to achieve better suitability, performance, availability, etc., at a lower price point. Both traditional and non-traditional solutions have advantages and disadvantages given a particular workload, and they should be leveraged accordingly to maximize cost efficiencies. Let us examine Hadoop’s use cases for process offloading.
Hadoop comprises two major components: the Hadoop Distributed File System (HDFS) and MapReduce, a framework for writing applications that process large amounts of content across multiple nodes (servers). Hadoop is often referred to as a schemaless system because data is not forced into a schema upon ingest. Ultimately, there is a structure, known as the key/value pair, in which data is expressed as a collection of [key]->[value] tuples or records. This is one of the most fundamental data structures in computer science. Hadoop uses the key/value pair because nearly any data can be expressed, stored, processed and retrieved using this minimal structure. Because key/value is so rudimentary, a schema can be applied at query time based on the question being asked. This adds tremendous flexibility and differs significantly from traditional approaches like OLTP and OLAP, which require you to define the data model up front and to understand the questions you intend to ask. Figure 1 illustrates these different process flow models. Having to know what questions you intend to ask, and constructing a pre-defined schema around them, adds artificial constraints to the answers you are able to get from the data.
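To make the idea of applying a schema at query time concrete, here is a minimal single-process sketch in Python. It is not the Hadoop API: the record layout, the function names (`map_status`, `map_user`, `reduce_sum`, `run_job`), and the sample data are all hypothetical, and the shuffle phase is imitated with an in-memory dictionary.

```python
from collections import defaultdict

# Raw records are kept as (key, value) pairs; no schema is imposed at ingest.
raw_records = [
    ("web-001", "2014-12-01,alice,/home,200"),
    ("web-002", "2014-12-01,bob,/cart,500"),
    ("web-003", "2014-12-02,alice,/cart,200"),
]


def map_status(key, value):
    """Map step: interpret the raw value at read time and emit (status, 1)."""
    date, user, path, status = value.split(",")
    yield status, 1


def map_user(key, value):
    """A different question reads the same raw data and emits (user, 1)."""
    date, user, path, status = value.split(",")
    yield user, 1


def reduce_sum(key, values):
    """Reduce step: total the values emitted for one key."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)  # shuffle: group by emitted key
    return dict(reducer(k, v) for k, v in grouped.items())


hits_by_status = run_job(raw_records, map_status, reduce_sum)
hits_by_user = run_job(raw_records, map_user, reduce_sum)
print(hits_by_status)  # {'200': 2, '500': 1}
print(hits_by_user)    # {'alice': 2, 'bob': 1}
```

The stored records never change between the two jobs; only the mapper, which carries the interpretation of the value, differs. That is the sense in which the schema follows the question rather than the other way around.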
Another issue with schema-based systems is scalability. Traditional relational architectures scale vertically with ease, but are difficult to design for horizontal scaling because their rigid data structures (tables, table relationships, rows, columns, indices) must be sharded, or split, across multiple nodes. The integrity of these structures must be maintained while offering near-real-time (online) create, read, update and delete (CRUD) operations on data. This is not trivial, and it requires commercial companies to make significant financial investments to do it well, which drives up the cost of those solutions. As a schemaless system, the latest release of Hadoop (2.x) scales horizontally to 10,000+ nodes without the added complexity inherent to traditional MPP systems.
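As a rough illustration of why independent key/value records are easy to spread across nodes, the sketch below assigns each record to a node using only a hash of its key. This is a generic hash-partitioning example, not a description of how HDFS places blocks; the function names and the four-node cluster are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def partition(records, num_nodes):
    """Group (key, value) records by the node that would own them."""
    shards = defaultdict(list)
    for key, value in records:
        shards[node_for_key(key, num_nodes)].append((key, value))
    return shards


# Hypothetical data set: 1,000 independent records keyed by user.
records = [(f"user-{i}", {"visits": i}) for i in range(1000)]
shards = partition(records, num_nodes=4)

for node in sorted(shards):
    print(f"node {node}: {len(shards[node])} records")
```

Each record can be placed without consulting any other record. A sharded relational system has the additional job of keeping related rows, indices, and constraints consistent across shards, which is where much of the complexity described above comes from.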
Many organizations have purpose-built solutions for asking business intelligence questions, providing disaster recovery/backup, etc., but scaling these solutions beyond an initial, narrowly defined usage for structured data usually involves significant cost increases. As a schemaless computational file system, Hadoop can be applied to an almost endless set of challenges at a lower cost. Below we walk through six higher-order use cases to illustrate how these savings can be realized:
1. Raw Storage/Data Lake: Backing up all the data your enterprise collects and creates daily, to include its historical holdings, for continuity of operations (COOP) and disaster recovery (DR) has previously been too expensive, and therefore unfeasible. Instead, businesses make difficult tradeoffs as to what will and will not be recoverable should disaster strike. Imagine the possibilities if you were able to economically store everything in your enterprise for the price of traditional commodity hard disks. Fortunately, Hadoop makes this dream a reality with its internally redundant data structure that by default makes three copies of all data written to HDFS. This scalable, schemaless raw storage lends itself conceptually to what is now being called a “data lake”. A data lake is based on the notion that data can be tagged with metadata about its source, contents, structure and other characteristics. These properties stay with the data as it is reduced to key/value pairs and written to the Hadoop file system. To process the data, one only needs to use these properties to select the data of interest. This allows many different types of data to exist side-by-side within the simple structures of the data lake. The amount of pre-processing is minimal, as data is no longer fit into specific schemas up front, making the data accessible for a wider variety of purposes. This would not be cost effective using traditional commercial systems.
2. Multi-Format Data Analysis: There are many different types of data beyond structured and unstructured text, to include audio, video, and images. Analyzing structured and unstructured text at scale can be an expensive and difficult challenge, but analyzing large collections of digital media is not even possible using traditional relational systems. Many businesses have previously been unable to unlock the potential of their data holdings due to an inability to process digital content, such as the ability to analyze and track objects in video, or to identify and extract biomarkers in healthcare images. HDFS accepts all these formats for analysis without the need for a schema. Hadoop’s ability to work with unstructured text and binary data (audio, video, imagery) extends well beyond the native capabilities offered by existing storage solutions, providing an enormous capability advantage.
3. Data Cleansing/Transformation: Businesses often contend with multiple relational data models, unstructured text and streaming data. You likely need to correlate, cleanse, de-duplicate, synchronize and normalize/de-normalize these data sets as they move between databases and tools to create a complete, clean operating picture for downstream analysis. The vast majority of work in conducting analytics is often preparing the data for use. In addition, new initiatives to leverage autonomous self-reporting devices and sensors provide continuous streams of data, creating explosions in the amount of information if used in their raw form. Purpose-built technologies present challenges when attempting these types of tasks due to their reliance on schemas. General purpose solutions, like the Hadoop ecosystem, deliver an economical way of storing, pre-processing and/or summarizing these data sets and streams, thereby minimizing the unchecked growth in commercial licensing investments within your enterprise.
4. Data Exploration: When new questions arise, the relevant variables and their relationships must be identified from within your data before you can begin to calculate definitive answers. However, these elements are not always understood, nor are the best algorithms for analyzing the data. Exploration is often required in order to build a model that will answer the questions being asked. Traditional relational architectures with pre-defined schemas are not likely to provide a platform for discovery. In these cases, identifying key variables and useful analytic methods is a trial/error process. Hadoop provides a flexible, schemaless environment that reduces the friction associated with the iterative process of exploring and analyzing data when the model is unclear. Hadoop provides a sandbox for exploring data without having to increase commercial capacity or spend the time building new schemas.
5. Data Science & Personalization: Data science leverages tools and techniques from many different areas of study, to include statistics, machine learning, mathematics, probability/uncertainty modeling, etc., to surface meaning from data and generate data-driven products. This is essentially the art of making data actionable, either by a user or a machine. Data science is not exclusive to Big Data, but there is tremendous knowledge potential in large data sets. One use of data science is for personalization, the act of exhaustively analyzing large quantities of related data, such as the online behaviors of millions of Internet users, in order to calculate recommendations for a specific individual. The results are then presented in the form of “you might also like” books, movies, and other targeted advertisements. These techniques are also being applied to healthcare, where symptoms, genetics, treatments, and outcomes are being analyzed to optimize treatment for specific individuals. Hadoop is a well-suited platform for collecting, synthesizing, munging, cleaning and joining disparate data sets for analysis to achieve decision-relevant insight.
6. Data Anonymization: Certain industries, perhaps healthcare more than any other, require anonymized data for research. Rules governing the release of such data to the public generally require that the information contain no personally identifiable information (PII). In the case of healthcare specifically, the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, which took effect in 2003, governs the use and disclosure of protected health information (PHI). The Privacy Rule does not give a specific algorithm for achieving the level of de-identification required, and there are many ways to approach anonymization in general, which depend greatly on how the data will be used. If a portion of your business model relies on providing anonymized data to internal or external groups for analysis, you want that process to be as clear, efficient, and repeatable as possible. Hadoop provides a solution for codifying and institutionalizing these algorithms for your enterprise. This increases the speed and effectiveness of all groups depending on anonymized data, providing them with an approved, documented process, and an authoritative source from which to receive data.
This list is not meant to be exhaustive, and there are definitely other use cases. Each use case is applicable to a wide variety of domains, to include finance, cyber, healthcare, defense, and scientific research.
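As one example of what codifying a use case might look like, the sketch below expresses the anonymization step from use case 6 as a single repeatable function that could serve as the map step of a job. It is only an illustration: the record layout, the list of identifier fields, and the key are hypothetical, and keyed hashing on its own is pseudonymization, which should not be assumed to satisfy any particular de-identification standard.

```python
import hashlib
import hmac

# Hypothetical secret held by the data owner; in practice it would come from a key store.
SECRET_KEY = b"replace-with-a-managed-secret"

# Assumption: fields treated as direct identifiers in this toy record layout.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone"}


def pseudonym(value: str) -> str:
    """Derive a stable keyed token so records for one person can still be joined."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def anonymize(record: dict) -> dict:
    """Drop direct identifiers and replace the patient ID with a pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["patient_id"] = pseudonym(record["patient_id"])
    return cleaned


records = [
    {"patient_id": "P-1001", "name": "Jane Doe", "ssn": "000-00-0000",
     "phone": "555-0100", "diagnosis": "asthma", "outcome": "improved"},
    {"patient_id": "P-1001", "name": "Jane Doe", "ssn": "000-00-0000",
     "phone": "555-0100", "diagnosis": "asthma", "outcome": "stable"},
]

released = [anonymize(r) for r in records]
for row in released:
    print(row)
```

Because the function is deterministic for a given key, every group that receives data produced this way sees the same pseudonym for the same patient, which is the repeatability the use case calls for.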
What is the Business Value of Big Data?
Our new definition of Big Data (when the cost of using all your data for your use case exceeds what you are able to spend using purpose-built technologies) lends itself to a cost/benefit analysis. Figure 2 establishes a rubric through which to express the decision calculus of Big Data for process offloading. This framework illustrates the components of cost, discussed below, that every CIO and CTO should take into account when evaluating solutions for their use cases.
Projects should always start with gathering and analyzing requirements. In an analytic context, these are the questions you want to ask of your data or, more generally, how you intend to use the data you would like to store in Hadoop. These requirements have obvious implications for leveraging the relevant data assets.
The Data / Characteristics (AS-IS) corner of the triangle refers to all data related to your requirements, and all the attributes discussed earlier, to include the amount, how quickly it grows/changes, differences in type/structure, where it resides on the network, etc.
Once the associated data has been identified, a solution is identified and designed. In the case of data analysis, the solution often involves models and techniques to change and analyze the data to find answers. Overall, this step includes any processes, human or machine, that are necessary to get the results you are looking for.
The Purpose / Answers (TO-BE) corner is your end-state vision, which is sometimes expressed in terms of success criteria and/or key performance indicators. In the case of data science, this corner represents the answers you want from your data, in addition to how users should expect to access those answers, and how frequently the answers need to be updated (real-time, hourly, daily, monthly, etc.).
Lastly, there are often numerous ways for this solution to be physically implemented. Each possible implementation requires specific people, intellect (expertise, experience), technology (licenses, support), time, and physical capital (power, space, cooling) to assemble and extend (write algorithms, or build solutions on top of) the desired end-state. There are many factors here to consider. For example, certain software licenses charge by the number of users, which may limit your derived business value (in terms of productivity) if that cost prevents your entire team from leveraging the software. In addition, the more data you have, the more physical or virtual compute resources you may need.
Together, these elements influence the total cost of the solution. Ultimately, cost is the tipping point that can cause you to change the scope of your requirements and timeline, the data you use, the models/techniques you employ, the answers you are able to achieve and the algorithms/technologies you implement. Often, it is necessary to find an affordable balance to achieve the organization’s goals and objectives. However, these trade-offs may cause you to compromise certain business objectives, and reduce the business value derived from the solution.
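The cost side of this framework can be sketched as a simple sum over the factors listed above, compared against what the organization is able to spend. The class and function names are hypothetical, every dollar figure is a made-up placeholder, and treating time as a dollar amount is an assumption made so that the components can be added.

```python
from dataclasses import dataclass, fields


@dataclass
class SolutionCost:
    """Cost components in dollars, named after the factors listed above."""
    people: float
    expertise: float
    technology: float        # licenses and support
    time: float              # assumption: schedule converted to a dollar figure
    physical_capital: float  # power, space, cooling

    def total(self) -> float:
        """Sum every component into one total cost figure."""
        return sum(getattr(self, f.name) for f in fields(self))


def exceeds_budget(cost: SolutionCost, budget: float) -> bool:
    """True when the total cost is more than the organization can spend."""
    return cost.total() > budget


# All figures below are placeholders for illustration, not estimates.
purpose_built = SolutionCost(people=400_000, expertise=150_000,
                             technology=2_500_000, time=200_000,
                             physical_capital=250_000)
general_purpose = SolutionCost(people=500_000, expertise=300_000,
                               technology=400_000, time=300_000,
                               physical_capital=300_000)
budget = 2_000_000

for name, cost in [("purpose-built", purpose_built),
                   ("general-purpose", general_purpose)]:
    print(f"{name}: ${cost.total():,.0f}, over budget: {exceeds_budget(cost, budget)}")
```

With these placeholder numbers the purpose-built option exceeds the budget and the general-purpose option does not, which is the situation in which the text suggests considering process offloading. Real inputs could just as easily point the other way.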
The business value of Hadoop is the result of overcoming the functional limitations established by the cost of scaling purpose-built technologies, and having to make fewer compromises to achieve your data-driven business objectives. This relationship between cost and business value is illustrated in Figure 2. By managing (containing or reducing) cost, it becomes possible to maintain or broaden your scope and implement the solution that is right for you. Hadoop may allow you to get more from your data, with a significantly lower cost investment, resulting in tangible economic value. If Hadoop is able to satisfy your use case, then it is likely you will benefit from cost containment (and possibly savings) by preventing or reducing the expansion of more expensive purpose-built technologies.
It is important to choose the right technology for your particular use case. Hadoop continues to mature as a widely supported open source solution nearing its ten-year anniversary. It is also supported by several commercial vendors offering on-site support. Depending on your particular use case(s), Hadoop may or may not be the best solution. Some Big Data is consistent, known, structured, and aligns well to the use cases best served by purpose-built technologies. However, when you do not have that, or cost constraints limit your business value, it is time to consider using a general-purpose solution like the Hadoop ecosystem for process offloading. The formula presented in this paper provides a lens for CIOs/CTOs to examine potential solutions, business objectives, and cost constraints. Hadoop’s low cost and broad applicability are definitely worth exploring. I recommend you conduct your own cost/benefit analysis to determine if Hadoop is right for you and your use case(s). You may find that relative to commercial products, Hadoop will allow you to achieve greater business value and substantial cost savings.
 O’Reilly Media, Inc., “What is Big Data?”. Big Data Now: 2012 Edition. Sebastopol, CA. October 2012. Found Online at: http://www.oreilly.com/data/free/big-data-now-2012.csp
 Normandeau, Kevin. “Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity”. Inside BigData. September 2013. http://inside-bigdata.com/2013/09/12/beyond-volume-variety-velocity-issu...
 Harris, Derrick. “Teradata plunges 17% on Q3 warning: Is it economics or Hadoop?” Gigaom. October 2013. https://gigaom.com/2013/10/15/teradata-plunges-17-on-q3-warning-is-it-ec...
 Barth, Paul; Bean, Randy. “Get the Maximum Value Out of Your Big Data Initiative”. Harvard Business Review Blog Network. February 2013. http://blogs.hbr.org/2013/02/get-the-maximum-value-out-of-y/
 Ma, Ming. “Hadoop @ eBay Marketplaces”. Slideshare. June 2013. http://www.slideshare.net/Hadoop_Summit/ma-june27-140pmroom212v2
 Murthy, Arun. “Apache Hadoop YARN – Concepts and Applications”. Hortonworks. August 2012. http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/
About the author:
Jeremy Glesner is the Chief Technology Officer of Berico Technologies. Jeremy’s background is in information science and software engineering. Find him on Twitter at @jglesner (https://twitter.com/jglesner) and on Linkedin (http://www.linkedin.com/in/