By Bob Gourley
December 13, 2014 12:00 PM EST
Since the birth of Hadoop in 2005-06, the way we think about storing and processing information has evolved considerably. The term “Big Data” has become synonymous with this evolution. Still, many of our customers continue to ask, “What is Big Data?”, “What are its use cases?”, and “What is its business value?” The Internet is overloaded with definitions, characteristics, and benefits; however, few discussions synthesize all three of these topics in one place. This paper answers these questions and proposes a total cost calculation framework for CTOs and CIOs who are evaluating solutions for their organization’s use case(s). In the text below, I examine an on-premise Hadoop ecosystem as a general purpose Big Data solution in relation to alternative commercial purpose-built storage technologies (e.g., Oracle, Teradata, IBM, SAP, Microsoft, EMC). It may be difficult to determine the exact point at which you should leverage one over the other. It is my contention that when the total cost of using all your data exceeds what you are able to spend using purpose-built technologies, it is time to consider using a general purpose solution like Hadoop for process offloading.
What is Big Data?
According to Edd Dumbill, a well-respected thought leader and VP of Strategy for Silicon Valley Data Science, a big data and data science consulting company, Big Data is “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” This definition was published in an article entitled “What is Big Data?” in Big Data Now: 2012 Edition by O’Reilly Media, and touches on the three primary characteristics of data:
- Volume: The size of your data. Data has mass, and there is a cost to moving it around the network.
- Velocity: The speed with which the data either arrives or is created, and how quickly it needs to be consumed in order to make use of it.
- Variety: The differences in structure between all the types of data within an enterprise.
Scour the Internet and you will find other, less commonly discussed but relevant characteristics associated with Big Data, to include Veracity and Volatility. Veracity refers to the truthfulness of the data, or your degree of trust in what it is conveying. Volatility is how often your existing data changes or is updated by the new data you are receiving or creating. There are certainly other characteristics as well. I recently began using the term Viscosity to describe the degree of data fragmentation in client environments, and the level of effort required to reassemble it into a coherent view. In this context, organizations with high viscosity have significant fragmentation and duplication throughout their enterprise.
The term “Big Data” has come to focus on these characteristics and to imply that traditional database architectures, such as online transaction processing (OLTP) and online analytical processing (OLAP) purpose-built technologies, simply will not scale to meet your data capacity needs. However, massively parallel processing (MPP) database architectures are one example of purpose-built technologies that have been developed to support both OLTP and OLAP data structures at enormous scales, up into the petabytes (PB). For example, there is a 50 PB MPP cluster at eBay. Certainly this size conforms to any logical definition of Big Data.
Depending on your use case, it is possible that a purpose-built technology may suit your needs at scale. While MPP systems remain most effective with structured, tabular and transactional data sets, it is possible to store most everything except massive files in relational structures. However, this may not be the best fit for your use case(s) in terms of appropriateness or cost. There is limited published pricing data for commercial offerings, but MPP systems are notably expensive. When including the cost of software, hardware, and licensing/support, the cost per terabyte (TB) of an MPP system is estimated at tens of thousands of dollars. At these prices, a one PB (1,000 TB) system can cost tens of millions of dollars. In contrast, the equivalent cost of Hadoop is roughly $2,000 per TB, leaving a one PB Hadoop cluster to cost roughly $2 million. That is a significant initial cost savings; however, use case(s) will always drive the total cost of any solution.
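The arithmetic behind these estimates can be sketched in a few lines of Python. The Hadoop figure of $2,000 per TB comes from the text above; the MPP figure of $20,000 per TB is only an assumption standing in for the low end of "tens of thousands of dollars", and the function name is hypothetical.

```python
# Rough per-petabyte cost comparison using the per-terabyte figures quoted above.
TB_PER_PB = 1_000  # decimal convention: 1 PB = 1,000 TB


def cluster_cost(capacity_tb: float, cost_per_tb: float) -> float:
    """Return the estimated cost of a cluster with the given capacity."""
    return capacity_tb * cost_per_tb


HADOOP_COST_PER_TB = 2_000   # figure quoted in the text
MPP_COST_PER_TB = 20_000     # assumption: low end of "tens of thousands"

hadoop_pb = cluster_cost(TB_PER_PB, HADOOP_COST_PER_TB)
mpp_pb = cluster_cost(TB_PER_PB, MPP_COST_PER_TB)

print(f"Hadoop, 1 PB: ${hadoop_pb:,.0f}")
print(f"MPP,    1 PB: ${mpp_pb:,.0f}")
print(f"Ratio:        {mpp_pb / hadoop_pb:.0f}x")
```

Substituting your own quoted per-TB prices is the point of the exercise; the ratio matters more than either absolute number.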
Coincidentally, eBay has also released information on their production 50 PB Hadoop cluster, one of the largest such clusters in the world. The fact that eBay uses both types of systems demonstrates that there is a place for each, and that the difference may come down to price and purpose. Given the relatively lower cost of Hadoop, I submit that it is easier to identify Big Data if we add cost to our definition. Therefore, Big Data is what you have when the cost implied by (a) the sum of all your data’s characteristics, coupled with (b) the resources required to achieve your use case, exceeds (c) what you are willing or able to spend using traditional approaches. When that inflection point is reached, it is clearly time to consider other, non-traditional approaches for process offloading. Each unique situation warrants a cost/benefit analysis to determine if a general-purpose solution like Hadoop is right for your use case.
What are the Use Cases for Big Data?
Process offloading refers to the act of moving workloads from one implementation to another to achieve better suitability, performance, availability, etc., at a lower price point. Both traditional and non-traditional solutions have advantages and disadvantages given a particular workload, and they should be leveraged accordingly to maximize cost efficiencies. Let us examine Hadoop’s use cases for process offloading.
Hadoop comprises two major components: the Hadoop Distributed File System (HDFS) and MapReduce, a framework for writing applications to process large amounts of content over multiple nodes (servers). Hadoop is often referred to as a schemaless system because data is not forced into a schema upon ingest. Underneath, data is expressed in a structure known as the key/value pair: a collection of [key]->[value] tuples or records. This is one of the most fundamental data structures in computer science. Hadoop uses the key/value pair because nearly any data can be expressed, stored, processed and retrieved using this minimal structure. Because key/value is so rudimentary, a schema can be applied at query time based on the question being asked. This adds tremendous flexibility and differs significantly from traditional approaches like OLTP and OLAP, which require you to define the data model up front and to understand the questions you intend to ask. Figure 1 illustrates these different process flow models. Having to know what questions you intend to ask, and constructing a pre-defined schema around them, adds artificial constraints to the answers you are able to get from the data.
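To make the idea of applying a schema at query time concrete, here is a minimal sketch in plain Python. It does not use any Hadoop API; the record contents, field names, and function names are all hypothetical, and JSON is merely an assumed encoding for the stored values.

```python
import json

# Raw records kept as key/value pairs; values are opaque strings at ingest time.
raw_store = [
    ("evt-001", '{"user": "alice", "action": "view", "item": "book-17", "ms": 420}'),
    ("evt-002", '{"user": "bob", "action": "buy", "item": "book-17", "price": 12.5}'),
    ("evt-003", '{"user": "alice", "action": "buy", "item": "film-03", "price": 8.0}'),
]


def read_with_schema(store, fields):
    """Apply a schema at query time: project each value onto the requested fields."""
    for key, value in store:
        record = json.loads(value)
        yield key, {name: record.get(name) for name in fields}


def purchases_by_user(store):
    """Question 1: how much has each user spent?"""
    totals = {}
    for _, row in read_with_schema(store, ("user", "action", "price")):
        if row["action"] == "buy":
            totals[row["user"]] = totals.get(row["user"], 0.0) + row["price"]
    return totals


def views_by_item(store):
    """Question 2: how often was each item viewed? Uses a different schema."""
    counts = {}
    for _, row in read_with_schema(store, ("item", "action")):
        if row["action"] == "view":
            counts[row["item"]] = counts.get(row["item"], 0) + 1
    return counts


print(purchases_by_user(raw_store))
print(views_by_item(raw_store))
```

Both questions read the same stored records; only the projection chosen at read time differs, so a new question does not require reloading the data into a new model.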
Another issue with schema-based systems is scalability. Traditional relational architectures scale vertically with ease, but are difficult to design for horizontal scaling because their rigid data structures (tables, table relationships, rows, columns, indices) must be sharded, or split, across multiple nodes. The integrity of these structures must be maintained while offering near-real-time (online) create, read, update and delete (CRUD) operations on data. This is not trivial, and it requires commercial companies to make significant financial investments to do it well, which drives up the cost of those solutions. As a schemaless system, the latest release of Hadoop (2.x) scales horizontally to 10,000+ nodes without the added complexity inherent to traditional MPP systems.
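As a rough illustration of what splitting data across nodes involves, the sketch below assigns keys to nodes by hashing. This is a generic partitioning technique chosen for illustration, not a description of how HDFS places blocks or how any particular MPP product shards tables; all names are hypothetical.

```python
import hashlib


def shard_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribute(keys, num_nodes):
    """Group keys by the node that would hold them."""
    nodes = {n: [] for n in range(num_nodes)}
    for key in keys:
        nodes[shard_for(key, num_nodes)].append(key)
    return nodes


def moved_keys(keys, old_nodes: int, new_nodes: int) -> int:
    """Count keys whose node changes when the cluster is resized."""
    return sum(1 for k in keys if shard_for(k, old_nodes) != shard_for(k, new_nodes))


keys = [f"row-{i}" for i in range(1000)]
layout = distribute(keys, 4)
print({node: len(held) for node, held in layout.items()})
print("keys relocated when growing from 4 to 5 nodes:", moved_keys(keys, 4, 5))
```

With independent key/value records, relocation is the main cost of resizing. A relational system must also keep indices, joins and constraints consistent across those same shards, which is where much of the added complexity comes from.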
Many organizations have purpose-built solutions for asking business intelligence questions, providing disaster recovery/backup, etc., but scaling these solutions beyond an initial, narrowly defined usage for structured data usually involves significant cost increases. As a schemaless computational file system, Hadoop can be applied to an almost endless set of challenges at a lower cost. Below we walk through six higher-order use cases to illustrate how these savings can be realized:
1. Raw Storage/Data Lake: Backing up all the data your enterprise collects and creates daily, to include its historical holdings, for continuity of operations (COOP) and disaster recovery (DR) has previously been too expensive, and therefore infeasible. Instead, businesses make difficult tradeoffs as to what will and will not be recoverable should disaster strike. Imagine the possibilities if you were able to economically store everything in your enterprise for the price of traditional commodity hard disks. Hadoop makes this possible with its internally redundant data structure, which by default makes three copies of all data written to HDFS. This scalable, schemaless raw storage lends itself conceptually to what is now being called a “data lake”. A data lake is based on the notion that data can be tagged with metadata about its source, contents, structure and other characteristics. These properties stay with the data as it is minimized into key-value pairs and written to the Hadoop file system. To process the data, all one needs to know is which data to process, selected using these properties. This allows many different types of data to exist side-by-side within the simple structures of the data lake. The amount of pre-processing is minimal, as data is no longer fitted into specific schemas up front, which makes the data usable for a wider variety of purposes. This would not be cost effective using traditional commercial systems.
2. Multi-Format Data Analysis: There are many different types of data beyond structured and unstructured text, to include audio, video, and images. Analyzing structured and unstructured text at scale can be an expensive and difficult challenge, but analyzing large collections of digital media is not even possible using traditional relational systems. Many businesses have previously been unable to unlock the potential of their data holdings due to an inability to process digital content, such as the ability to analyze and track objects in video, or to identify and extract biomarkers in healthcare images. HDFS accepts all these formats for analysis without the need for a schema. Hadoop’s ability to work with unstructured text and binary data (audio, video, imagery) extends well beyond the native capabilities offered by existing storage solutions, providing an enormous capability advantage.
3. Data Cleansing/Transformation: Businesses often contend with multiple relational data models, unstructured text and streaming data. You likely need to correlate, cleanse, de-duplicate, synchronize and normalize/de-normalize these data sets as they move between databases and tools to create a complete, clean operating picture for downstream analysis. The vast majority of work in conducting analytics is often preparing the data for use. In addition, new initiatives to leverage autonomous self-reporting devices and sensors provide continuous streams of data, creating explosions in the amount of information if used in their raw form. Purpose-built technologies present challenges when attempting these types of tasks due to their reliance on schemas. General purpose solutions, like the Hadoop ecosystem, deliver an economical way of storing, pre-processing and/or summarizing these data sets and streams, thereby minimizing the unchecked growth in commercial licensing investments within your enterprise.
4. Data Exploration: When new questions arise, the relevant variables and their relationships must be identified from within your data before you can begin to calculate definitive answers. However, these elements are not always understood, nor are the best algorithms for analyzing the data. Exploration is often required in order to build a model that will answer the questions being asked. Traditional relational architectures with pre-defined schemas are not likely to provide a platform for discovery. In these cases, identifying key variables and useful analytic methods is a trial/error process. Hadoop provides a flexible, schemaless environment that reduces the friction associated with the iterative process of exploring and analyzing data when the model is unclear. Hadoop provides a sandbox for exploring data without having to increase commercial capacity or spend the time building new schemas.
5. Data Science & Personalization: Data science leverages tools and techniques from many different areas of study, to include statistics, machine learning, mathematics, probability/uncertainty modeling, etc., to surface meaning from data and generate data-driven products. This is essentially the art of making data actionable, either by a user or a machine. Data science is not exclusive to Big Data, but there is tremendous knowledge potential in large data sets. One use of data science is personalization, the act of exhaustively analyzing large quantities of related data, such as the online behaviors of millions of Internet users, in order to calculate recommendations for a specific individual. The results are then presented in the form of “you might also like” books, movies, and other targeted advertisements. These techniques are also being applied to healthcare, where symptoms, genetics, treatments, and outcomes are being analyzed to optimize treatment for specific individuals. Hadoop is well suited to collecting, synthesizing, munging, cleaning and joining disparate data sets for analysis to achieve decision-relevant insight.
6. Data Anonymization: Certain industries, perhaps healthcare more than any other, require anonymized data for research. Rules governing the release of such data to the public generally require that the information contain no personally identifiable information (PII). In the case of healthcare specifically, the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, with compliance required beginning in 2003, governs the use and disclosure of protected health information (PHI). The Privacy Rule does not give a specific algorithm for achieving the level of de-identification required, and there are many ways to approach anonymization in general, which depend greatly on how the data will be used. If a portion of your business model relies on providing anonymized data to internal or external groups for analysis, you want that process to be as clear, efficient, and repeatable as possible. Hadoop provides a solution for codifying and institutionalizing these algorithms for your enterprise. This increases the speed and effectiveness of all groups depending on anonymized data, providing them with an approved, documented process and an authoritative source from which to receive data.
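As one illustration of what codifying an anonymization step might look like, the sketch below drops direct identifiers and replaces a patient identifier with a keyed hash so that records for the same person can still be linked. This is an assumed approach for illustration only; it is not a method prescribed by the Privacy Rule, and it is not by itself sufficient for HIPAA de-identification. All field names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical field names treated as direct identifiers and removed outright.
DIRECT_IDENTIFIERS = {"name", "ssn", "email"}


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256) so it stays linkable."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and swap the patient ID for a stable token."""
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS and field != "patient_id"
    }
    cleaned["subject_token"] = pseudonymize(str(record["patient_id"]), secret_key)
    return cleaned


key = b"example-key-kept-outside-the-dataset"  # assumption: managed separately
visit = {
    "patient_id": 1042,
    "name": "Jane Doe",
    "email": "jane@example.com",
    "diagnosis": "J45",
    "outcome": "improved",
}
print(anonymize_record(visit, key))
```

Keeping the rule set in one function, under version control, is what makes the process repeatable: every consuming group receives data transformed in the same documented way.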
This list is not meant to be exhaustive, and there are definitely other use cases. Each use case is applicable to a wide variety of domains, to include finance, cyber, healthcare, defense, and scientific research.
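To give a feel for the cleansing and de-duplication work described in use case 3, here is a small single-machine sketch. The record layout, the choice of email as the matching key, and the function names are assumptions made for illustration; on a cluster the same normalize-then-merge logic would be spread across many nodes.

```python
def normalize(record: dict) -> dict:
    """Bring one raw record into a consistent shape before comparison."""
    return {
        "email": record.get("email", "").strip().lower(),
        "name": " ".join(record.get("name", "").split()).title(),
        "source": record.get("source", "unknown"),
    }


def deduplicate(records):
    """Merge records that share an email address, remembering where each came from."""
    merged = {}
    for rec in map(normalize, records):
        if not rec["email"]:
            continue  # no usable key, so the record cannot be matched
        entry = merged.setdefault(
            rec["email"],
            {"email": rec["email"], "name": rec["name"], "sources": []},
        )
        if rec["source"] not in entry["sources"]:
            entry["sources"].append(rec["source"])
    return list(merged.values())


raw = [
    {"email": " Alice@Example.com ", "name": "alice   SMITH", "source": "crm"},
    {"email": "alice@example.com", "name": "Alice Smith", "source": "billing"},
    {"email": "", "name": "Unknown", "source": "web"},
    {"email": "bob@example.com", "name": "bob jones", "source": "crm"},
]
for row in deduplicate(raw):
    print(row)
```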
What is the Business Value of Big Data?
Our new definition of Big Data (when the cost of using all your data for your use case exceeds what you are able to spend using purpose-built technologies) lends itself to a cost/benefit analysis. Figure 2 establishes a rubric through which to express the decision calculus of Big Data for process offloading. This framework illustrates the components of cost, discussed below, that every CIO and CTO should take into account when evaluating solutions for their use cases.
Projects should always start with gathering and analyzing requirements. In an analytic context, these are the questions you want to ask of your data or, more generally, how you intend to use the data you would like to store in Hadoop. These requirements have obvious implications for leveraging the relevant data assets.
The Data / Characteristics (AS-IS) corner of the triangle refers to all data related to your requirements, and all the attributes discussed earlier, to include the amount, how quickly it grows/changes, differences in type/structure, where it resides on the network, etc.
Once the associated data has been identified, a solution is identified and designed. In the case of data analysis, the solution often involves models and techniques to change and analyze the data to find answers. Overall, this step includes any processes, human or machine, that are necessary to get the results you are looking for.
The Purpose / Answers (TO-BE) corner is your end-state vision, which is sometimes expressed in terms of success criteria and/or key performance indicators. In the case of data science, this corner represents the answers you want from your data, in addition to how users should expect to access those answers, and how frequently the answers need to be updated (real-time, hourly, daily, monthly, etc.).
Lastly, there are often numerous ways for this solution to be physically implemented. Each possible implementation requires specific people, intellect (expertise, experience), technology (licenses, support), time, and physical capital (power, space, cooling) to assemble and extend (write algorithms, or build solutions on top of) the desired end-state. There are many factors here to consider. For example, certain software licenses will charge by the number of users, which may limit your derived business value (in terms of productivity) if that cost prevents your entire team from leveraging the software. Likewise, the more data you have, the more physical or virtual compute resources you may need.
Together, these elements influence the total cost of the solution. Ultimately, cost is the tipping point that can cause you to change the scope of your requirements and timeline, the data you use, the models/techniques you employ, the answers you are able to achieve and the algorithms/technologies you implement. Often, it is necessary to find an affordable balance to achieve the organization’s goals and objectives. However, these trade-offs may cause you to compromise certain business objectives, and reduce the business value derived from the solution.
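The comparison this framework calls for can be sketched as a small calculation. The cost categories follow the components named above (time is assumed here to be folded into the people cost); every figure, option name, and function name below is hypothetical and should be replaced with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class ImplementationCost:
    """Cost components for one candidate implementation (all values in dollars)."""
    name: str
    people: float            # staff and expertise, including time spent
    technology: float        # licenses and support
    physical_capital: float  # power, space, cooling, hardware

    def total(self) -> float:
        return self.people + self.technology + self.physical_capital


def affordable(options, budget: float):
    """Return the names of implementations whose total cost fits the budget."""
    return [option.name for option in options if option.total() <= budget]


# Hypothetical estimates for a single use case.
options = [
    ImplementationCost("purpose-built", people=400_000, technology=2_500_000, physical_capital=300_000),
    ImplementationCost("general-purpose", people=700_000, technology=200_000, physical_capital=400_000),
]
budget = 2_000_000

for option in options:
    print(f"{option.name}: ${option.total():,.0f}")
print("within budget:", affordable(options, budget))
```

When no option fits the budget, the remaining levers are the ones listed above: scope, timeline, data, techniques, or technology.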
The business value of Hadoop is the result of overcoming the functional limitations established by the cost of scaling purpose-built technologies, and having to make fewer compromises to achieve your data-driven business objectives. This relationship between cost and business value is illustrated in Figure 2. By managing (containing or reducing) cost, it becomes possible to maintain or broaden your scope and implement the solution that is right for you. Hadoop may allow you to get more from your data, with a significantly lower cost investment, resulting in tangible economic value. If Hadoop is able to satisfy your use case, then it is likely you will benefit from cost containment (and possibly savings) by preventing or reducing the expansion of more expensive purpose-built technologies.
It is important to choose the right technology for your particular use case. Hadoop continues to mature as a widely supported open source solution nearing its ten-year anniversary. It is also supported by several commercial vendors offering on-site support. Depending on your particular use case(s), Hadoop may or may not be the best solution. Some Big Data is consistent, known, structured, and aligns well to the use cases best served by purpose-built technologies. However, when your data does not fit that description, or cost constraints limit your business value, it is time to consider using a general purpose solution like the Hadoop ecosystem for process offloading. The formula presented in this paper provides a lens for CIOs/CTOs to examine potential solutions, business objectives, and cost constraints. Hadoop’s low cost and broad applicability are definitely worth exploring. I recommend you conduct your own cost/benefit analysis to determine if Hadoop is right for you and your use case(s). You may find that relative to commercial products, Hadoop will allow you to achieve greater business value and substantial cost savings.
 O’Reilly Media, Inc., “What is Big Data?”. Big Data Now: 2012 Edition. Sebastopol, CA. October 2012. Found Online at: http://www.oreilly.com/data/free/big-data-now-2012.csp
 Normandeau, Kevin. “Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity”. Inside BigData. September 2013. http://inside-bigdata.com/2013/09/12/beyond-volume-variety-velocity-issu...
 Harris, Derrick. “Teradata plunges 17% on Q3 warning: Is it economics or Hadoop?” Gigaom. October 2013. https://gigaom.com/2013/10/15/teradata-plunges-17-on-q3-warning-is-it-ec...
 Barth, Paul; Bean, Randy. “Get the Maximum Value Out of Your Big Data Initiative”. Harvard Business Review Blog Network. February 2013. http://blogs.hbr.org/2013/02/get-the-maximum-value-out-of-y/
 Ma, Ming. “Hadoop @ eBay Marketplaces”. Slideshare. June 2013. http://www.slideshare.net/Hadoop_Summit/ma-june27-140pmroom212v2
 Murthy, Arun. “Apache Hadoop YARN – Concepts and Applications”. Hortonworks. August 2012. http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/
About the author:
Jeremy Glesner is the Chief Technology Officer of Berico Technologies. Jeremy’s background is in information science and software engineering. Find him on Twitter at @jglesner (https://twitter.com/jglesner) and on Linkedin (http://www.linkedin.com/in/