Welcome!

Cloud Expo Authors: Pat Romanski, Carmen Gonzalez, Victoria Livschitz, Larry Dragich, Elizabeth White

Related Topics: Cloud Expo, SOA & WOA, Virtualization, Web 2.0, Apache, Security

Cloud Expo: Article

Arrival of Big Data Opens Up a New Range of Analytics

It's happening: Hadoop and SQL worlds are converging

With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.

The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where taking programmatic approaches are far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships.

It also leverages in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces and others, and, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications where traditional paths to the database would have been I/O-constrained. Conversely Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, sometimes using a Java programmatic framework is a more efficient way to rapidly slice through volumes of data.

But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops.

Until now, Hadoop has not until now been for the SQL-minded. The initial path was, find someone to do data exploration inside Hadoop, but once you’re ready to do repeatable analysis, ETL (or ELT) it into a SQL data warehouse. That’s been the pattern with Oracle Big Data Appliance (use Oracle loader and data integration tools), and most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, like Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potentials of HCatalog, in actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL friendly view of data residing inside Hadoop.

But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach programmatic transformation approaches that in the long run could likely be automated or packaged inside a tool.

At Ovum, we have long believed that for Big Data to crossover to the mainstream enterprise, that it must become a first-class citizen with IT and the data center. The early pattern of skunk works projects, led by elite, highly specialized teams of software engineers from Internet firms to solve Internet-style problems (e.g., ad placement, search optimization, customer online experience, etc.) are not the problems of mainstream enterprises. And neither is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations; such staffing models are not sustainable for mainstream enterprises. It means that Big Data must be consumable by the mainstream of SQL developers.

Making Hadoop more SQL-like is hardly new

Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively; HBase emerged because of the need for a table store to provide a more interactive face – although as a very sparse, rudimentary column store, does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects that did not exactly make Hadoop look like SQL, they provided building blocks from which many of this week’s announcements leverage.

Progress marches on

One train of thought is that if Hadoop can look more like a SQL database, more operations could be performed inside Hadoop. That’s the theme behind Informatica’s long-awaited enhancement of its PowerCenter transformation tool to work natively inside Hadoop. Until now, PowerCenter could extract data from Hadoop, but the extracts would have to be moved to a staging server where the transformation would be performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.

There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of gate with DCA (Data Computing Appliance) that bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software only product that is accompanied by a MapR Hadoop distro).

Teradata Aster has just joined the fray with Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces that make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like another Aster data table.

Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they are from emerging players like Datameer (providing a spreadsheet-like metaphor; which just announced an app store-like marketplace for developers), Karmasphere (providing an application develop tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high performance fractal index).

Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes.

Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes. Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.

Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL 92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple, concurrently, pipelined tasks where, at each step along the way, reads data, processes it, and writes it back to disk and then passes it to the next task.

Conversely, Impala takes a shared nothing, MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera claims roughly 4x performance against MapReduce; if the data is in HBase, Cloudera claims performance multiples up to a factor of 30. For now, Impala only supports row-based views, but with columnar (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.

By contrast, Teradata Aster and Hortonworks promote a SQL MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance, and with the appliance has the opportunity to counter with hardware optimization.

The road to SQL/programmatic convergence

Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists. What’s more important to enterprises is getting the right tool for the job – whether it is the flexibility of SQL or raw power of programmatic approaches.

SQL convergence is the next major battleground for Hadoop. Cloudera is for now shunning HCatalog, an approach backed by Hortonworks and partner Teradata Aster. The open question is whether Hortonworks can instigate a stampede of third parties to overcome Cloudera’s resistance. It appears that beyond Hive, the SQL face of Hadoop will become a vendor-differentiated layer.

Part of conversion will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross train to pick up some of the Java- or Java-like programmatic frameworks that will be emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary.

And for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. But YARN is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce. And that means that YARN is not yet ready for Impala or vice versa.

Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists.

Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; they will need some new skills and grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing a gauntlet to the Apache usual suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance.

On the horizon, EMC Isilon and NetApp are proposing alternatives promising a more efficient file system but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10 – 15 percent utilization. Keep an eye on VMware’s Project Serengeti.

They must be good citizens in data centers that need to maximize resource (e.g., virtualization, optimized storage); must comply with existing data stewardship policies and practices; and must fully support existing enterprise data and platform security practices. These are all topics for another day.

You may also be interested in:

More Stories By Tony Baer

Tony Baer is Principal Analyst with Ovum, leading Ovum’s research on the software lifecycle. Working in concert with other members of Ovum’s software group, his research covers the full lifecycle from design and development to deployment and management. Areas of focus include application lifecycle management, software development methodologies (including agile), SOA, IT service management/ITIL, and IT management/governance.

Baer has been a noted authority on software development platforms and integration architecture for nearly 20 years. Prior to joining Ovum, he was an independent analyst whose company ‘onStrategies’ delivered software development and integration tools to vendors with technology assessment and market positioning services. He also led Computerwire’s CIO Agenda and Computer Finance end-user best practices research services.

Follow him on Twitter @TonyBaer or read his blog site www.onstrategies.com/blog.

@CloudExpo Stories
15th Cloud Expo, which took place Nov. 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA, expanded the conference content of @ThingsExpo, Big Data Expo, and DevOps Summit to include two developer events. IBM held a Bluemix Developer Playground on November 5 and ElasticBox held a Hackathon on November 6. Both events took place on the expo floor. The Bluemix Developer Playground, for developers of all levels, highlighted the ease of use of Bluemix, its services and functionalit...
The 3rd International Internet of @ThingsExpo, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that its Call for Papers is now open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than 20 years ago.
DevOps is all about agility. However, you don't want to be on a high-speed bus to nowhere. The right DevOps approach controls velocity with a tight feedback loop that not only consists of operational data but also incorporates business context. With a business context in the decision making, the right business priorities are incorporated, which results in a higher value creation. In his session at DevOps Summit, Todd Rader, Solutions Architect at AppDynamics, discussed key monitoring techniques...
Some developers believe that monitoring is a function of the operations team. Some operations teams firmly believe that monitoring the systems they maintain is sufficient to run the business successfully. Most of them are wrong. The complexity of today's applications have gone far and beyond the capabilities of "traditional" system-level monitoring tools and approaches and requires much broader knowledge of business and applications as a whole. The goal of DevOps is to connect all aspects of app...
The 4th International DevOps Summit, co-located with16th International Cloud Expo – being held June 9-11, 2015, at the Javits Center in New York City, NY – announces that its Call for Papers is now open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's large...
Want to enable self-service provisioning of application environments in minutes that mirror production? Can you automatically provide rich data with code-level detail back to the developers when issues occur in production? In his session at DevOps Summit, David Tesar, Microsoft Technical Evangelist on Microsoft Azure and DevOps, will discuss how to accomplish this and more utilizing technologies such as Microsoft Azure, Visual Studio online, and Application Insights in this demo-heavy session.
When an enterprise builds a hybrid IaaS cloud connecting its data center to one or more public clouds, security is often a major topic along with the other challenges involved. Security is closely intertwined with the networking choices made for the hybrid cloud. Traditional networking approaches for building a hybrid cloud try to kludge together the enterprise infrastructure with the public cloud. Consequently this approach requires risky, deep "surgery" including changes to firewalls, subnets...
An entirely new security model is needed for the Internet of Things, or is it? Can we save some old and tested controls for this new and different environment? In his session at @ThingsExpo, New York's at the Javits Center, Davi Ottenheimer, EMC Senior Director of Trust, reviewed hands-on lessons with IoT devices and reveal a new risk balance you might not expect. Davi Ottenheimer, EMC Senior Director of Trust, has more than nineteen years' experience managing global security operations and asse...
The Internet of Things will greatly expand the opportunities for data collection and new business models driven off of that data. In her session at @ThingsExpo, Esmeralda Swartz, CMO of MetraTech, discussed how for this to be effective you not only need to have infrastructure and operational models capable of utilizing this new phenomenon, but increasingly service providers will need to convince a skeptical public to participate. Get ready to show them the money!
SYS-CON Media announced that Centrify, a provider of unified identity management across cloud, mobile and data center environments that delivers single sign-on (SSO) for users and a simplified identity infrastructure for IT, has launched an ad campaign on Cloud Computing Journal. The ads focus on security: how an organization can successfully control privilege for all of the organization’s identities to mitigate identity-related risk without slowing down the business, and how Centrify provides ...
The Internet of Things will put IT to its ultimate test by creating infinite new opportunities to digitize products and services, generate and analyze new data to improve customer satisfaction, and discover new ways to gain a competitive advantage across nearly every industry. In order to help corporate business units to capitalize on the rapidly evolving IoT opportunities, IT must stand up to a new set of challenges. In his session at @ThingsExpo, Jeff Kaplan, Managing Director of THINKstrateg...
"We help companies that are using a lot of Software as a Service. We help companies manage and gain visibility into what people are using inside the company and decide to secure them or use standards to lock down or to embrace the adoption of SaaS inside the company," explained Scott Kriz, Co-founder and CEO of Bitium, in this SYS-CON.tv interview at 15th Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
"SAP had made a big transition into the cloud as we believe it has significant value for our customers, drives innovation and is easy to consume. When you look at the SAP portfolio, SAP HANA is the underlying platform and it powers all of our platforms and all of our analytics," explained Thorsten Leiduck, VP ISVs & Digital Commerce at SAP, in this SYS-CON.tv interview at 15th Cloud Expo, held Nov 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
One of the biggest challenges when developing connected devices is identifying user value and delivering it through successful user experiences. In his session at Internet of @ThingsExpo, Mike Kuniavsky, Principal Scientist, Innovation Services at PARC, described an IoT-specific approach to user experience design that combines approaches from interaction design, industrial design and service design to create experiences that go beyond simple connected gadgets to create lasting, multi-device exp...
SAP is delivering break-through innovation combined with fantastic user experience powered by the market-leading in-memory technology, SAP HANA. In his General Session at 15th Cloud Expo, Thorsten Leiduck, VP ISVs & Digital Commerce, SAP, discussed how SAP and partners provide cloud and hybrid cloud solutions as well as real-time Big Data offerings that help companies of all sizes and industries run better. SAP launched an application challenge to award the most innovative SAP HANA and SAP HANA...
Enthusiasm for the Internet of Things has reached an all-time high. In 2013 alone, venture capitalists spent more than $1 billion dollars investing in the IoT space. With "smart" appliances and devices, IoT covers wearable smart devices, cloud services to hardware companies. Nest, a Google company, detects temperatures inside homes and automatically adjusts it by tracking its user's habit. These technologies are quickly developing and with it come challenges such as bridging infrastructure gaps,...
"Cloud consumption is something we envision at Solgenia. That is trying to let the cloud spread to the user as a consumption, as utility computing. We want to allow the people to just pay for what they use, not a subscription model," explained Ermanno Bonifazi, CEO & Founder of Solgenia, in this SYS-CON.tv interview at Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
The Domain Name Service (DNS) is one of the most important components in networking infrastructure, enabling users and services to access applications by translating URLs (names) into IP addresses (numbers). Because every icon and URL and all embedded content on a website requires a DNS lookup loading complex sites necessitates hundreds of DNS queries. In addition, as more internet-enabled ‘Things' get connected, people will rely on DNS to name and find their fridges, toasters and toilets. Acco...
"For the past 4 years we have been working mainly to export. For the last 3 or 4 years the main market was Russia. In the past year we have been working to expand our footprint in Europe and the United States," explained Andris Gailitis, CEO of DEAC, in this SYS-CON.tv interview at Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
Connected devices and the Internet of Things are getting significant momentum in 2014. In his session at Internet of @ThingsExpo, Jim Hunter, Chief Scientist & Technology Evangelist at Greenwave Systems, examined three key elements that together will drive mass adoption of the IoT before the end of 2015. The first element is the recent advent of robust open source protocols (like AllJoyn and WebRTC) that facilitate M2M communication. The second is broad availability of flexible, cost-effective ...