Welcome!

Cloud Expo Authors: Pat Romanski, Yeshim Deniz, Elizabeth White, Liz McMillan, Unitiv Blog

Related Topics: Cloud Expo, SOA & WOA, Virtualization, Web 2.0, Apache, Security

Cloud Expo: Article

Arrival of Big Data Opens Up a New Range of Analytics

It's happening: Hadoop and SQL worlds are converging

With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.

The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where taking programmatic approaches are far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships.

It also leverages in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces and others, and, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications where traditional paths to the database would have been I/O-constrained. Conversely Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, sometimes using a Java programmatic framework is a more efficient way to rapidly slice through volumes of data.

But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops.

Until now, Hadoop has not until now been for the SQL-minded. The initial path was, find someone to do data exploration inside Hadoop, but once you’re ready to do repeatable analysis, ETL (or ELT) it into a SQL data warehouse. That’s been the pattern with Oracle Big Data Appliance (use Oracle loader and data integration tools), and most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, like Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potentials of HCatalog, in actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL friendly view of data residing inside Hadoop.

But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach programmatic transformation approaches that in the long run could likely be automated or packaged inside a tool.

At Ovum, we have long believed that for Big Data to crossover to the mainstream enterprise, that it must become a first-class citizen with IT and the data center. The early pattern of skunk works projects, led by elite, highly specialized teams of software engineers from Internet firms to solve Internet-style problems (e.g., ad placement, search optimization, customer online experience, etc.) are not the problems of mainstream enterprises. And neither is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations; such staffing models are not sustainable for mainstream enterprises. It means that Big Data must be consumable by the mainstream of SQL developers.

Making Hadoop more SQL-like is hardly new

Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively; HBase emerged because of the need for a table store to provide a more interactive face – although as a very sparse, rudimentary column store, does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects that did not exactly make Hadoop look like SQL, they provided building blocks from which many of this week’s announcements leverage.

Progress marches on

One train of thought is that if Hadoop can look more like a SQL database, more operations could be performed inside Hadoop. That’s the theme behind Informatica’s long-awaited enhancement of its PowerCenter transformation tool to work natively inside Hadoop. Until now, PowerCenter could extract data from Hadoop, but the extracts would have to be moved to a staging server where the transformation would be performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.

There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of gate with DCA (Data Computing Appliance) that bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software only product that is accompanied by a MapR Hadoop distro).

Teradata Aster has just joined the fray with Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces that make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like another Aster data table.

Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they are from emerging players like Datameer (providing a spreadsheet-like metaphor; which just announced an app store-like marketplace for developers), Karmasphere (providing an application develop tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high performance fractal index).

Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes.

Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes. Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.

Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL 92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple, concurrently, pipelined tasks where, at each step along the way, reads data, processes it, and writes it back to disk and then passes it to the next task.

Conversely, Impala takes a shared nothing, MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera claims roughly 4x performance against MapReduce; if the data is in HBase, Cloudera claims performance multiples up to a factor of 30. For now, Impala only supports row-based views, but with columnar (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.

By contrast, Teradata Aster and Hortonworks promote a SQL MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance, and with the appliance has the opportunity to counter with hardware optimization.

The road to SQL/programmatic convergence

Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists. What’s more important to enterprises is getting the right tool for the job – whether it is the flexibility of SQL or raw power of programmatic approaches.

SQL convergence is the next major battleground for Hadoop. Cloudera is for now shunning HCatalog, an approach backed by Hortonworks and partner Teradata Aster. The open question is whether Hortonworks can instigate a stampede of third parties to overcome Cloudera’s resistance. It appears that beyond Hive, the SQL face of Hadoop will become a vendor-differentiated layer.

Part of conversion will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross train to pick up some of the Java- or Java-like programmatic frameworks that will be emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary.

And for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. But YARN is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce. And that means that YARN is not yet ready for Impala or vice versa.

Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists.

Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; they will need some new skills and grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing a gauntlet to the Apache usual suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance.

On the horizon, EMC Isilon and NetApp are proposing alternatives promising a more efficient file system but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10 – 15 percent utilization. Keep an eye on VMware’s Project Serengeti.

They must be good citizens in data centers that need to maximize resource (e.g., virtualization, optimized storage); must comply with existing data stewardship policies and practices; and must fully support existing enterprise data and platform security practices. These are all topics for another day.

You may also be interested in:

More Stories By Tony Baer

Tony Baer is Principal Analyst with Ovum, leading Ovum’s research on the software lifecycle. Working in concert with other members of Ovum’s software group, his research covers the full lifecycle from design and development to deployment and management. Areas of focus include application lifecycle management, software development methodologies (including agile), SOA, IT service management/ITIL, and IT management/governance.

Baer has been a noted authority on software development platforms and integration architecture for nearly 20 years. Prior to joining Ovum, he was an independent analyst whose company ‘onStrategies’ delivered software development and integration tools to vendors with technology assessment and market positioning services. He also led Computerwire’s CIO Agenda and Computer Finance end-user best practices research services.

Follow him on Twitter @TonyBaer or read his blog site www.onstrategies.com/blog.

Cloud Expo Breaking News
After a couple of false starts, cloud-based desktop solutions are picking up steam, driven by trends such as BYOD and pervasive high-speed connectivity. In his session at 15th Cloud Expo, Seth Bostock, CEO of IndependenceIT, cuts through the hype and the acronyms, and discusses the emergence of full-featured cloud workspaces that do for the desktop what cloud infrastructure did for the server. He’ll discuss VDI vs DaaS, implementation strategies and evaluation criteria.
SYS-CON Events announced today that Esri has been named “Bronze Sponsor” of SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Esri inspires and enables people to positively impact the future through a deeper, geographic understanding of the changing world around them. For more information, visit http://www.esri.com.
Cloud computing started a technology revolution; now DevOps is driving that revolution forward. By enabling new approaches to service delivery, cloud and DevOps together are delivering even greater speed, agility, and efficiency. No wonder leading innovators are adopting DevOps and cloud together! In his session at DevOps Summit, Andi Mann, Vice President of Strategic Solutions at CA Technologies, will explore the synergies in these two approaches, with practical tips, techniques, research data, war stories, case studies, and recommendations.
Cloud Computing is evolving into a Big Three of Amazon Web Services, Google Cloud, and Microsoft Azure. Cloud 360: Multi-Cloud Bootcamp, being held Nov 4–5, 2014, in conjunction with 15th Cloud Expo in Santa Clara, CA, delivers a real-world demonstration of how to deploy and configure a scalable and available web application on all three platforms. The Cloud 360 Bootcamp, led by Janakiram MSV, an analyst with Gigaom Research, is the first bootcamp that introduces the core concepts of Infrastructure as a Service (IaaS) based on the workings of the Big Three platforms – Amazon EC2, Google Compute Engine, and Azure VMs. Bootcamp attendees will get to see the big picture and also receive the knowledge needed to make the best cloud decisions for their business applications and entire enterprise IT organization.
“Distrix fits into the overall cloud and IoT model around software-defined networking. There’s a broad category around software-defined networking that’s focused on data center, and we focus on the WAN,” explained Jay Friedman, President of Distrix, in this SYS-CON.tv interview at the Internet of @ThingsExpo, held June 10-12, 2014, at the Javits Center in New York City. Internet of @ThingsExpo 2014 Silicon Valley, November 4–6, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading IoT industry players in the world.
The Internet of Things promises to transform businesses (and lives), but navigating the business and technical path to success can be difficult to understand. In his session at 15th Internet of @ThingsExpo, Chad Jones, Vice President, Product Strategy of LogMeIn's Xively IoT Platform, will show you how to approach creating broadly successful connected customer solutions using real world business transformation studies including New England BioLabs and more.
Scott Jenson leads a project called The Physical Web within the Chrome team at Google. Project members are working to take the scalability and openness of the web and use it to talk to the exponentially exploding range of smart devices. Nearly every company today working on the IoT comes up with the same basic solution: use my server and you'll be fine. But if we really believe there will be trillions of these devices, that just can't scale. We need a system that is open a scalable and by using the URL as a basic building block, we open this up and get the same resilience that the web enjoys.
“The Internet of Things is a wave that has arrived and it’s growing really fast. The concern at Aria Systems is making sure that people understand the ramifications of their attempts to monetize whatever it is they build on the Internet of Things," explained C Brendan O’Brien, Co-founder and Chief Architect at Aria Systems, in this SYS-CON.tv interview at the Internet of @ThingsExpo, held June 10-12, 2014, at the Javits Center in New York City. Internet of @ThingsExpo 2014 Silicon Valley, November 4–6, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading IoT industry players in the world.
The Internet of Things is a natural complement to the cloud and related technologies such as Big Data, analytics, and mobility. In his session at Internet of @ThingsExpo, Joe Weinman will lay out four generic strategies – digital disciplines – to exploit emerging digital technologies for strategic advantage. Joe Weinman has held executive leadership positions at Bell Labs, AT&T, Hewlett-Packard, and Telx, in areas such as corporate strategy, business development, product management, operations, and R&D.
SYS-CON Events announced today that DevOps.com has been named “Media Sponsor” of SYS-CON's “DevOps Summit at Cloud Expo,” which will take place on June 10–12, 2014, at the Javits Center in New York City, New York. DevOps.com is where the world meets DevOps. It is the largest collection of original content relating to DevOps on the web today Featuring up-to-the-minute news, feature stories, blogs, bylined articles and more, DevOps.com is where the thought leaders of the DevOps movement make their ideas known.
There are 182 billion emails sent every day, generating a lot of data about how recipients and ISPs respond. Many marketers take a more-is-better approach to stats, preferring to have the ability to slice and dice their email lists based numerous arbitrary stats. However, fundamentally what really matters is whether or not sending an email to a particular recipient will generate value. Data Scientists can design high-level insights such as engagement prediction models and content clusters that allow marketers to cut through the noise and design their campaigns around strong, predictive signals, rather than arbitrary statistics. SendGrid sends up to half a billion emails a day for customers such as Pinterest and GitHub. All this email adds up to more text than produced in the entire twitterverse. We track events like clicks, opens and deliveries to help improve deliverability for our customers – adding up to over 50 billion useful events every month. While SendGrid data covers only abo...
SYS-CON Events announced today that the Web Host Industry Review has been named “Media Sponsor” of SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Since 2000, The Web Host Industry Review has made a name for itself as the foremost authority of the Web hosting industry providing reliable, insightful and comprehensive news, reviews and resources to the hosting community. TheWHIR Blogs provides a community of expert industry perspectives. The Web Host Industry Review Magazine also offers a business-minded, issue-driven perspective of interest to executives and decision-makers. WHIR TV offers on demand web hosting video interviews and web hosting video features of the key persons and events of the web hosting industry. WHIR Events brings together like-minded hosting industry professionals and decision-makers in local communities. TheWHIR is an iNET Interactive property.
SYS-CON Events announced today that O'Reilly Media has been named “Media Sponsor” of SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. O'Reilly Media spreads the knowledge of innovators through its books, online services, magazines, and conferences. Since 1978, O'Reilly Media has been a chronicler and catalyst of cutting-edge development, homing in on the technology trends that really matter and spurring their adoption by amplifying "faint signals" from the alpha geeks who are creating the future. An active participant in the technology community, the company has a long history of advocacy, meme-making, and evangelism.
SYS-CON Events announced today that Verizon has been named “Gold Sponsor” of SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Verizon Enterprise Solutions creates global connections that generate growth, drive business innovation and move society forward. With industry-specific solutions and a full range of global wholesale offerings provided over the company's secure mobility, cloud, strategic networking and advanced communications platforms, Verizon Enterprise Solutions helps open new opportunities around the world for innovation, investment and business transformation. Visit verizonenterprise.com to learn more.
SYS-CON Events announced today that TMCnet has been named “Media Sponsor” of SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Technology Marketing Corporation (TMC) is the world's leading business to business and integrated marketing media company, servicing niche markets within the communications and technology industries.