By Jeremy Geelan
December 1, 2008 03:00 PM EST
Cloud-based tools, including large-scale data-intensive computing as offered by Hadoop, are key to the rise and rise of cloud computing. In this wide-ranging Exclusive Q&A with SYS-CON's Cloud Computing Journal, the Director of Grid Services at Yahoo! - Rob Weltman - explains to Jeremy Geelan, Conference Chair of SYS-CON's 1st International Cloud Computing Conference & Expo held last week in San Jose, CA, how analyzing and learning from ever-growing volumes of business data is essential to continuously refining and improving service offerings.
Cloud Computing Journal: Yahoo! has been the largest contributor to the Hadoop project and uses Hadoop extensively in its Web search and advertising businesses. Can you explain a little of the background to that?
Rob Weltman: Yahoo! Search (and before it Inktomi) was a pioneer in using large clusters of commodity computers to speed up the crawling and indexing of Web sites. While working on the architecture and design of the next generation of Web Search crawling and indexing, we came in touch with Doug Cutting and the open source Lucene project for text indexing/search. Lucene contained a distributed file system with integrated computation using the map-reduce paradigm. It looked very promising and appropriate for many data-intensive applications. Hadoop was then split out as its own project. Yahoo! supported Hadoop in a big way, both in contributing to its development as an open source project and in applying it to solve many large-scale data/computation problems in the company.
Hadoop has matured at an amazingly fast pace: from a 20-node cluster two years ago to many 2,000-node clusters today; from a somewhat embarrassing terasort (a benchmark) performance to the terasort leader; from no access control to user- and group-owned files and directories. There is now a high-level language - Pig - that allows you to express complex operations on data in an intuitive way and have them translated into Hadoop map-reduce jobs.
In 2007, Hadoop at Yahoo! was used primarily for research - analyzing enormous volumes of data to find the best algorithms and parameters for selecting search results or ads to present to users. Now it is also a central component in many production operations, including Web Search, ad serving, and personalization.
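The map-reduce paradigm mentioned in this answer can be illustrated with a small in-memory sketch. It is not Hadoop code and does not use any Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two tiny "documents".
counts = run_job(["the web is big", "the index is stale"])
```

In a real cluster the map and reduce calls would run in parallel on many machines, with the framework handling the grouping step and any failed tasks; the sketch only shows the shape of the computation.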
Cloud Computing Journal: Are cloud-based tools like Hadoop the most important kinds of tools for the future, do you think?
RW: Being able to add capacity as needed without major software or infrastructure changes is clearly important for many organizations. Sharing resources and dynamically allocating more or less to various functions on demand is highly attractive as companies strive to control costs while their computing needs grow and shift. Analyzing and learning from ever-growing volumes of business data is essential to continuously refining and improving service offerings. The ability to quickly explore new algorithms and put them into production will be a competitive advantage for those with the resources to apply them. All of these speak to the importance of Cloud Computing.
Cloud Computing Journal: How important a role does Java play in the project? Is that because of the need to scale horizontally (and massively)?
RW: Hadoop supports programming and scripting in many languages. Hadoop, itself, is written in Java. The language provides strong support for the central infrastructure needs of system and network programming. There is a large body of experience in developing robust, performance-optimized, scalable platforms in Java.
Java provides portability to many hardware and software environments. However, Hadoop's horizontal scalability is not a result of the choice of language but rather of a design that is strongly focused on fault-tolerance and distribution.
Cloud Computing Journal: Is the Yahoo! Search WebMap still the world's largest Hadoop production application so far as you are aware? Can you share some size data about WebMap with us?
RW: Yes, as far as I know, the Yahoo! WebMap is the largest Hadoop application in production. It uses 2,000+ computers and is still continuously growing. It produces 300TB of data per run, including 1.2 trillion links.
Cloud Computing Journal: How important are Hadoop clusters to Yahoo! overall? Do your Web search queries depend on them?
RW: Hadoop isn't directly involved in responding to queries typed in by users, but it is responsible for much of the backend work that produces the indexes used to service those queries. If the Hadoop clusters were down, the quality of search results would quickly degrade as the indexes became stale.
Cloud Computing Journal: Who else besides Yahoo! uses Hadoop to run large distributed computations?
RW: Many of the major Hadoop users are listed at http://wiki.apache.org/hadoop/PoweredBy. Facebook has several hundred nodes in a cluster for backend processing and analysis. Quantcast has several thousand cores in a very large cluster. Many companies, including AOL, A9 (Amazon), and IBM, have deployed somewhat smaller clusters. It's likely that almost all of the uses involve large quantities of data.
Cloud Computing Journal: Can Hadoop be run on Amazon EC2?
RW: Absolutely! There is a ready-to-run AMI (virtual machine definition for EC2) for Hadoop. Among many others, Powerset (now owned by Microsoft) runs on EC2.
Cloud Computing Journal: What about Sun's Grid Engine - can it also be run on that?
RW: Yes, Hadoop works with Sun's Grid Engine but you lose the benefit of data locality (putting the computation of each piece of a distributed job near the data needed by that piece).
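Data locality, as described in this answer, can be sketched as a toy scheduler that prefers to run each task on a node already holding the task's data. This is a simplified greedy illustration and not Hadoop's actual scheduling logic; the function `assign_tasks`, the block and node names, and the one-slot-per-node assumption are all hypothetical.

```python
def assign_tasks(block_locations, free_nodes):
    """Assign one task per data block, preferring nodes that hold the block.

    block_locations: dict of block id -> list of nodes storing a replica.
    free_nodes: list of nodes with one free task slot each (an assumption).
    Returns a dict of block id -> (node, is_local).
    """
    available = list(free_nodes)
    assignments = {}
    for block, holders in block_locations.items():
        local = [node for node in available if node in holders]
        if local:
            node, is_local = local[0], True
        elif available:
            # No free node holds the data, so it must be read over the network.
            node, is_local = available[0], False
        else:
            break  # no free slots left; remaining blocks stay unassigned
        available.remove(node)
        assignments[block] = (node, is_local)
    return assignments


# Hypothetical cluster state: three blocks, three free nodes.
blocks = {"b1": ["n1", "n2"], "b2": ["n2", "n3"], "b3": ["n1"]}
plan = assign_tasks(blocks, ["n1", "n2", "n4"])
```

Here the task for `b3` lands on `n4`, which holds no replica, so its input would have to cross the network. A scheduler with no knowledge of block placement would be in that position for most tasks, which is the benefit lost in the scenario described above.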
Cloud Computing Journal: Does the Hadoop team have any kind of a blog or forum?
RW: We have a blog at http://developer.yahoo.net/blogs/hadoop/. The team is also heavily engaged in the user and developer Hadoop mailing lists at hadoop.apache.org.
Cloud Computing Journal: Doug Cutting named it after his child's stuffed elephant. Is there any downside to an Enterprise IT tool having the name of a stuffed elephant?
RW: I did get some ribbing during the election period when I wore my Hadoop Summit t-shirt with the elephant on it, but I was able to clarify Hadoop's open source and non-partisan nature.
Cloud Computing Journal: What else have you and your team developed at Yahoo!, in terms of data-analytics applications for example?
RW: The Grid Computing development team at Yahoo! works on the Hadoop core software, the Pig high-level language, the ZooKeeper distributed coordination service, and the Chukwa monitoring and metric analysis system. In addition, it provides various Hadoop add-ons and tools, for example to facilitate joining very large data sets or to understand and improve the performance and efficiency of Hadoop jobs. We provide consulting to application teams that develop large-scale Hadoop programs (often involving feature extraction, modeling, optimization, and index creation) but do not produce them ourselves.