|By Srinivasan Sundara Rajan||
|March 14, 2013 11:15 AM EDT||
Data Warehouse as a Service
Recently Amazon announced the availability of Redshift Data warehouse as a Service as a beta offering. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It's optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
Architecture Behind Redshift
Any data warehouse service meant to serve data of petabyte scale should have a robust architecture as its backbone. The following are the salient features of Redshift service.
- Shared Nothing Architecture: As indicated in one of my earlier articles, Cloud Database Scale Out Using Shared Nothing Architecture, the shared nothing architectural pattern is the most desired for databases of this scale and the same concept is adhered to in Redshift. The core component of Redshift is a cluster and each cluster consists of multiple compute nodes, each node has its dedicated storage following the shared nothing principle.
- Massively Parallel Processing (MPP): Hand in hand with the shared nothing pattern MPP provides horizontal scale out capabilities for large data warehouses rather than scaling up the individual servers. Massively parallel processing (MPP) enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to the final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data. With the concept of NodeSlices Redshift has taken the MPP to the next level to the cores of a compute node. A compute node is partitioned into slices; one slice for each core of the node's multi-core processor. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
Refer to the following diagram from AWS Documentation, about Data warehouse system architecture
- Columnar Data Storage: Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk. Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance.
- Leader Node: The leader node manages most communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node distributes compiled code to the compute nodes and assigns a portion of the data to each compute node.
- High Speed Network Connect: The clusters are connected internally by a 10 Gigabit Ethernet network, providing very fast communication between the leader node and the compute clusters.
Best Practices in Application Design on Redshift
The enablement of Big Data analytics through Redshift has created lot of excitement among the community. The usage of these kinds of alternate approaches to traditional data warehousing will be best in conjunction with the best practices for utilizing the features. The following are some of the best practices that can be considered for the design of applications on Redshift.
1. Collocated Tables: It is good practice to try to avoid sending data between the nodes to satisfy JOIN queries. Colocation between two joined tables occurs when the matching rows of the two tables are stored in the same compute nodes, so that the data need not be sent between nodes.
When you add data to a table, Amazon Redshift distributes the rows in the table to the cluster slices using one of two methods:
- Even distribution
- Key distribution
Even distribution is the default distribution method. With even distribution, the leader node spreads data rows across the slices in a round-robin fashion, regardless of the values that exist in any particular column. This approach is a good choice when you don't have a clear option for a distribution key.
If you specify a distribution key when you create a table, the leader node distributes the data rows to the slices based on the values in the distribution key column. Matching values from the distribution key column are stored together.
Colocation is best achieved by choosing the appropriate distribution keys than using the even distribution.
If you frequently join a table, specify the join column as the distribution key. If a table joins with multiple other tables, distribute on the foreign key of the largest dimension that the table joins with. If the dimension tables are filtered as part of the joins, compare the size of the data after filtering when you choose the largest dimension. This ensures that the rows involved with your largest joins will generally be distributed to the same physical nodes. Because local joins avoid data movement, they will perform better than network joins.
2. De-Normalization: In the traditional RDBMS, database storage is optimized by applying the normalization principles such that a particular attribute (column) is associated with one and only entity (Table). However in shared nothing scalable databases like Redshift this technique will not yield the desired results, rather keeping the redundancy of certain columns in the form of de-normalization is very important.
For example, the following query is one of the examples of a high performance query in the Redshift documentation.
SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > ‘1/1/2013'
AND tab2.timestamp > ‘1/1/2013';
Even if a predicate is already being applied on a table in a join query but transitively applies to another table in the query, it's useful to re-specify the redundant predicate if that other table is also sorted on the column in the predicate. That way, when scanning the other table, Redshift can efficiently skip blocks from that table as well.
By carefully applying de-normalization to bring the required redundancy, Amazon Redshift can perform at its best.
3. Native Parallelism: One of the biggest advantages of a shared nothing MPP architecture is about parallelism. Parallelism is achieved in multiple ways.
- Inter Node Parallelism: It refers the ability of the database system to break up a query into multiple parts across multiple instances across the cluster.
- Intra Node Parallelism: Intra node parallelism refers to the ability to break up query into multiple parts within a single compute node.
Typically in MPP architectures, both Inter Node Parallelism and Intra Node Parallelism will be combined and used at the same time to provide dramatic performance gains.
Amazon Redshift provides lot of operations to utilize both Intra Node and Inter Node parallelism.
When you use a COPY command to load data from Amazon S3, first split your data into multiple files instead of loading all the data from a single large file.
The COPY command then loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. Name each file with a common prefix. For example, each XL compute node has two slices, and each 8XL compute node has 16 slices. If you have a cluster with two XL nodes, you might split your data into four files named customer_1, customer_2, customer_3, and customer_4. Amazon Redshift does not take file size into account when dividing the workload, so make sure the files are roughly the same size.
Pre-Processing Data: Over the years RDBMS engines take pride of Location Independence. The Codd's 12 rules of the RDBMS states the following:
Rule 8: Physical data independence:
Changes to the physical level (how the data is stored, whether in arrays or linked lists, etc.) must not require a change to an application based on the structure.
However, in the columnar database services like Redshift the physical ordering of data does make major impact to the performance.
Sorting data is a mechanism for optimizing query performance.
When you create a table, you can define one or more of its columns as the sort key. When data is loaded into the table, the values in the sort key column (or columns) are stored on disk in sorted order. Information about sort key columns is passed to the query planner, and the planner uses this information to construct plans that exploit the way that the data is sorted. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns.
The VACUUM command also makes sure that new data in tables is fully sorted on disk. Vacuum as often as you need to in order to maintain a consistent query performance.
Platform as a Service (PaaS) is one of the greatest benefits to the IT community due to the Cloud Delivery Model, and from the beginning of pure play programming models like Windows Azure and Elastic Beanstalk it has moved to high-end services like data warehouse Platform as a Service. As the industry analysts see good adoption of the above service due to the huge cost advantages when compared to the traditional data warehouse platform, the best practices mentioned above will help to achieve the desired level of performance. Detailed documentation is also available on the vendor site in the form of developer and administrator guides.
SYS-CON Media announced that Splunk, a provider of the leading software platform for real-time Operational Intelligence, has launched an ad campaign on Big Data Journal. Splunk software and cloud services enable organizations to search, monitor, analyze and visualize machine-generated big data coming from websites, applications, servers, networks, sensors and mobile devices. The ads focus on delivering ROI - how improved uptime delivered $6M in annual ROI, improving customer operations by minin...
Jan. 31, 2015 02:00 PM EST Reads: 5,941
The move in recent years to cloud computing services and architectures has added significant pace to the application development and deployment environment. When enterprise IT can spin up large computing instances in just minutes, developers can also design and deploy in small time frames that were unimaginable a few years ago. The consequent move toward lean, agile, and fast development leads to the need for the development and operations sides to work very closely together. Thus, DevOps become...
Jan. 31, 2015 02:00 PM EST Reads: 4,510
Puppet Labs on Wednesday released the DevOps Salary Report, based on salary data gathered from Puppet Labs' industry-recognized State of DevOps Report. The data confirms that market demand for DevOps skills is growing, and that DevOps engineers are among the highest paid IT practitioners today. That's because IT organizations today are grappling with how to be more agile and responsive to the business, while maintaining the stability of their infrastructure. DevOps practices, such as continuous ...
Jan. 31, 2015 02:00 PM EST Reads: 2,475
CloudBees, Inc., has announced a $23.5 million financing round, led by longtime CloudBees investor Lightspeed Venture Partners. Existing investors Matrix Partners, Verizon Ventures and Blue Cloud Ventures also participated in the round. The latest funding announcement follows earlier rounds of $4 million, $10.5 million and $10.8 million, bringing the total investment in CloudBees to just under $50 million since the company’s inception in 2010. Previous venture investment rounds were led by Ma...
Jan. 31, 2015 02:00 PM EST Reads: 1,463
“We are strong believers in the DevOps movement and our staff has been doing DevOps for large enterprise environments for a number of years. The solution that we build is intended to allow DevOps teams to do security at the speed of DevOps," explained Justin Lundy, Founder & CTO of Evident.io, in this SYS-CON.tv interview at DevOps Summit, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
Jan. 31, 2015 02:00 PM EST Reads: 3,605
The cloud is becoming the de-facto way for enterprises to leverage common infrastructure while innovating and one of the biggest obstacles facing public cloud computing is security. In his session at 15th Cloud Expo, Jeff Aliber, a global marketing executive at Verizon, discussed how the best place for web security is in the cloud. Benefits include: Functions as the first layer of defense Easy operation –CNAME change Implement an integrated solution Best architecture for addressing network-l...
Jan. 31, 2015 01:30 PM EST Reads: 3,683
DevOps Summit 2015 New York, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that it is now accepting Keynote Proposals. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development cycles that produce software that is obsolete...
Jan. 31, 2015 01:15 PM EST Reads: 4,460
In this scenarios approach Joe Thykattil, Technology Architect & Sales at TimeWarner / Navisite, presented examples that will allow business-savvy professionals to make informed decisions based on a sound business model. This model covered the technology options in detail as well as a financial analysis. The TCO (Total Cost of Ownership) and ROI (Return on Investment) demonstrated how to start, develop and formulate a business case that will allow both small and large scale projects to achieve...
Jan. 31, 2015 01:00 PM EST Reads: 3,621
IBM has announced a new strategic technology services agreement with Anthem, Inc., a health benefits company in the U.S. IBM has been selected to provide operational services for Anthem's mainframe and data center server and storage infrastructure for the next five years. Among the benefits of the relationship, Anthem has the ability to leverage IBM Cloud solutions that will help increase the ease, availability and speed of adding infrastructure to support new business requirements.
Jan. 31, 2015 01:00 PM EST Reads: 2,598
Today, IT is not just a cost center. IT is an enabler and driver of business. With the emergence of the hybrid cloud paradigm, IT now has increasingly more capabilities to create new strategic opportunities for a business. Hybrid cloud allows an organization to utilize multi-tenant public clouds, dedicated private clouds, bare metal hosting, and the associated support and services for the right use cases through an on-demand, XaaS model. This model of IT creates tremendous opportunities for busi...
Jan. 31, 2015 01:00 PM EST Reads: 2,633
Cloud computing started a technology revolution; now DevOps is driving that revolution forward. By enabling new approaches to service delivery, cloud and DevOps together are delivering even greater speed, agility, and efficiency. No wonder leading innovators are adopting DevOps and cloud together! In his session at DevOps Summit, Andi Mann, Vice President of Strategic Solutions at CA Technologies, explored the synergies in these two approaches, with practical tips, techniques, research data, wa...
Jan. 31, 2015 01:00 PM EST Reads: 4,376
Software AG and Wipro Ltd. have announced a joint solution platform for streaming analytics that provides real-time actionable intelligence for the Internet of Things (IoT) market. “The key to successfully addressing the IoT market is the ability to rapidly build and evolve apps that tap into, analyze and make smart decisions on fast, big data”, said John Bates, Global Head of Industry Solutions and CMO, Software AG. To address the huge market potential created by streaming analytics in conj...
Jan. 31, 2015 01:00 PM EST Reads: 1,878
Appcore deploys cloud for service providers based on the Apache Cloud set. In this demo at 15th Cloud Expo, Nate Gordon, Director of Technology at Appcore, shows their new product that's coming out in January - Appcore Atlas, which is focused on deploying private clouds based on CloudStack in 15 minutes or less. Our upcoming June 9-11, 2015, event in New York City will present a total of 10 simultaneous tracks (the largest conference content in the world) by an all-star faculty, over three days...
Jan. 31, 2015 01:00 PM EST Reads: 2,627
The term culture has had a polarizing effect among DevOps supporters. Some propose that culture change is critical for success with DevOps, but are remiss to define culture. Some talk about a DevOps culture but then reference activities that could lead to culture change and there are those that talk about culture change as a set of behaviors that need to be adopted by those in IT. There is no question that businesses successful in adopting a DevOps mindset have seen departmental culture change, ...
Jan. 31, 2015 12:45 PM EST Reads: 3,998
In a world of ever-accelerating business cycles and fast-changing client expectations, the cloud increasingly serves as a growth engine and a path to new business models. Dynamic clouds enable businesses to continuously reinvent themselves, adapting their business processes, their service and software delivery and their operations to achieve speed-to-market and quick response to customer feedback. As the cloud evolves, the industry has multiple competing cloud technologies, offering on-premises ...
Jan. 31, 2015 12:15 PM EST Reads: 3,910
SYS-CON Events announced today that that Innodisk, the service-driven provider of industrial embedded flash and DRAM storage products and technologies, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Innodisk is a service-driven provider of industrial embedded flash and DRAM storage products and technologies. With satisfied customers across the embedded, aerospace and defense, cloud storage markets an...
Jan. 31, 2015 12:00 PM EST Reads: 1,362
Connected devices and the Internet of Things are getting significant momentum in 2014. In his session at Internet of @ThingsExpo, Jim Hunter, Chief Scientist & Technology Evangelist at Greenwave Systems, examined three key elements that together will drive mass adoption of the IoT before the end of 2015. The first element is the recent advent of robust open source protocols (like AllJoyn and WebRTC) that facilitate M2M communication. The second is broad availability of flexible, cost-effective ...
Jan. 31, 2015 12:00 PM EST Reads: 4,664
The Internet of Things will put IT to its ultimate test by creating infinite new opportunities to digitize products and services, generate and analyze new data to improve customer satisfaction, and discover new ways to gain a competitive advantage across nearly every industry. In order to help corporate business units to capitalize on the rapidly evolving IoT opportunities, IT must stand up to a new set of challenges. In his session at @ThingsExpo, Jeff Kaplan, Managing Director of THINKstrateg...
Jan. 31, 2015 11:45 AM EST Reads: 4,722
"Our premise is Docker is not enough. That's not a bad thing - we actually love Docker. At ActiveState all our products are based on open source technology and Docker is an up-and-coming piece of open source technology," explained Bart Copeland, President & CEO of ActiveState Software, in this SYS-CON.tv interview at DevOps Summit at Cloud Expo®, held Nov 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
Jan. 31, 2015 11:45 AM EST Reads: 4,294
Eighty-five percent of companies store information in some sort of unstructured manner. In this demo at 15th Cloud Expo, Mark Fronczak, Product Manager at Solgenia, discussed their enterprise content management solution, which was created to help companies organize and take control of their digital assets.
Jan. 31, 2015 11:30 AM EST Reads: 3,017