Best Practices for Amazon Redshift

Data Warehouse Analytics as a Service

Data Warehouse as a Service
Recently, Amazon announced the availability of Redshift, its data warehouse as a service, as a beta offering. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.

Architecture Behind Redshift
Any data warehouse service meant to serve petabyte-scale data needs a robust architecture as its backbone. The following are the salient features of the Redshift service.

  • Shared Nothing Architecture: As indicated in one of my earlier articles, Cloud Database Scale Out Using Shared Nothing Architecture, the shared nothing architectural pattern is the most desirable one for databases of this scale, and Redshift adheres to the same concept. The core component of Redshift is a cluster. Each cluster consists of multiple compute nodes, and each node has its own dedicated storage, following the shared nothing principle.
  • Massively Parallel Processing (MPP): Hand in hand with the shared nothing pattern, MPP provides horizontal scale-out for large data warehouses rather than scaling up individual servers. It enables fast execution of even the most complex queries on large amounts of data. Multiple compute nodes handle all query processing leading up to the final result aggregation, with each core of each node executing the same compiled query segments on a portion of the entire data. With the concept of node slices, Redshift extends MPP down to the cores of a compute node. A compute node is partitioned into slices, one slice for each core of the node's multi-core processor. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.

(Refer to the data warehouse system architecture diagram in the AWS documentation.)

  • Columnar Data Storage: Storing database table information in a columnar fashion reduces the number of disk I/O requests and the amount of data you need to load from disk, because a query reads only the columns it references. This reduction in overall disk I/O is an important factor in optimizing analytic query performance.
  • Leader Node: The leader node manages most communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node distributes compiled code to the compute nodes and assigns a portion of the data to each compute node.
  • High Speed Network Connect: The nodes of a cluster are connected internally by a 10 Gigabit Ethernet network, providing very fast communication between the leader node and the compute nodes.
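The division of labor between slices and the leader node can be illustrated with a small simulation. This is only a conceptual sketch in Python, not Redshift code: the function names, the sample rows, and the round-robin placement are assumptions made for illustration. Each slice computes a partial SUM over its own rows, and a final step merges the partial results.

```python
from collections import defaultdict

def assign_to_slices(rows, num_nodes, slices_per_node):
    """Spread rows round-robin over every slice of every node (hypothetical layout)."""
    total_slices = num_nodes * slices_per_node
    slices = [[] for _ in range(total_slices)]
    for i, row in enumerate(rows):
        slices[i % total_slices].append(row)
    return slices

def slice_partial_sum(slice_rows):
    """Work one slice does on its own data: a partial SUM(amount) per region."""
    partial = defaultdict(int)
    for region, amount in slice_rows:
        partial[region] += amount
    return dict(partial)

def leader_merge(partials):
    """Final aggregation: combine the partial results from all slices."""
    total = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)

# Made-up sample data: (region, amount)
rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
slices = assign_to_slices(rows, num_nodes=2, slices_per_node=2)
result = leader_merge(slice_partial_sum(s) for s in slices)
print(result)
```

In a real cluster the slices run concurrently on separate cores; the sketch runs them one after another, which changes the timing but not the result.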

Best Practices in Application Design on Redshift
The enablement of Big Data analytics through Redshift has created a lot of excitement in the community. Alternatives to traditional data warehousing such as this one deliver the most when they are used in line with the best practices for their features. The following are some of the best practices to consider when designing applications on Redshift.

1. Collocated Tables: It is good practice to try to avoid sending data between the nodes to satisfy JOIN queries. Colocation between two joined tables occurs when the matching rows of the two tables are stored in the same compute nodes, so that the data need not be sent between nodes.

When you add data to a table, Amazon Redshift distributes the rows in the table to the cluster slices using one of two methods:

  • Even distribution
  • Key distribution

Even distribution is the default distribution method. With even distribution, the leader node spreads data rows across the slices in a round-robin fashion, regardless of the values that exist in any particular column. This approach is a good choice when you don't have a clear option for a distribution key.

If you specify a distribution key when you create a table, the leader node distributes the data rows to the slices based on the values in the distribution key column. Matching values from the distribution key column are stored together.
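The two distribution methods described above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not Redshift's implementation: the article does not say how key values are mapped to slices, so `zlib.crc32` is used here as a stand-in hash, and the function names and sample rows are hypothetical.

```python
import zlib

def distribute_even(rows, num_slices):
    """Even distribution: round-robin, ignoring the values in the rows."""
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices

def distribute_by_key(rows, key_index, num_slices):
    """Key distribution: rows with the same key value land on the same slice."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        # crc32 is an assumed stand-in for the real hash function
        h = zlib.crc32(str(row[key_index]).encode("utf-8"))
        slices[h % num_slices].append(row)
    return slices

# Made-up sample data: (order_id, customer)
orders = [(1, "alice"), (2, "bob"), (3, "alice"),
          (4, "carol"), (5, "bob"), (6, "alice")]

even = distribute_even(orders, num_slices=4)
keyed = distribute_by_key(orders, key_index=1, num_slices=4)

print([len(s) for s in even])   # slice sizes under even distribution
print([len(s) for s in keyed])  # slice sizes under key distribution
```

Even distribution balances the row counts but scatters each customer's orders. Key distribution keeps each customer's orders together, at the risk of uneven slices when a few key values dominate.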

Colocation is best achieved by choosing appropriate distribution keys rather than by using even distribution.

If you frequently join a table, specify the join column as the distribution key. If a table joins with multiple other tables, distribute on the foreign key of the largest dimension that the table joins with. If the dimension tables are filtered as part of the joins, compare the size of the data after filtering when you choose the largest dimension. This ensures that the rows involved with your largest joins will generally be distributed to the same physical nodes. Because local joins avoid data movement, they will perform better than network joins.
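To see why the choice of distribution key matters for joins, the following Python sketch counts how many matching row pairs end up on different slices. It is an illustration only: the hash (`zlib.crc32`), the helper names, and the sample tables are assumptions, and slices stand in for the nodes discussed above.

```python
import zlib

def slice_for(value, num_slices):
    """Assumed stand-in for mapping a distribution key value to a slice."""
    return zlib.crc32(str(value).encode("utf-8")) % num_slices

def place_by_key(rows, key_index, num_slices):
    """Key distribution: map each row to a slice based on its key column."""
    return {row: slice_for(row[key_index], num_slices) for row in rows}

def place_evenly(rows, num_slices):
    """Even distribution: map each row to a slice round-robin."""
    return {row: i % num_slices for i, row in enumerate(rows)}

def remote_matches(fact_place, dim_place, fact_key, dim_key):
    """Count joined row pairs whose two rows sit on different slices."""
    remote = 0
    for fact_row, fact_slice in fact_place.items():
        for dim_row, dim_slice in dim_place.items():
            if fact_row[fact_key] == dim_row[dim_key] and fact_slice != dim_slice:
                remote += 1
    return remote

# Made-up sample data
customers = [(101, "alice"), (102, "bob"), (103, "carol")]  # (customer_id, name)
orders = [(1, 101), (2, 102), (3, 101), (4, 103), (5, 102)]  # (order_id, customer_id)

dim = place_by_key(customers, key_index=0, num_slices=4)

# Both tables distributed on the join column: every match is local.
colocated = remote_matches(place_by_key(orders, 1, 4), dim, 1, 0)

# Orders distributed evenly: some matches need rows from another slice.
scattered = remote_matches(place_evenly(orders, 4), dim, 1, 0)

print(colocated, scattered)
```

When both tables are distributed on the join column, the count of remote matches is zero by construction. With even distribution of the orders table, orders for the same customer sit on different slices, so some matches necessarily cross slice boundaries.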

2. De-Normalization: In a traditional RDBMS, database storage is optimized by applying normalization principles so that a particular attribute (column) is associated with one and only one entity (table). However, in shared nothing scalable databases like Redshift, this technique will not yield the desired results. Instead, it is very important to keep certain columns redundant through de-normalization.

For example, the following query is one of the high-performance query examples in the Redshift documentation.

SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '1/1/2013'
AND tab2.timestamp > '1/1/2013';

Even if a predicate is already being applied on a table in a join query but transitively applies to another table in the query, it's useful to re-specify the redundant predicate if that other table is also sorted on the column in the predicate. That way, when scanning the other table, Redshift can efficiently skip blocks from that table as well.
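The block skipping mentioned above can be illustrated with a simplified model. The sketch below assumes that each block of a sorted column carries its minimum and maximum value and that a scan can skip any block whose range cannot satisfy the predicate. That bookkeeping, the block size, and the integer day numbers standing in for timestamps are all assumptions for illustration, not a description of Redshift internals.

```python
def build_blocks(sorted_values, block_size):
    """Split sorted values into blocks, recording each block's min and max."""
    blocks = []
    for start in range(0, len(sorted_values), block_size):
        chunk = sorted_values[start:start + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "rows": chunk})
    return blocks

def scan_greater_than(blocks, threshold):
    """Return the matching rows and the number of blocks actually read."""
    matched, blocks_read = [], 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # no row in this block can satisfy value > threshold
        blocks_read += 1
        matched.extend(v for v in block["rows"] if v > threshold)
    return matched, blocks_read

# Day numbers stand in for timestamps; both tables are sorted on this column.
tab1_blocks = build_blocks(list(range(1, 101)), block_size=10)
tab2_blocks = build_blocks(list(range(1, 201)), block_size=10)

rows1, read1 = scan_greater_than(tab1_blocks, 80)
rows2, read2 = scan_greater_than(tab2_blocks, 80)

# Without the redundant predicate on tab2, every tab2 block would be read.
_, read2_without = scan_greater_than(tab2_blocks, float("-inf"))

print(read1, read2, read2_without)
```

In this model the predicate on tab1 alone prunes only tab1's blocks. Repeating it on tab2 lets the scan of tab2 skip its early blocks as well, which is the effect the redundant predicate is meant to achieve.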

By carefully applying de-normalization to bring the required redundancy, Amazon Redshift can perform at its best.

3. Native Parallelism: One of the biggest advantages of a shared nothing MPP architecture is parallelism. Parallelism is achieved in multiple ways.

  • Inter Node Parallelism: This refers to the ability of the database system to break up a query into multiple parts that run on multiple instances across the cluster.
  • Intra Node Parallelism: This refers to the ability to break up a query into multiple parts within a single compute node.

Typically in MPP architectures, both Inter Node Parallelism and Intra Node Parallelism will be combined and used at the same time to provide dramatic performance gains.

Amazon Redshift provides many operations that utilize both Intra Node and Inter Node parallelism.

When you use a COPY command to load data from Amazon S3, first split your data into multiple files instead of loading all the data from a single large file.

The COPY command then loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. For example, each XL compute node has two slices, and each 8XL compute node has 16 slices. If you have a cluster with two XL nodes, you might split your data into four files. Name each file with a common prefix, such as customer_1, customer_2, customer_3, and customer_4. Amazon Redshift does not take file size into account when dividing the workload, so make sure the files are roughly the same size.
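The splitting rule can be sketched as a small helper. This is a hypothetical Python function, not part of any AWS tool: it keeps the parts in memory as a dictionary rather than writing files, and it uses row counts as a rough proxy for file size.

```python
def split_into_files(rows, num_slices, prefix, multiple=1):
    """Split rows into num_slices * multiple parts of near-equal size.

    Returns a dict mapping a file name (common prefix plus index) to its rows.
    """
    num_files = num_slices * multiple
    base, extra = divmod(len(rows), num_files)
    files, start = {}, 0
    for i in range(num_files):
        # The first `extra` files take one additional row each
        size = base + (1 if i < extra else 0)
        files[f"{prefix}{i + 1}"] = rows[start:start + size]
        start += size
    return files

# Two XL nodes with two slices each, as in the example above.
total_slices = 2 * 2
files = split_into_files(list(range(10)), total_slices, prefix="customer_")

for name, part in files.items():
    print(name, len(part))
```

The `multiple` parameter covers the case where you want more than one file per slice while keeping the file count a multiple of the slice count.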

4. Pre-Processing Data: Over the years, RDBMS engines have taken pride in physical data independence. Codd's 12 rules for relational databases state the following:

Rule 8: Physical data independence:

Changes to the physical level (how the data is stored, whether in arrays or linked lists, etc.) must not require a change to an application based on the structure.

However, in columnar database services like Redshift, the physical ordering of data has a major impact on performance.

Sorting data is a mechanism for optimizing query performance.

When you create a table, you can define one or more of its columns as the sort key. When data is loaded into the table, the values in the sort key column (or columns) are stored on disk in sorted order. Information about sort key columns is passed to the query planner, and the planner uses this information to construct plans that exploit the way that the data is sorted. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns.
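The reason presorted data makes a merge join attractive is that both inputs can be read once, in order, without building a hash table. The following Python function is a textbook merge join written for illustration; it is not Redshift's implementation, and the sample tables are made up.

```python
def merge_join(left, right, key=lambda row: row[0]):
    """Join two lists that are already sorted on the join key."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = key(left[i]), key(right[j])
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows that share this key
            j_end = j
            while j_end < len(right) and key(right[j_end]) == left_key:
                j_end += 1
            # Pair every left row carrying this key with that run
            while i < len(left) and key(left[i]) == left_key:
                for right_row in right[j:j_end]:
                    result.append((left[i], right_row))
                i += 1
            j = j_end
    return result

# Both inputs are sorted on their first column, the join key.
customers = [(1, "alice"), (2, "bob"), (4, "dave")]
orders = [(1, "book"), (1, "pen"), (2, "lamp"), (3, "desk")]

joined = merge_join(customers, orders)
for pair in joined:
    print(pair)
```

If either input were unsorted, the function would silently miss matches, which is why the sort order has to be maintained on disk for this join strategy to remain available.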

The VACUUM command also makes sure that new data in tables is fully sorted on disk. Vacuum as often as you need to in order to maintain consistent query performance.

Summary
Platform as a Service (PaaS) is one of the greatest benefits the cloud delivery model has brought to the IT community. From its beginnings in pure-play programming models such as Windows Azure and Elastic Beanstalk, it has moved on to high-end services such as data warehouse Platform as a Service. Industry analysts see good adoption of this service because of its large cost advantage over traditional data warehouse platforms, and the best practices described above will help achieve the desired level of performance. Detailed documentation is also available on the vendor site in the form of developer and administrator guides.

More Stories By Srinivasan Sundara Rajan

Highly passionate about utilizing Digital Technologies to enable next generation enterprise. Believes in enterprise transformation through the Natives (Cloud Native & Mobile Native).
