SYS-CON Events announced today that nfina Technologies, a provider of highly reliable cloud server products, will exhibit at SYS-CON's 12th International Cloud Expo, which will take place on June 10–13, 2013, at the Javits Center in New York City, New York.
nfina Technologies develops, manufactures, and markets highly reliable cloud server products, designed to solve the most demanding data center requirements in mission-critical cloud applications. Nfina’s staff has decades of experience in co...| By Srinivasan Sundara Rajan | Article Rating: |
|
| March 14, 2013 11:15 AM EDT | Reads: |
2,891 |
Data Warehouse as a Service
Recently Amazon announced the availability of Redshift Data warehouse as a Service as a beta offering. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It's optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
Architecture Behind Redshift
Any data warehouse service meant to serve data of petabyte scale should have a robust architecture as its backbone. The following are the salient features of Redshift service.
- Shared Nothing Architecture: As indicated in one of my earlier articles, Cloud Database Scale Out Using Shared Nothing Architecture, the shared nothing architectural pattern is the most desired for databases of this scale and the same concept is adhered to in Redshift. The core component of Redshift is a cluster and each cluster consists of multiple compute nodes, each node has its dedicated storage following the shared nothing principle.
- Massively Parallel Processing (MPP): Hand in hand with the shared nothing pattern MPP provides horizontal scale out capabilities for large data warehouses rather than scaling up the individual servers. Massively parallel processing (MPP) enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to the final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data. With the concept of NodeSlices Redshift has taken the MPP to the next level to the cores of a compute node. A compute node is partitioned into slices; one slice for each core of the node's multi-core processor. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.

Refer to the following diagram from AWS Documentation, about Data warehouse system architecture
- Columnar Data Storage: Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk. Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance.
- Leader Node: The leader node manages most communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node distributes compiled code to the compute nodes and assigns a portion of the data to each compute node.
- High Speed Network Connect: The clusters are connected internally by a 10 Gigabit Ethernet network, providing very fast communication between the leader node and the compute clusters.
Best Practices in Application Design on Redshift
The enablement of Big Data analytics through Redshift has created lot of excitement among the community. The usage of these kinds of alternate approaches to traditional data warehousing will be best in conjunction with the best practices for utilizing the features. The following are some of the best practices that can be considered for the design of applications on Redshift.
1. Collocated Tables: It is good practice to try to avoid sending data between the nodes to satisfy JOIN queries. Colocation between two joined tables occurs when the matching rows of the two tables are stored in the same compute nodes, so that the data need not be sent between nodes.
When you add data to a table, Amazon Redshift distributes the rows in the table to the cluster slices using one of two methods:
- Even distribution
- Key distribution
Even distribution is the default distribution method. With even distribution, the leader node spreads data rows across the slices in a round-robin fashion, regardless of the values that exist in any particular column. This approach is a good choice when you don't have a clear option for a distribution key.
If you specify a distribution key when you create a table, the leader node distributes the data rows to the slices based on the values in the distribution key column. Matching values from the distribution key column are stored together.
Colocation is best achieved by choosing the appropriate distribution keys than using the even distribution.
If you frequently join a table, specify the join column as the distribution key. If a table joins with multiple other tables, distribute on the foreign key of the largest dimension that the table joins with. If the dimension tables are filtered as part of the joins, compare the size of the data after filtering when you choose the largest dimension. This ensures that the rows involved with your largest joins will generally be distributed to the same physical nodes. Because local joins avoid data movement, they will perform better than network joins.
2. De-Normalization: In the traditional RDBMS, database storage is optimized by applying the normalization principles such that a particular attribute (column) is associated with one and only entity (Table). However in shared nothing scalable databases like Redshift this technique will not yield the desired results, rather keeping the redundancy of certain columns in the form of de-normalization is very important.
For example, the following query is one of the examples of a high performance query in the Redshift documentation.
SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > ‘1/1/2013'
AND tab2.timestamp > ‘1/1/2013';
Even if a predicate is already being applied on a table in a join query but transitively applies to another table in the query, it's useful to re-specify the redundant predicate if that other table is also sorted on the column in the predicate. That way, when scanning the other table, Redshift can efficiently skip blocks from that table as well.
By carefully applying de-normalization to bring the required redundancy, Amazon Redshift can perform at its best.
3. Native Parallelism: One of the biggest advantages of a shared nothing MPP architecture is about parallelism. Parallelism is achieved in multiple ways.
- Inter Node Parallelism: It refers the ability of the database system to break up a query into multiple parts across multiple instances across the cluster.
- Intra Node Parallelism: Intra node parallelism refers to the ability to break up query into multiple parts within a single compute node.
Typically in MPP architectures, both Inter Node Parallelism and Intra Node Parallelism will be combined and used at the same time to provide dramatic performance gains.
Amazon Redshift provides lot of operations to utilize both Intra Node and Inter Node parallelism.
When you use a COPY command to load data from Amazon S3, first split your data into multiple files instead of loading all the data from a single large file.
The COPY command then loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. Name each file with a common prefix. For example, each XL compute node has two slices, and each 8XL compute node has 16 slices. If you have a cluster with two XL nodes, you might split your data into four files named customer_1, customer_2, customer_3, and customer_4. Amazon Redshift does not take file size into account when dividing the workload, so make sure the files are roughly the same size.
Pre-Processing Data: Over the years RDBMS engines take pride of Location Independence. The Codd's 12 rules of the RDBMS states the following:
Rule 8: Physical data independence:
Changes to the physical level (how the data is stored, whether in arrays or linked lists, etc.) must not require a change to an application based on the structure.
However, in the columnar database services like Redshift the physical ordering of data does make major impact to the performance.
Sorting data is a mechanism for optimizing query performance.
When you create a table, you can define one or more of its columns as the sort key. When data is loaded into the table, the values in the sort key column (or columns) are stored on disk in sorted order. Information about sort key columns is passed to the query planner, and the planner uses this information to construct plans that exploit the way that the data is sorted. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns.
The VACUUM command also makes sure that new data in tables is fully sorted on disk. Vacuum as often as you need to in order to maintain a consistent query performance.
Summary
Platform as a Service (PaaS) is one of the greatest benefits to the IT community due to the Cloud Delivery Model, and from the beginning of pure play programming models like Windows Azure and Elastic Beanstalk it has moved to high-end services like data warehouse Platform as a Service. As the industry analysts see good adoption of the above service due to the huge cost advantages when compared to the traditional data warehouse platform, the best practices mentioned above will help to achieve the desired level of performance. Detailed documentation is also available on the vendor site in the form of developer and administrator guides.
Published March 14, 2013 Reads 2,891
Copyright © 2013 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Srinivasan Sundara Rajan
Srinivasan Sundara Rajan (Also Known As Sundar) Is A Enterprise Technology Enabler for realizing business capabilities. His primary focus is enabling Agile Enterprises by facilitating the adoption of Every Thing As A Service Model with particular concentration on BpaaS (Business Process As A Service). He also helps enterprises in getting meaningful insights from their structured and unstructured and real time data sources. All the views expressed are Srinivasan's independent analysis of industry and solutions and need not necessarily be of his current or past organizations. Srinivasan would like to thank every one who augmented his Architectural skills with Analytical ideas.
SYS-CON Events announced today that nfina Technologies, a provider of highly reliable cloud server products, will exhibit at SYS-CON's 12th International Cloud Expo, which will take place on June 10–13, 2013, at the Javits Center in New York City, New York.
nfina Technologies develops, manufactures, and markets highly reliable cloud server products, designed to solve the most demanding data center requirements in mission-critical cloud applications. Nfina’s staff has decades of experience in co...May. 25, 2013 03:00 PM EDT Reads: 1,285 |
By Liz McMillan SYS-CON Events announced today that OpenStack will exhibit at SYS-CON's 12th International Cloud Expo, which will take place on June 10–13, 2013, at the Javits Center in New York City, New York. OpenStack software controls large pools of compute, storage, and networking resources throughout a datacenter, all managed by a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
OpenStack powers some of the most widely-used SaaS app...May. 25, 2013 02:00 PM EDT Reads: 1,414 |
By Elizabeth White May. 25, 2013 01:00 PM EDT Reads: 1,471 |
By Liz McMillan “Social, mobile, analytics and cloud can’t be looked at as distinct technology trends; they are facets of the same movement and an everyday reality for consumers and businesses alike,” said Craig Sowell, IBM VP of SmartCloud Marketing, in this exclusive Q&A with Cloud Expo Conference Chair Jeremy Geelan. “This means that businesses need to start looking at trends as one: cloud is the delivery, analytics is the unique insight, social is a shareable service, and mobile is the ubiquitous access.”
...May. 25, 2013 01:00 PM EDT Reads: 1,367 |
By Jeremy Geelan With Cloud Expo New York | 12th Cloud Expo [June 10-13, 2013] hurtling towards us, let's take a look at the distinguished individuals in our incredible Speaker Faculty for the technical and strategy sessions at the conference coming up June 10-13 at the Jacob Javits Center in New York City.
We have technical and strategy sessions for you all four days dealing with every nook and cranny of Cloud Computing and Big Data, but what of those who are presenting? Who are they, where do they work, wha...May. 25, 2013 12:00 PM EDT Reads: 21,377 |
By Jeremy Geelan The new open source cloud orchestration platform called OpenStack is the promise of flexible network virtualization, and network overlays are looking closer than ever. The vision of this platform is to enable the on-demand creation of many distinct networks on top of one underlying physical infrastructure in the cloud environment. The platform will support automated provisioning and management of large groups of virtual machines or compute resources, including extensive monitoring in the cloud.May. 25, 2013 12:00 PM EDT Reads: 3,060 |
By Pat Romanski In his session at the 12th International Cloud Expo, Dave Eichorn, Global Data Center Practice Head at Zensar, will share a case study describing how a utility services company handled the migration of its Microsoft platform to the cloud. Challenged with the time-consuming task of opening operations out of temporary offices, this company struggled with the need to simultaneously access data that was accumulated from a vast amount of data-intensive jobs. Zensar migrated the company’s application ...May. 25, 2013 12:00 PM EDT Reads: 1,501 |
By Pat Romanski At pennies per virtual machine-hour, the economics of cloud computing are both compelling and daunting to replicate. Whether you are building your own cloud infrastructure, building a public cloud or choosing a cloud service, there are key strategy and technology decisions that make the difference between success and failure.
In his General Session at the 12th International Cloud Expo, Jason Waxman, VP in the Intel Architecture Group and general manager of the Cloud Platforms Group within Inte...May. 25, 2013 10:00 AM EDT Reads: 952 |
By Elizabeth White You're getting pitched every day from your legacy enterprise software and hardware vendors about "cloud." They're doing an amazing job of convincing your CIO and CTO about what cloud is and how you should use it. The reality is they're defending their shrinking market share and keeping you on the legacy treadmill for as long as they can by selling you solutions that aren't "cloud."
In her session at the 12th International Cloud Expo, Niki Acosta, Cloud Evangelista for Rackspace, will talk thro...May. 25, 2013 10:00 AM EDT Reads: 871 |
By Jeremy Geelan The rise of cloud computing has exposed hard drive-based storage as the new data center bottleneck. Combating this, data center managers have deployed SSDs to gain the performance needed to provide real-time access to data. However, due to budget constraints, many have turned to consumer-grade SSDs without understanding that they wear out quickly when processing enterprise workloads. In this session, Esther Spanjer will discuss recent endurance advancements in SSD technology that enable usage of...May. 25, 2013 10:00 AM EDT Reads: 2,789 |
- Cloud People: A Who's Who of Cloud Computing
- Cloud Expo New York Speaker Profile: Dave Linthicum – Cloud Technology Partners
- Cloud Expo New York: Cloud Is Changing the Economics of Business
- Windows Azure IaaS Reaches General Availability
- Cloud Expo New York Speaker Profile: Nicos Vekiarides – TwinStrata
- AMD and Adobe Collaborate on Upcoming Version of Adobe Premiere Pro Software to Enable Breakthrough Video Editing Performance Through Open Standards
- State and Local Governments Adopt Microsoft Dynamics CRM to Improve Citizen Service Delivery
- Enterasys Spotlights SDN's Impact on Traditional Networking in Upcoming Webinar
- New Relic Q1 2013 Blazes Past Growth Targets and Reaches 40,000 Active Customer Accounts
- Best CIO Practices Shared from SHI’s Customers
- Cloud Expo New York: Deploying Hybrid Cloud for Performance and Uptime
- Cloud Expo New York: Delivering Digital Marketing on the Cloud
- Cloud People: A Who's Who of Cloud Computing
- Cloud Expo New York: Best CIO Practices Shared from SHI’s Customers
- Cloud Expo New York Speaker Profile: Dave Linthicum – Cloud Technology Partners
- Cloud Expo New York: Cloud Is Changing the Economics of Business
- Cloud Expo New York: How to Use Google Apps Script
- Windows Azure IaaS Reaches General Availability
- Cloud Expo New York Speaker Profile: Nicos Vekiarides – TwinStrata
- AMD and Adobe Collaborate on Upcoming Version of Adobe Premiere Pro Software to Enable Breakthrough Video Editing Performance Through Open Standards
- Cloud Computing Bootcamp at Cloud Expo New York
- State and Local Governments Adopt Microsoft Dynamics CRM to Improve Citizen Service Delivery
- Enterasys Spotlights SDN's Impact on Traditional Networking in Upcoming Webinar
- New Relic Q1 2013 Blazes Past Growth Targets and Reaches 40,000 Active Customer Accounts
- The Top 150 Players in Cloud Computing
- What is Cloud Computing?
- Six Benefits of Cloud Computing
- The Top 250 Players in the Cloud Computing Ecosystem
- Twenty-One Experts Define Cloud Computing
- What's the Difference Between Cloud Computing and SaaS?
- The Future of Cloud Computing
- Virtualization Conference Keynote Webcast Live on SYS-CON.TV
- A Brief History of Cloud Computing: Is the Cloud There Yet?
- GDS International: Global Warming Scam?
- Cloud Expo Europe 2009 in Prague: Themes & Topics
- Cloud Computing Expo 2009 West: Call for Papers Now Closed








SYS-CON Events announced today that OpenStack will exhibit at SYS-CON's 12th International Cloud Expo, which will take place on June 10–13, 2013, at the Javits Center in New York City, New York. OpenStack software controls large pools of compute, storage, and networking resources throughout a datacenter, all managed by a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
OpenStack powers some of the most widely-used SaaS app...
“Social, mobile, analytics and cloud can’t be looked at as distinct technology trends; they are facets of the same movement and an everyday reality for consumers and businesses alike,” said Craig Sowell, IBM VP of SmartCloud Marketing, in this exclusive Q&A with Cloud Expo Conference Chair Jeremy Geelan. “This means that businesses need to start looking at trends as one: cloud is the delivery, analytics is the unique insight, social is a shareable service, and mobile is the ubiquitous access.”
...
With Cloud Expo New York | 12th Cloud Expo [June 10-13, 2013] hurtling towards us, let's take a look at the distinguished individuals in our incredible Speaker Faculty for the technical and strategy sessions at the conference coming up June 10-13 at the Jacob Javits Center in New York City.
We have technical and strategy sessions for you all four days dealing with every nook and cranny of Cloud Computing and Big Data, but what of those who are presenting? Who are they, where do they work, wha...
The new open source cloud orchestration platform called OpenStack is the promise of flexible network virtualization, and network overlays are looking closer than ever. The vision of this platform is to enable the on-demand creation of many distinct networks on top of one underlying physical infrastructure in the cloud environment. The platform will support automated provisioning and management of large groups of virtual machines or compute resources, including extensive monitoring in the cloud.
In his session at the 12th International Cloud Expo, Dave Eichorn, Global Data Center Practice Head at Zensar, will share a case study describing how a utility services company handled the migration of its Microsoft platform to the cloud. Challenged with the time-consuming task of opening operations out of temporary offices, this company struggled with the need to simultaneously access data that was accumulated from a vast amount of data-intensive jobs. Zensar migrated the company’s application ...
At pennies per virtual machine-hour, the economics of cloud computing are both compelling and daunting to replicate. Whether you are building your own cloud infrastructure, building a public cloud or choosing a cloud service, there are key strategy and technology decisions that make the difference between success and failure.
In his General Session at the 12th International Cloud Expo, Jason Waxman, VP in the Intel Architecture Group and general manager of the Cloud Platforms Group within Inte...
You're getting pitched every day from your legacy enterprise software and hardware vendors about "cloud." They're doing an amazing job of convincing your CIO and CTO about what cloud is and how you should use it. The reality is they're defending their shrinking market share and keeping you on the legacy treadmill for as long as they can by selling you solutions that aren't "cloud."
In her session at the 12th International Cloud Expo, Niki Acosta, Cloud Evangelista for Rackspace, will talk thro...
The rise of cloud computing has exposed hard drive-based storage as the new data center bottleneck. Combating this, data center managers have deployed SSDs to gain the performance needed to provide real-time access to data. However, due to budget constraints, many have turned to consumer-grade SSDs without understanding that they wear out quickly when processing enterprise workloads. In this session, Esther Spanjer will discuss recent endurance advancements in SSD technology that enable usage of...
Hyper-V Replica is our included asynchronous site-to-site VM replication capability for Windows Server 2012 and our free Hyper-V Server 2012 bare-metal enterprise-grade hypervisor. Using Hyper-V Replica, you can quickly implement a cost-effective disaster recovery plan for your business critical VM...
Although often misunderstood, cloud computing ultimately relies on the same technological underpinnings as traditional server and storage options. While software, platforms and even infrastructure are farmed out to third-party providers, their ability to operate efficiently is constrained by the sam...
Imagine if you could take a time machine five years into the future, so that you would know which of today’s new technologies panned out and which did not.
Most companies have only started using cloud in the past two years. But there are some companies that have been using cloud for five years or...
While movement to the cloud keeps accelerating, fears about security hang on. Let’s take a look at the most common myths about cloud security that might be holding businesses back from taking advantage of the flexibility and scalability of the cloud model.
This is the piece of “common sense” that h...
“The last time I checked, people do not change their social security numbers very often...”
While in constant debate over data encryption and ease of access, I encountered a train of thought that made my jaw drop. A tradeshow attendee suggested encrypting everything, but just use a weak algorithm; ...
Don and I have four children, all of whom have had the fortune to take piano lessons (I'm not sure if the youngest would agree he's fortunate at this point in his life but at five, he's not really able to answer the question with any degree of wisdom, anyway. Come to think of it, not sure the other ...
Our prior post, A Roadmap to High-Value Cloud Infrastructure: Disaster Recovery and Data Protection, discussed both the benefits and limitations of a cloud-based disaster recovery (DR) strategy. As we highlighted last week, traditional disaster recovery options leave open a huge hole: At one extreme...
Online collaboration has evolved during the last decade, delivering even greater value -- thanks to a new generation of business technology applications. Forbes Insights released "Collaborating in the Cloud," a Cisco-sponsored study examining the ways business leaders increasingly look at cloud coll...
New technologies allow schools, colleges and universities to analyze absolutely everything that happens. From student behavior, testing results, career development of students as well as educational needs based on changing societies. A lot of this data has already been stored and is used for statist...
A recent Gartner study states that the function of the modern CIO is in flux and that his or her future focus must incorporate digital assets (aka cloud-based data and applications) to remain relevant. Towards the goal of riding the sea change a compiler of stacks to a broker of business needs, secu...














