Object Storage for Big Unstructured Data

There are two kinds of Big Data: Big Data (for analytics) and Big Unstructured Data

Big Data is big, but it also causes a lot of confusion. The term gets applied to almost anything related to storage these days, so people no longer know what it actually means. Is it Hadoop? Is it analytics? It doesn’t need to be that complicated, though. There are two kinds of Big Data: Big Data (for analytics) and Big Unstructured Data.

Big Data for analytics is a paradigm that became popular in the previous decade. Much of the early innovation came out of research projects: new technology enabled researchers in many different domains to capture data in ways they had never been able to before. In agriculture, for example, ploughs were fitted with sensors that send small bits of information to a central system over satellite. Every couple of feet, these sensors measure what is in the ground (minerals, for example), how humid the soil is, and so on. Based on that, large agriculture companies can make better decisions about where to grow which crop.

The problem was that the traditional systems for storing this information (relational databases) were no longer adequate for such massive volumes of small records. Frameworks like Hadoop, built on the MapReduce model, were created as an alternative; they store these massive volumes of small records concatenated into “Big” files. Big Data was born: Big Data for semi-structured data.

Today we are seeing a similar trend with unstructured data. Studies show that data storage requirements will grow by a factor of 30 over the next decade, and 80% of that data will be large files: office documents, movies, music, pictures. Just as relational databases were in the previous decade, traditional storage – the file system – is no longer the best way to store this data. File systems will not scale sufficiently, and they will effectively become obsolete as applications take over the role of the file system.

Google Picasa is a nice example of what this looks like in practice: in the old days we would store pictures neatly organized in a file system (hopefully with some backups) – one folder per year, one per month, one per holiday or party. Today, we just dump all the pictures into one folder and Picasa sorts them for us by date, location, face recognition (!) or other metadata. With an intelligent query, we can display the right pictures very fast, much faster than by browsing the file system. We don’t even have to worry about backups, as copies can be stored in the cloud automatically.
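
To make the contrast with folder hierarchies concrete, here is a toy Python sketch of the metadata-query idea; the photo records and fields below are invented for illustration, and a real application would index far richer metadata.

```python
# A toy, in-memory sketch of metadata-driven photo lookup (records invented):
# the application, not a folder tree, decides how pictures are found.
photos = [
    {"id": "img_001", "taken": "2011-07-14", "location": "Paris", "faces": ["Anna"]},
    {"id": "img_002", "taken": "2011-07-15", "location": "Paris", "faces": ["Tom"]},
    {"id": "img_003", "taken": "2011-12-24", "location": "Ghent", "faces": ["Anna", "Tom"]},
]

def find_photos(location=None, face=None):
    """Return the ids of all photos matching the given metadata filters."""
    return [p["id"] for p in photos
            if (location is None or p["location"] == location)
            and (face is None or face in p["faces"])]

print(find_photos(location="Paris", face="Anna"))  # ['img_001']
```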

The new paradigm that will help us store these massive amounts of unstructured data is Object Storage. Object Storage systems are uniformly scalable pools of storage that are accessible through a REST interface. Files – objects – are dumped into the pool, and an identifier is kept to locate each object when it is needed. Applications designed to run on top of object storage use these identifiers through the REST interface. A good analogy is valet parking versus self-parking: when you self-park, you have to remember the lot, the floor, the aisle and so on (the file system); with valet parking, you get a receipt when you hand over your keys and later use that receipt to get your car back.
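
As a sketch of that valet-parking flow, here is what storing and retrieving an object through a generic REST interface can look like in Python. The endpoint, bucket and object names are hypothetical; real object stores differ in authentication and URL layout.

```python
# A minimal sketch of the "valet parking" flow against a generic object
# store's REST interface. The endpoint and names below are hypothetical.
import requests

ENDPOINT = "https://objectstore.example.com/v1/my-bucket"  # hypothetical

# "Hand over the keys": PUT the object, keep only its identifier.
object_id = "report-2012-q1.pdf"
with open("report.pdf", "rb") as f:
    resp = requests.put(f"{ENDPOINT}/{object_id}", data=f)
resp.raise_for_status()

# Later, "show the receipt": GET the object back by identifier alone.
resp = requests.get(f"{ENDPOINT}/{object_id}")
resp.raise_for_status()
data = resp.content  # the stored bytes; no path, folder or volume involved
```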

So what is needed to build an object storage system? Basically just lots of disks, a REST API and a way to provide durability. Durability could be provided by traditional schemes like RAID, but RAID requires a huge amount of overhead to reach acceptable availability. The more data we store, the more painful it is to need 200% overhead, as some systems do. The smarter way to provide durability for object storage is erasure coding.
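
To see why that overhead hurts at scale, a back-of-the-envelope comparison helps; the k and m values below are a hypothetical erasure-coding policy, not any vendor's defaults.

```python
# Back-of-the-envelope durability overhead, with illustrative numbers only.
# Triple replication keeps three full copies of every object: 200% overhead.
copies = 3
print(f"{copies}-way replication overhead: {(copies - 1) * 100}%")  # 200%

# An erasure code splits an object into k sub-blocks and stores k + m
# "equation" blocks; any k of them suffice to rebuild the object.
k, m = 10, 6  # hypothetical policy: survive any 6 simultaneous disk failures
print(f"{k}+{m} erasure code overhead: {m / k * 100:.0f}%")  # 60%
```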

Erasure coding stores objects as equations that are spread over the entire storage pool. Data objects are split into sub-blocks, from which equations are calculated. According to the availability policy, a surplus of equations is generated, and these equations are spread over as many disks as possible (how many is also policy-defined). As a result, when a disk breaks, the system always has sufficient equations left to restore the original data block, and it can recalculate the missing equations as a background task to bring the number of available equations back to a healthy level. A pioneer of this technology is Amplidata, which uses low-power Atom processors in its hardware to reduce power costs. Because the entire system – all storage nodes – can recalculate missing equations in the background, Amplidata found it unnecessary to use the high-end nodes that RAID systems need to speed up rebuilds and avoid performance losses.
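
As a toy illustration of “storing objects as equations,” here is a single-parity sketch in Python. This is not Amplidata’s actual scheme, which tolerates many simultaneous failures under a configurable policy; the simplest possible equation, one XOR parity block, survives the loss of any one sub-block.

```python
# A minimal sketch of the erasure-coding idea using the simplest possible
# "equation": one XOR parity block over k sub-blocks. Real deployments use
# more general codes; this toy version survives the loss of any one block.

def encode(data: bytes, k: int) -> list:
    """Split data into k padded sub-blocks and append one XOR parity block."""
    size = -(-len(data) // k)  # ceiling division
    blocks = [bytearray(data[i * size:(i + 1) * size].ljust(size, b"\0"))
              for i in range(k)]
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return blocks + [parity]  # k + 1 blocks, spread over k + 1 disks

def rebuild(blocks: list, lost: int) -> bytearray:
    """Recompute the block at index `lost` by XOR-ing every surviving block."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    rebuilt = bytearray(len(survivors[0]))
    for block in survivors:
        for i, byte in enumerate(block):
            rebuilt[i] ^= byte
    return rebuilt

blocks = encode(b"big unstructured data", k=4)
assert rebuild(blocks, lost=2) == blocks[2]  # the background repair task
```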

Apart from providing a more efficient and more scalable way to store data, erasure-coding-based object storage can save up to 70% on overall TCO, thanks to reduced raw storage needs and reduced power needs (less hardware and low-power devices save on both power and cooling). Uniformly scalable storage systems with an automated healing mechanism also drastically reduce management effort and cost.

So what are the use cases for object storage? As data needs grow, object storage will become the storage paradigm of choice in more and more environments, but we already see the need today in a number of situations:

Building live archives
Object storage enables companies to reactivate their data. Currently, most companies see data as a burden more than anything else: it will probably never be used again, but it needs to be archived for any number of reasons. Yet this data actually has a lot of value. With live archives, employees get faster access to older data and can put those valuable resources to use. With traditional storage, building disk-based archives for this purpose would never be achievable, as the overhead would make it too costly.

Online applications
Most data-intensive online – cloud – applications are built on public cloud services such as Amazon S3, which are early implementations of Object Storage. The benefits for application providers are plenty: a simple programming interface, low cost and fast time to market. As their data sets grow, these companies may move to private Object Storage implementations to reduce costs even further.
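
To illustrate how simple that programming interface is, here is a put/get round trip against S3 using the boto3 SDK; the bucket, key and local file are invented for the example, and credentials are assumed to be configured in the environment.

```python
# A minimal sketch of the simple programming interface, using Amazon S3's
# Python SDK (boto3). Bucket and key names below are hypothetical.
import boto3

s3 = boto3.client("s3")

# Store an object: one call, no volumes, LUNs or file-system mounts involved.
with open("avatar.png", "rb") as f:
    s3.put_object(Bucket="my-app-assets", Key="uploads/avatar-42.png", Body=f)

# Retrieve it anywhere, by bucket and key alone.
obj = s3.get_object(Bucket="my-app-assets", Key="uploads/avatar-42.png")
image_bytes = obj["Body"].read()
```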

Media and entertainment
Traditionally, the M&E industry has been very file-oriented, but we are seeing growing interest in object storage, both to improve efficiency and reduce costs and because this industry is already hitting the limits of its file systems.

These are just a few examples of Object Storage implementations for Big Unstructured Data. Object Storage was not built to replace any of the current storage architectures. Much like NAS filers were designed in the ’90s because block storage (SAN was designed when databases were king) was not optimized for unstructured data, Object Storage will find its place next to those two for Big Unstructured Data.

More Stories By Tom Leyden

Tom Leyden is VP Product Marketing at Scality. Scality was founded in 2009 by a team of entrepreneurs and technologists. The idea wasn’t storage, per se. When the Scality team talked to the initial base of potential customers, the customers wanted a system that could “route” data to and from individual users in the most scalable, efficient way possible. And so began a non-traditional approach to building a storage system that no one had imagined before. No one thought an object store could have enough performance for all the files and attachments of millions of users. No one thought a system could remain up and running through software upgrades, hardware failures, capacity expansions, and even multiple hardware generations coexisting. And no one believed you could do all this and scale to petabytes of content and billions of objects in pure software.
