Welcome!

@CloudExpo Authors: Stefan Bernbo, Todd Matters, Elizabeth White, Liz McMillan, ManageEngine IT Matters

Related Topics: @BigDataExpo, @CloudExpo, @ThingsExpo

@BigDataExpo: Blog Feed Post

Data Unification at Scale | @CloudExpo #BigData #DataLake #AI #Analytics

This term Data Unification is new in the Big Data lexicon, pushed by varieties of companies

This term Data Unification is new in the Big Data lexicon, pushed by varieties of companies such as Talend, 1010Data, and TamR. Data unification deals with the domain known as ETL (Extraction, Transformation, Loading), initiated during the 1990s when Data Warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transform it to fit operational needs (based on business rules), and load it into end target databases, more specifically, an operational data store, data mart, or a data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retroactive (e.g. how many shoppers between age 25-35 bought this item between May and July?). This was like driving a car looking at the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands "predictive analytics" and "streaming analytics".

During my IBM and Oracle days, the ETL in the first phase was left for outside companies to address. This was unglamorous work and key vendors were not that interested to solve this. This gave rise to many new players such as Informatica, Datastage, Talend and it became quite a thriving business. We also see many open-source ETL companies.

The ETL methodology consisted of: constructing a global schema in advance, for each local data source write a program to understand the source and map to the global schema, then write a script to transform, clean (homonym and synonym issues) and dedup (get rid of duplicates) it. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is used today for data unification problems. The term MDM (Master Data Management) points to a master representation of all enterprise objects, to which everybody agrees to confirm.

In the world of Big Data, this approach is very inadequate. Why?

  • Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources,..), but gets extremely hard with sources that can be hundreds or even thousands. This gets worse when you want to unify public data from the web with enterprise data.
  • Human labor to map each source to a master schema gets to be costly and excessive. Here machine learning is required and domain experts should be asked to augment where needed.
  • Real-time data unification of streaming data and analysis can not be handled by these solutions.

Another solution called "data lake" where you store disparate data in their native format, seems to address the "ingest" problem only. It tries to change the order of ETL to ELT (first load then transform). However it does not address the scale issues. The new world needs bottoms-up data unification (schema-last) in real-time or near real-time.

The typical data unification cycle can go like this - start with a few sources, try enriching the data with say X, see if it works, if you fail then loop back and try again. Use enrichment to improve and do everything automatically using machine learning and statistics. But iterate furiously. Ask for help when needed from domain experts. Otherwise the current approach of ETL or ELT can get very expensive.

  • LikeData Unification at scale
  • Comment
  • ShareShare Data Unification at scale



Read the original blog entry...

More Stories By Jnan Dash

Jnan Dash is Senior Advisor at EZShield Inc., Advisor at ScaleDB and Board Member at Compassites Software Solutions. He has lived in Silicon Valley since 1979. Formerly he was the Chief Strategy Officer (Consulting) at Curl Inc., before which he spent ten years at Oracle Corporation and was the Group Vice President, Systems Architecture and Technology till 2002. He was responsible for setting Oracle's core database and application server product directions and interacted with customers worldwide in translating future needs to product plans. Before that he spent 16 years at IBM. He blogs at http://jnandash.ulitzer.com.

@CloudExpo Stories
@DevOpsSummit at Cloud Expo taking place Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center, Santa Clara, CA, is co-located with the 21st International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is ...
You know you need the cloud, but you’re hesitant to simply dump everything at Amazon since you know that not all workloads are suitable for cloud. You know that you want the kind of ease of use and scalability that you get with public cloud, but your applications are architected in a way that makes the public cloud a non-starter. You’re looking at private cloud solutions based on hyperconverged infrastructure, but you’re concerned with the limits inherent in those technologies.
A look across the tech landscape at the disruptive technologies that are increasing in prominence and speculate as to which will be most impactful for communications – namely, AI and Cloud Computing. In his session at 20th Cloud Expo, Curtis Peterson, VP of Operations at RingCentral, highlighted the current challenges of these transformative technologies and shared strategies for preparing your organization for these changes. This “view from the top” outlined the latest trends and developments i...
The current age of digital transformation means that IT organizations must adapt their toolset to cover all digital experiences, beyond just the end users’. Today’s businesses can no longer focus solely on the digital interactions they manage with employees or customers; they must now contend with non-traditional factors. Whether it's the power of brand to make or break a company, the need to monitor across all locations 24/7, or the ability to proactively resolve issues, companies must adapt to...
"We focus on composable infrastructure. Composable infrastructure has been named by companies like Gartner as the evolution of the IT infrastructure where everything is now driven by software," explained Bruno Andrade, CEO and Founder of HTBase, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
"Tintri focuses on the Ops side of the DevOps, which basically is pushing more and more of the accessibility of the infrastructure to the developers and trying to get behind the scenes," explained Dhiraj Sehgal of Tintri in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
Hardware virtualization and cloud computing allowed us to increase resource utilization and increase our flexibility to respond to business demand. Docker Containers are the next quantum leap - Are they?! Databases always represented an additional set of challenges unique to running workloads requiring a maximum of I/O, network, CPU resources combined with data locality.
For organizations that have amassed large sums of software complexity, taking a microservices approach is the first step toward DevOps and continuous improvement / development. Integrating system-level analysis with microservices makes it easier to change and add functionality to applications at any time without the increase of risk. Before you start big transformation projects or a cloud migration, make sure these changes won’t take down your entire organization.
Cloud promises the agility required by today’s digital businesses. As organizations adopt cloud based infrastructures and services, their IT resources become increasingly dynamic and hybrid in nature. Managing these require modern IT operations and tools. In his session at 20th Cloud Expo, Raj Sundaram, Senior Principal Product Manager at CA Technologies, will discuss how to modernize your IT operations in order to proactively manage your hybrid cloud and IT environments. He will be sharing bes...
Artificial intelligence, machine learning, neural networks. We’re in the midst of a wave of excitement around AI such as hasn’t been seen for a few decades. But those previous periods of inflated expectations led to troughs of disappointment. Will this time be different? Most likely. Applications of AI such as predictive analytics are already decreasing costs and improving reliability of industrial machinery. Furthermore, the funding and research going into AI now comes from a wide range of com...
In this presentation, Striim CTO and founder Steve Wilkes will discuss practical strategies for counteracting fraud and cyberattacks by leveraging real-time streaming analytics. In his session at @ThingsExpo, Steve Wilkes, Founder and Chief Technology Officer at Striim, will provide a detailed look into leveraging streaming data management to correlate events in real time, and identify potential breaches across IoT and non-IoT systems throughout the enterprise. Strategies for processing massive ...
SYS-CON Events announced today that GrapeUp, the leading provider of rapid product development at the speed of business, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Grape Up is a software company, specialized in cloud native application development and professional services related to Cloud Foundry PaaS. With five expert teams that operate in various sectors of the market acr...
SYS-CON Events announced today that Ayehu will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on October 31 - November 2, 2017 at the Santa Clara Convention Center in Santa Clara California. Ayehu provides IT Process Automation & Orchestration solutions for IT and Security professionals to identify and resolve critical incidents and enable rapid containment, eradication, and recovery from cyber security breaches. Ayehu provides customers greater control over IT infras...
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 21st Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devic...
SYS-CON Events announced today that MobiDev, a client-oriented software development company, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. MobiDev is a software company that develops and delivers turn-key mobile apps, websites, web services, and complex software systems for startups and enterprises. Since 2009 it has grown from a small group of passionate engineers and business...
What's the role of an IT self-service portal when you get to continuous delivery and Infrastructure as Code? This general session showed how to create the continuous delivery culture and eight accelerators for leading the change. Don Demcsak is a DevOps and Cloud Native Modernization Principal for Dell EMC based out of New Jersey. He is a former, long time, Microsoft Most Valuable Professional, specializing in building and architecting Application Delivery Pipelines for hybrid legacy, and cloud ...
With major technology companies and startups seriously embracing Cloud strategies, now is the perfect time to attend 21st Cloud Expo October 31 - November 2, 2017, at the Santa Clara Convention Center, CA, and June 12-14, 2018, at the Javits Center in New York City, NY, and learn what is going on, contribute to the discussions, and ensure that your enterprise is on the right path to Digital Transformation.
21st International Cloud Expo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Me...
Join us at Cloud Expo June 6-8 to find out how to securely connect your cloud app to any cloud or on-premises data source – without complex firewall changes. More users are demanding access to on-premises data from their cloud applications. It’s no longer a “nice-to-have” but an important differentiator that drives competitive advantages. It’s the new “must have” in the hybrid era. Users want capabilities that give them a unified view of the data to get closer to customers and grow business. The...
SYS-CON Events announced today that Cloud Academy named "Bronze Sponsor" of 21st International Cloud Expo which will take place October 31 - November 2, 2017 at the Santa Clara Convention Center in Santa Clara, CA. Cloud Academy is the industry’s most innovative, vendor-neutral cloud technology training platform. Cloud Academy provides continuous learning solutions for individuals and enterprise teams for Amazon Web Services, Microsoft Azure, Google Cloud Platform, and the most popular cloud com...