Apache Spark vs. Hadoop | @CloudExpo #BigData #DevOps #Microservices

If you’re running Big Data applications, you’re going to want to look at some kind of distributed processing system. Hadoop is one of the best-known clustering systems, but how are you going to process all your data in a reasonable time frame? Apache Spark offers services that go beyond a standard MapReduce cluster.

A choice of job styles
MapReduce has become a standard, perhaps the standard, for distributed data processing. While it’s a capable system, it’s geared toward batch use: jobs queue up and deliver their output later. This can severely hamper your flexibility. What if you want to explore some of your data interactively? If a query is going to take all night, forget about it.

With Apache Spark, you can act on your data in whatever way you want. Want to look for interesting tidbits in your data? You can perform some quick queries. Want to run something you know will take a long time? You can use a batch job. Want to process your data streams in real time? You can do that too.

One of the biggest advantages of modern programming languages is the interactive shell. Sure, Lisp did that back in the ‘60s, but it was a long time before that kind of interactive programming became available to the average programmer. Spark ships with interactive shells for Python and Scala, so you can try out your ideas in real time and develop algorithms iteratively, without the time-consuming write/compile/test/debug cycle.

RDDs
The key to Spark’s flexibility is the Resilient Distributed Dataset, or RDD. An RDD maintains a lineage of everything that’s done to your data: a record of the transformations, such as map or join, that produced it from other datasets. These transformations are coarse-grained, applying to the whole dataset rather than to individual records, which keeps the lineage compact. This means that it’s possible to recover from failures by rebuilding from these transformations (which is why they’re called Resilient Distributed Datasets).
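
To make the idea of lineage concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation: the ToyDataset class, its fields, and the sample readings are all hypothetical names invented for illustration. Each transformation only records a new step, and the result can be rebuilt at any time by replaying the recorded steps from the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyDataset:
    """Toy stand-in for an RDD (hypothetical): stores how to build data, not the data."""
    name: str
    parent: Optional["ToyDataset"] = None
    source: tuple = ()
    step: Optional[Callable[[list], list]] = None

    def map(self, fn):
        # Record the transformation as a new node; nothing is computed here.
        return ToyDataset("map", parent=self,
                          step=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        # Same idea: a new node that remembers its parent and its step.
        return ToyDataset("filter", parent=self,
                          step=lambda rows: [r for r in rows if pred(r)])

    def lineage(self):
        # Walk back to the source and list the recorded steps in order.
        node, steps = self, []
        while node is not None:
            steps.append(node.name)
            node = node.parent
        return list(reversed(steps))

    def compute(self):
        # Rebuild the result by replaying every recorded step from the source.
        if self.parent is None:
            return list(self.source)
        return self.step(self.parent.compute())


readings = ToyDataset("source", source=(3, 8, 15, 4))
doubled_large = readings.map(lambda x: x * 2).filter(lambda x: x > 10)

print(doubled_large.lineage())  # ['source', 'map', 'filter']
print(doubled_large.compute())  # [16, 30]
```

Because the result is never the only copy of the information, calling compute() again after a failure gives the same answer. That is the property the lineage is there to provide.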

RDDs can also keep data in memory, which is a lot faster than always pulling data off of disks, even with SSDs making their way into data centers. While loading a large dataset into memory might seem like a recipe for slow performance, Spark uses lazy evaluation, only applying transformations to data when you specifically ask for the result. This is why you can get query results so quickly even on very large datasets.

You might recognize the term “lazy evaluation” from functional programming languages like Haskell. RDDs are only computed when an action produces some kind of output, for example, saving results to a text file. You can define a complex query over your data, but it won’t actually be evaluated until you ask for the result. And the query might only need a specific subset of your data instead of plowing through the whole thing. Lazy evaluation lets you build up complex queries on large datasets without paying for the computation until you need the result.
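
Lazy evaluation is easy to see in a small plain-Python sketch. This is a toy model built on Python iterators, not Spark's API: the LazyPipeline class, the parse helper, and the sample input are hypothetical. Transformations are only queued up, and an action such as take pulls just enough rows through the queue to answer the question.

```python
import itertools


class LazyPipeline:
    """Toy model of lazy evaluation (hypothetical): queue transformations, run on action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new pipeline, evaluates nothing.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new pipeline, evaluates nothing.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        # Chain the queued steps as lazy iterators over the source.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return rows

    def take(self, n):
        # Action: pull only as many rows as are needed for n results.
        return list(itertools.islice(self._iterate(), n))

    def collect(self):
        # Action: evaluate the whole pipeline.
        return list(self._iterate())


calls = []  # records every line that actually gets parsed


def parse(line):
    calls.append(line)
    return int(line)


pipeline = LazyPipeline(["1", "20", "300", "4000"]).map(parse).filter(lambda n: n > 10)
print(len(calls))  # 0: defining the query evaluated nothing

first = pipeline.take(1)
print(first)       # [20]
print(calls)       # ['1', '20']: the last two lines were never touched
```

The query found its answer after parsing two of the four lines, which is the "specific subset of your data" effect described above.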

RDDs are also immutable, which leads to greater protection against data loss even though they’re in memory. In case of an error, Spark can go back through an RDD’s lineage and recompute the lost data from there rather than relying solely on a checkpoint-based system on a disk.
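
Here is a rough sketch of that recovery idea in plain Python, under stated assumptions: the data is split into three immutable partitions, the lineage is a fixed list of two steps, and the in-memory results live in an ordinary dictionary. All of the names (source_partitions, lineage, cache, get_partition) are hypothetical, and this is a toy illustration rather than how Spark is implemented.

```python
# Immutable source partitions: tuples cannot be modified in place.
source_partitions = (
    (1, 2, 3),
    (4, 5, 6),
    (7, 8, 9),
)

# Lineage: the ordered transformations applied to every partition.
lineage = (
    lambda rows: tuple(r * 10 for r in rows),            # scale each value
    lambda rows: tuple(r for r in rows if r % 20 == 0),  # keep multiples of 20
)


def compute_partition(index):
    # Replay the lineage against one source partition.
    rows = source_partitions[index]
    for step in lineage:
        rows = step(rows)
    return rows


# In-memory results for all partitions.
cache = {i: compute_partition(i) for i in range(len(source_partitions))}

# Simulate a failure that wipes out the in-memory copy of partition 1.
del cache[1]


def get_partition(index):
    # Recompute only the missing partition; the others are left alone.
    if index not in cache:
        cache[index] = compute_partition(index)
    return cache[index]


recovered = get_partition(1)
print(recovered)  # (40, 60)
```

Because the source partitions never change, replaying the same steps always produces the same result, so only the lost piece needs to be rebuilt.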

Spark and Hadoop, Not as Different as You Think
Speaking of disks, you might be wondering whether Spark replaces a Hadoop cluster. That’s really a false dichotomy. Hadoop and Spark work together. While Spark provides the processing, Hadoop handles the actual storage (HDFS) and resource management (YARN). After all, you can’t keep data in memory forever.

With Spark and Hadoop combined in the same cluster, you can cut down on a lot of the overhead of maintaining separate clusters. The combined cluster can scale out for Big Data operations as you add nodes.

Who’s Using Spark?
When you have your Big Data cluster in place, you’ll be able to do lots of interesting things. Uses range from genome sequencing analysis to digital advertising. A major credit card company uses Spark to check thousands of transactions at once for possible fraud. Cisco does something similar with a cloud-based security product to spot possible hacking before it turns into a major data breach. Geneticists use it to match genes to new medicines.

Conclusion
Apache Spark builds on Hadoop and then goes beyond it by adding stream processing capabilities. The MapR distribution is the only one that offers everything you need right out of the box to enable real-time data processing.

For a more in-depth view into how Spark and Hadoop benefit from each other, read chapter four of the free interactive ebook: Getting Started with Apache Spark: From Inception to Production, by James A. Scott.

More Stories By Jim Scott

Jim has held positions running Operations, Engineering, Architecture and QA teams in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.
