Large-Scale Data Processing Frameworks – What Is Apache Spark?

Apache Spark is an open source engine for large-scale data processing, and it is widely expected to replace Hadoop’s MapReduce. Spark is closely associated with Scala, because the easiest way to begin using it is via the Scala shell, but it also supports Java and Python. The framework originated at UC Berkeley’s AMPLab in 2009. At the time of writing, some four hundred developers from more than fifty companies are building on Spark. It is clearly a huge investment.

A brief description

Apache Spark is a fast, general-purpose cluster computing framework that offers high-level APIs. In memory, it executes programs up to 100 times faster than Hadoop’s MapReduce; on disk, up to 10 times faster. Spark ships with many sample programs written in Java, Python and Scala. It also supports a set of higher-level tools: interactive SQL and structured data processing, MLlib (for machine learning), GraphX (for graph processing) and streaming. Spark introduces a fault-tolerant abstraction for in-memory cluster computing called resilient distributed datasets (RDDs), a restricted form of distributed shared memory. The goal when working with Spark is a concise API for users that still scales to large datasets. Many scripting languages are a poor fit for that combination, whereas Scala manages both, in part because it is statically typed.
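To give a feel for what "fault tolerant" means for an RDD, here is a toy model in plain Python. It is not Spark's API: the class ToyRDD and its methods are made up for illustration. It only shows the general idea that a read-only, partitioned dataset can record how it was derived and rebuild a lost partition from that record instead of keeping replicas.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: read-only, partitioned, aware of its lineage."""

    def __init__(self, num_partitions: int,
                 source: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None) -> None:
        self.num_partitions = num_partitions
        self._source = source    # stands in for input held on stable storage
        self._parent = parent    # lineage: the dataset this one was derived from
        self._fn = fn            # lineage: the per-element transformation applied
        self._in_memory: Dict[int, List[int]] = {}  # partitions currently in memory

    @classmethod
    def from_source(cls, partitions: List[List[int]]) -> "ToyRDD":
        return cls(len(partitions), source=[list(p) for p in partitions])

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never mutates data; it returns a new dataset plus lineage.
        return ToyRDD(self.num_partitions, parent=self, fn=fn)

    def partition(self, i: int) -> List[int]:
        # Rebuild the partition from lineage if it is not in memory.
        if i not in self._in_memory:
            if self._parent is None:
                data = list(self._source[i])
            else:
                data = [self._fn(x) for x in self._parent.partition(i)]
            self._in_memory[i] = data
        return self._in_memory[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a machine failure that wipes one in-memory partition.
        self._in_memory.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_source([[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)   # partition 1 is gone from memory
after = squares.collect()   # it is recomputed from the parent dataset
print(before, after)
```

Only the lost partition is recomputed; partition 0 is served from memory. The restriction to whole-dataset transformations such as map is what keeps the lineage record small in this sketch.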

Usage tips

If you are a developer eager to use Apache Spark for bulk data processing or other work, you should first learn how to use it. The latest documentation, including the programming guide, can be found on the official project website. Start with the README file, then follow the simple setup instructions. It is advisable to download a pre-built package to avoid building Spark from scratch. Those who choose to build Spark themselves will need Apache Maven. A configuration guide is also available. Remember to check out the examples directory, which contains many sample programs that you can run.


Spark runs on Windows, Linux and Mac operating systems. You can run it locally on a single computer as long as Java is installed and available on your system PATH. Spark runs on Java 6+ and Python 2.6+, and uses Scala 2.10.
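Before a local run, it can help to confirm that Java really is on the PATH. The snippet below is a small sketch using only the Python standard library; the helper name find_java is made up for this example and is not part of Spark.

```python
import shutil
from typing import Optional


def find_java() -> Optional[str]:
    """Return the full path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; install Java or update PATH first")
else:
    print(f"java found at {java_path}")
```

This only checks that an executable named java can be located. It does not check which Java version is installed.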

Spark and Hadoop

The two large-scale data processing engines are interrelated. Spark uses Hadoop’s core library to talk to HDFS and other Hadoop-supported storage systems. Hadoop has been around for a long time, and its protocols have changed between versions, so you must build Spark against the same version of Hadoop that your cluster runs. The main innovation behind Spark was to introduce an in-memory caching abstraction. This makes Spark ideal for workloads where multiple operations access the same input data.
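The effect of caching on such workloads can be illustrated with a toy model in plain Python. Everything here is hypothetical (CountingSource, ToyDataset and run_three_operations are not Spark classes); the sketch borrows only the name cache() from Spark's RDD API. It counts how often the "disk" is read when three operations run over the same input, with and without caching.

```python
from typing import List, Optional, Tuple


class CountingSource:
    """Stands in for input on disk and counts how many times it is read."""

    def __init__(self, records: List[int]) -> None:
        self._records = list(records)
        self.reads = 0

    def read(self) -> List[int]:
        self.reads += 1
        return list(self._records)


class ToyDataset:
    """Hypothetical dataset whose operations reload the input unless it is cached."""

    def __init__(self, source: CountingSource) -> None:
        self._source = source
        self._cache_enabled = False
        self._cached: Optional[List[int]] = None

    def cache(self) -> "ToyDataset":
        # Ask for the data to be kept in memory after the first load.
        self._cache_enabled = True
        return self

    def _load(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        data = self._source.read()
        if self._cache_enabled:
            self._cached = data
        return data

    def count(self) -> int:
        return len(self._load())

    def total(self) -> int:
        return sum(self._load())

    def maximum(self) -> int:
        return max(self._load())


def run_three_operations(dataset: ToyDataset) -> Tuple[int, int, int]:
    return dataset.count(), dataset.total(), dataset.maximum()


uncached_source = CountingSource([3, 1, 4, 1, 5])
uncached_result = run_three_operations(ToyDataset(uncached_source))

cached_source = CountingSource([3, 1, 4, 1, 5])
cached_result = run_three_operations(ToyDataset(cached_source).cache())

print("reads without cache:", uncached_source.reads)
print("reads with cache:", cached_source.reads)
```

Both runs produce the same results, but the uncached run reads the input once per operation while the cached run reads it once in total.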

Users can instruct Spark to cache input data sets in memory, so that they do not need to be read from disk for each operation. Spark is thus first and foremost an in-memory technology, which is why it is so much faster. It is also free, being an open source product. Hadoop, by contrast, is complicated and hard to deploy. For instance, different systems must be deployed to support different workloads. In other words, when using Hadoop, you would have to learn a separate system for machine learning, another for graph processing, and so on.

With Spark you find everything you need in one place. Learning one difficult system after another is unpleasant, and Apache Spark spares you that. Each workload you choose to run is supported by a library built on the same core, so you do not have to learn and set up a new system for each one. Three qualities summarize Apache Spark: speed, simplicity and versatility.

Source by Stephen Abbott
