Friday, September 4, 2015

Spark Overview

Spark is a cluster computing framework, i.e. a framework that ties together multiple machines, their storage devices, and redundant interconnections so that they appear as a single, highly available system.

Spark offers the following features:

- open source, released under the BSD license
- in-memory processing
- multi-language APIs in Scala, Java and Python
- rich set of parallel operators such as map, filter, and reduce (see the sketch after this list)
- runs on Apache Mesos, YARN, and Amazon EC2, or in standalone mode
- best suited to highly iterative jobs
- efficient for interactive data mining jobs
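
To make these features concrete, here is a minimal sketch (not part of the original post) using Spark's Scala API as it stood around 2015, built on SparkContext. It chains parallel operators over an RDD and caches an intermediate result in memory; the HDFS path, log format, and object name are hypothetical placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal, illustrative sketch of Spark's Scala API: chained parallel
// operators plus in-memory caching. The input path is a placeholder.
object FeatureSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs the job on all local cores, handy for testing.
    val conf = new SparkConf().setAppName("FeatureSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // textFile builds a Resilient Distributed Dataset (RDD) partitioned
    // across the cluster; nothing is read until an action runs.
    val lines = sc.textFile("hdfs://namenode:9000/logs/app.log")

    // Parallel operators compose lazily.
    val errors = lines.filter(_.contains("ERROR")).cache() // keep in memory for reuse

    println(s"error lines: ${errors.count()}") // first action materializes and caches

    // A second pass reuses the cached RDD instead of re-reading HDFS.
    val topWords = errors.flatMap(_.split("\\s+"))
                         .map(w => (w, 1))
                         .reduceByKey(_ + _)   // parallel aggregation
                         .take(10)
    println(topWords.mkString(", "))

    sc.stop()
  }
}
```

Note that operators like filter and map are lazy; only the actions count and take trigger computation, and the call to cache() is what lets the second pass skip re-reading the file.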

The Need for Spark

Spark is a result of the rapid, ongoing developments in the Big Data world. "Big Data" describes data characterized by the three V's: Volume, Variety, and Velocity. Storing and processing Big Data calls for specialized frameworks. Hadoop is the predominant established framework, composed of a file system, the Hadoop Distributed File System (HDFS), and a processing engine, MapReduce. Spark is a strong contender to MapReduce: thanks to its in-memory processing capability, it delivers lightning-fast results compared to MapReduce.

When to use Spark?

Beyond common data processing applications, Spark is best applied where in-memory operations make up a major share of the processing. It specializes in the following types of application (a sketch of the first follows the list):

1. Iterative algorithms
2. Interactive data mining
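
As a sketch of why in-memory caching matters for iterative algorithms, the toy gradient descent below fits a single weight w in y = w * x. Each iteration re-scans the same dataset, so caching it avoids re-reading it on every pass. The data is synthetic, and the learning rate and iteration count are arbitrary choices for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A toy iterative job (illustrative only): gradient descent fitting y = w * x.
// The cached RDD is re-scanned on every iteration, which is exactly the
// access pattern where Spark's in-memory processing beats disk-based MapReduce.
object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))

    // Synthetic (x, y) points drawn from y = 3x, cached in memory once.
    val points = sc.parallelize(1 to 1000)
                   .map(i => (i.toDouble, 3.0 * i))
                   .cache()
    val n = points.count()

    var w = 0.0   // the single weight being fitted
    val lr = 1e-6 // arbitrary learning rate for this sketch
    for (_ <- 1 to 50) {
      // Each pass scans the cached RDD rather than re-reading input data.
      val gradient = points.map { case (x, y) => (w * x - y) * x }.sum() / n
      w -= lr * gradient
    }
    println(s"fitted w = $w (true value 3.0)")
    sc.stop()
  }
}
```

Each iteration launches a new parallel job over the same cached data; MapReduce, by contrast, would write and re-read intermediate state on disk between passes.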

Who is using Spark?

Spark was developed at UC Berkeley's AMPLab. Apart from Berkeley, which runs large-scale applications such as spam filtering and traffic prediction on Spark, 14 other companies, including Conviva and Quantifind, have contributed to it.
