Spark is a cluster computing framework i.e. a framework which uses multiple workstations, multiple storage devices, and redundant interconnections, to form an abstract single highly available system.
Spark has been imparted the following features:
- open-source under BSD licence
- in-memory processing
- multi-language APIs in Scala, Java and Python
- rich array of parallel operators
- runnable on Apache Mesos, YARN, Amazon EC2 or in standalone mode
- best suitable for highly iterative jobs
- efficient for interactive data mining jobs
Spark is a result of the fleeting ongoing developments in the Big-Data world. "Big-Data", a term that can be used to describe data which has got 3 V's to it, Volume, Variety and Velocity. To store and process Big-Data, specialized frameworks were in demand. Hadoop is a predominantly established software framework, composed of a file-system called Hadoop Distributed File System(HDFS) and Map-Reduce. Spark is a strong contender of Map-Reduce. Due to its in-memory processing capability, Spark offers lightening fast results as compared to Map-Reduce.
Apart from using Spark for common data processing applications, it should be used specifically for applications where in-memory operations are a major percentage to the processing. The following are the type of applications in which Spark specializes:
1. Iterative algorithms
2. Interactive data mining
UC Berkeley AMPLab is the developer of Spark. Apart from Berkeley, which runs large-scale applications such as spam filtering and traffic prediction, 14 other companies including Conviva, Quantifind have contributed to Spark.
Spark has been imparted the following features:
- open-source under BSD licence
- in-memory processing
- multi-language APIs in Scala, Java and Python
- rich array of parallel operators
- runnable on Apache Mesos, YARN, Amazon EC2 or in standalone mode
- best suitable for highly iterative jobs
- efficient for interactive data mining jobs
Need of Spark
Spark is a result of the fleeting ongoing developments in the Big-Data world. "Big-Data", a term that can be used to describe data which has got 3 V's to it, Volume, Variety and Velocity. To store and process Big-Data, specialized frameworks were in demand. Hadoop is a predominantly established software framework, composed of a file-system called Hadoop Distributed File System(HDFS) and Map-Reduce. Spark is a strong contender of Map-Reduce. Due to its in-memory processing capability, Spark offers lightening fast results as compared to Map-Reduce.
When to use Spark?
Apart from using Spark for common data processing applications, it should be used specifically for applications where in-memory operations are a major percentage to the processing. The following are the type of applications in which Spark specializes:
1. Iterative algorithms
2. Interactive data mining
Who is using Spark?
UC Berkeley AMPLab is the developer of Spark. Apart from Berkeley, which runs large-scale applications such as spam filtering and traffic prediction, 14 other companies including Conviva, Quantifind have contributed to Spark.
No comments:
Post a Comment