Saturday, January 17, 2015

Apache Spark

Why Spark Is the Next Top (Compute) Model @ InfoQ
"Dean Wampler argues that Spark/Scala is a better data processing engine than MapReduce/Java because tools inspired by mathematics, such as FP, are ideal tools for working with data."

The advantage of Spark over Hadoop is that, unlike Map/Reduce, it does not need to save data to disk after each step, which can provide a significant performance gain (sometimes 100x). Wampler suggests that Spark is to Hadoop what Spring is to J2EE: a significant improvement and simplification.

Spark is written in Scala, but it is usable from Java and Python, as well as through variations of SQL (HiveQL).
It also includes modules for machine learning.
Compute model: the "RDD", or Resilient Distributed Dataset.
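
To make the RDD idea a bit more concrete, here is a toy sketch in plain Python. It is not Spark's API and it is not distributed; the class name ToyRDD and all of its methods are made up for illustration. It only mimics three ideas usually associated with RDDs: transformations are recorded lazily, each dataset remembers the chain of steps that produced it (its lineage), and an intermediate result can be kept in memory and reused instead of being written out after every step.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Hypothetical, single-machine stand-in for an RDD (not Spark's real class)."""

    def __init__(self, compute: Callable[[], Iterator], lineage: str):
        self._compute = compute              # recipe that produces the elements
        self.lineage = lineage               # readable chain of transformations
        self._cached: Optional[List] = None  # in-memory copy, filled on demand
        self._want_cache = False

    @staticmethod
    def parallelize(data: Iterable) -> "ToyRDD":
        items = list(data)
        return ToyRDD(lambda: iter(items), "parallelize")

    def _iterate(self) -> Iterator:
        # Reuse the in-memory copy if there is one; otherwise run the recipe.
        if self._cached is not None:
            return iter(self._cached)
        if self._want_cache:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def map(self, f: Callable) -> "ToyRDD":
        # Lazy: only records the step; nothing is computed here.
        return ToyRDD(lambda: (f(x) for x in self._iterate()),
                      self.lineage + " -> map")

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)),
                      self.lineage + " -> filter")

    def cache(self) -> "ToyRDD":
        # Ask for the result to be kept in memory the first time it is computed.
        self._want_cache = True
        return self

    def collect(self) -> List:
        # An "action": this is what actually triggers the computation.
        return list(self._iterate())


calls = []  # records every element that square() actually processes


def square(x: int) -> int:
    calls.append(x)
    return x * x


squares = ToyRDD.parallelize(range(1, 6)).map(square).cache()
evens = squares.filter(lambda x: x % 2 == 0)

assert calls == []            # nothing has run yet: transformations are lazy
print(evens.lineage)          # parallelize -> map -> filter
print(evens.collect())        # [4, 16]
print(squares.collect())      # [1, 4, 9, 16, 25], served from the cached copy
print(len(calls))             # 5: square() ran once per element, not twice
```

In real Spark the data is partitioned across a cluster and the lineage is what allows a lost partition to be recomputed; the toy above leaves all of that out and only shows the lazy, in-memory chaining.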

Unified Big Data Processing with Apache Spark @ InfoQ

Apache Spark 1.2.0 Supports Netty-based Implementation, High Availability and Machine Learning APIs

Use Script Action in HDInsight to install Spark on Hadoop cluster | Azure

Spark, Storm and Real Time Analytics

Apache Spark™ - Lightning-Fast Cluster Computing


