Sunday, June 07, 2015

Big Data, MapReduce: IIS+ASP.NET vs. Hadoop vs. node.js vs. DataFlow

Big Data - MapReduce Without Hadoop Using the ASP.NET Pipeline
"...created a very simple infrastructure that can use MapReduce to either do computationally intensive processing out on the “mesh” nodes or, alternatively, do data collection out on those nodes, with the results being correlated and aggregated into one final result that’s returned to the client."
Emulate Distributed Computing Patterns Such as “Scatter-Gather”
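
The scatter-gather pattern described in the quote above can be sketched in a few lines. This is only an illustration under stated assumptions, not the article's ASP.NET implementation: a local thread pool stands in for the remote "mesh" nodes, and the names map_on_node and scatter_gather are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_on_node(chunk):
    """Map step: stands in for work done on one (hypothetical) mesh node."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def scatter_gather(lines, node_count=3):
    """Scatter lines across workers, then reduce the partial counts."""
    # Scatter: deal the lines round-robin into one chunk per worker.
    chunks = [lines[i::node_count] for i in range(node_count)]
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(map_on_node, chunks))
    # Gather / reduce: merge the per-node results into one final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    sample = ["big data map reduce", "map reduce without hadoop", "big data"]
    print(scatter_gather(sample).most_common(3))
```

In a real deployment each call to map_on_node would presumably be an HTTP request to another server, which is where the network latency discussed at the end of this post comes in.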

Node.js Streaming MapReduce with Amazon EMR - AWS Big Data Blog

mapred (npm for node.js)

MapReduce in MongoDB for Node.js Code Example - Runnable

algorithm - MapReduce alternatives - Stack Overflow

Google Re-Imagines MapReduce, Launches DataFlow
"It’s well known in the industry that more than 10 years ago Google invented MapReduce, the technology at the heart of first-generation Hadoop. It’s less well known that Google moved away from MapReduce several years ago. Today at its Google I/O 2014 conference, the Web giant unveiled a possible successor to MapReduce called Dataflow, which it’s selling through its hosted cloud service.

Google Cloud Dataflow is a managed service for creating data pipelines that ingest, transform, and analyze massive amounts of data, up into the exabyte range. The same work done in the Dataflow SDK can be used for either batch or streaming analytics, Google says. Dataflow is based on internal Google technologies like Flume and MillWheel, and can be thought of as a successor to MapReduce that’s especially well-suited for massive ETL jobs."
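
To make the "same work for batch or streaming" idea concrete, here is a minimal sketch in plain Python. It does not use the Dataflow SDK or imitate its API; count_words and batch are hypothetical names, and the only assumption is that the transform is written against an iterable so it does not care whether the input is finite or live.

```python
from collections import Counter


def count_words(lines):
    """Yield a running word count after each input line."""
    totals = Counter()
    for line in lines:
        totals.update(line.split())
        # Emit a snapshot so a streaming consumer sees results as they arrive.
        yield dict(totals)


def batch(lines):
    """Batch mode: consume a finite input and keep only the final snapshot."""
    result = {}
    for result in count_words(lines):
        pass
    return result


if __name__ == "__main__":
    data = ["map reduce", "map shuffle reduce"]
    # Streaming mode: act on every intermediate snapshot.
    for snapshot in count_words(iter(data)):
        print("so far:", snapshot)
    # Batch mode: same transform, only the final answer.
    print("final:", batch(data))
```

The transform is written once; only the way the caller consumes it differs between the two modes.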




Hadoop 2.0 is generalizing its infrastructure to support any kind of distributed processing, not just MapReduce. At the same time, there are efforts to use existing general-purpose tools, such as the IIS web server, for processing "Big Data".
It would be interesting to compare the real-world performance of these alternatives. Network latency may dominate total processing time, but the efficiency of the processing on each node still matters.

MapReduce is not the only way to process big or distributed data, and increasingly it is not the best way either. It is useful to have choices, as long as they are not too complex.