Selected Links

On all things software, by Éric PETIT
Why Apache Spark is a crossover hit for data scientists

For further reading on Apache Spark: Sean Owen, of Cloudera, wrote a post last March explaining the value Apache Spark adds:

The array of tools available to data scientists tells a story of unfortunate tradeoffs:

  • R offers a rich environment for statistical analysis and machine learning, but it has some rough edges when performing many of the data processing and cleanup tasks that are required before the real analysis work can begin. As a language, it’s not similar to the mainstream languages developers know.
  • Python is a general purpose programming language with excellent libraries for data analysis like Pandas and scikit-learn. But like R, it’s still limited to working with an amount of data that can fit on one machine.
  • It’s possible to develop distributed machine learning algorithms on the classic MapReduce computation framework in Hadoop (see Apache Mahout). But MapReduce is notoriously low-level and difficult to express complex computations in.
  • Apache Crunch offers a simpler, idiomatic Java API for expressing MapReduce computations. But still, the nature of MapReduce makes it inefficient for iterative computations, and most machine learning algorithms have an iterative component.

And so on. There are both gaps and overlaps between these and other data science tools. Coming from a background in Java and Hadoop, I sometimes wonder, with envy: why can't we have a nice REPL-like investigative analytics environment like the one Python and R users have? One that's still scalable and distributed? That has the nice distributed-collection design of Crunch? And that can equally be used in operational contexts?
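Spark is, of course, the answer the post goes on to give. As a rough sketch (this is not code from Owen's post; the input file `events.log` and the error-counting workload are invented for illustration), here is the kind of session Spark's Scala shell makes possible, wrapped as a standalone program:

```scala
// A minimal sketch of the REPL-style workflow Spark enables. The body of
// main() could equally be typed line by line into spark-shell, where the
// `spark` and `sc` values are predefined.
import org.apache.spark.sql.SparkSession

object ReplSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repl-sketch")
      .master("local[*]") // swap in a cluster master URL to scale out
      .getOrCreate()
    val sc = spark.sparkContext

    // A distributed collection (RDD), manipulated with the same
    // collection-style operations Crunch offered over MapReduce.
    val lines  = sc.textFile("events.log") // hypothetical input file
    val errors = lines.filter(_.contains("ERROR")).cache()

    // Interactive, exploratory queries against the cached data.
    println(errors.count())
    errors.take(5).foreach(println)

    // Further passes reuse the in-memory dataset, where each MapReduce
    // iteration would re-read its input from disk.
    val counts = errors
      .map(line => (line.split(" ")(0), 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)

    spark.stop()
  }
}
```

The `cache()` call is what addresses the iterative weakness of MapReduce noted above: the filtered dataset stays in memory across repeated queries and algorithm iterations instead of being recomputed from disk on every pass.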

Friday, June 19, 2015
