Big Data Digest: How many Hadoops do we need?

Joab Jackson | Jan. 19, 2015
This week brought a new data processing framework and computers that are more intimate with your feelings.

Say hello to Flink, the newest distributed data analysis engine on the scene.

This week, the Apache Software Foundation announced Apache Flink as its newest Top-Level Project (TLP). Apache also provides a home for Hadoop, Cassandra, Lucene and many other widely used open source data processing tools, so Flink's entry into the group speaks well for its technical chops.

Don't worry if you hadn't heard of Flink before -- it came as a surprise to us as well. Like Spark, another emerging data processing platform, Flink can ingest both batch data and streaming data. Apache Flink got its start as a research project at the Technical University of Berlin in 2009.

Why would someone choose Flink over Hadoop? Performance and ease of use, say the creators of the software.

The Flink engine exploits data streaming and in-memory processing to improve processing speed, said Kostas Tzoumas, a contributor to the project. Tzoumas is cofounder and CEO of data Artisans, a spin-off company that will commercialize Flink. Flink could serve as an ideal replacement for Hadoop for those who want faster performance.

Another advantage Flink offers is ease of use, Tzoumas said. Especially for large projects, the APIs (application programming interfaces) are an "order of magnitude" easier to use than programming for Hadoop's MapReduce, according to Tzoumas. APIs are provided for Java and Scala.

Music streaming service Spotify and travel software provider Amadeus are both testing the software, and it's been pressed into production at ResearchGate, a social network for scientists.

Nonetheless, with Hadoop and Spark growing in popularity, Flink may face an uphill battle when it comes to gaining users.

"Projects that depend on smart optimizers rarely work well in real life," wrote Curt Monash, head of IT analyst consultancy Monash Research, in an e-mail. He pointed to other projects relying on performance-enhancing tweaks that failed to gain traction, such as IBM's Learning Optimizer for DB2 and HP's NeoView data warehouse appliance.

Elsewhere, researchers at the Massachusetts Institute of Technology (MIT) are looking at ways to use data to better plan routine tasks, such as scheduling flights or finding the best route through a crowded city with mapping software.

Later this month, MIT researchers will present a set of new algorithms at the annual meeting of the Association for the Advancement of Artificial Intelligence (AAAI) that can plot the best route through a set of constraints.

Unlike current software that does this -- think automated airline reservation systems -- these algorithms can assess risk. For someone looking to get across town on a number of buses, they can weigh how often those buses are late and suggest alternatives where they make sense. The work is rooted in graph theory, which focuses on connections across multiple entities.
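
To give a rough sense of what weighing lateness can mean in practice, here is a minimal Python sketch. It is not the MIT researchers' algorithm, which the article does not describe in detail. It simply runs a standard shortest-path search (Dijkstra's algorithm) over a tiny bus network twice: once using timetable minutes only, and once using expected minutes that fold in how often each bus runs late. The stop names, travel times, lateness probabilities and delay figures are all made up for illustration.

```python
import heapq

# Hypothetical bus legs: (origin, destination, scheduled minutes,
# probability the bus is late, extra minutes lost when it is late).
LEGS = [
    ("Home", "Market", 10, 0.50, 20),
    ("Market", "Office", 10, 0.50, 20),
    ("Home", "Park", 15, 0.25, 4),
    ("Park", "Office", 12, 0.25, 4),
]


def scheduled_cost(minutes, p_late, delay):
    """Cost that trusts the timetable and ignores lateness."""
    return minutes


def expected_cost(minutes, p_late, delay):
    """Cost that adds the average delay: scheduled time + p_late * delay."""
    return minutes + p_late * delay


def best_route(legs, start, goal, cost):
    """Dijkstra's algorithm over the legs, using the given cost function.

    Returns (total cost, list of stops), or None if the goal is unreachable.
    """
    graph = {}
    for origin, dest, minutes, p_late, delay in legs:
        graph.setdefault(origin, []).append((dest, cost(minutes, p_late, delay)))

    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        total, stop, path = heapq.heappop(queue)
        if stop == goal:
            return total, path
        if stop in settled:
            continue
        settled.add(stop)
        for next_stop, weight in graph.get(stop, []):
            if next_stop not in settled:
                heapq.heappush(queue, (total + weight, next_stop, path + [next_stop]))
    return None


if __name__ == "__main__":
    print(best_route(LEGS, "Home", "Office", scheduled_cost))
    print(best_route(LEGS, "Home", "Office", expected_cost))
```

With these invented numbers, the timetable-only search prefers the route through Market, while the search that accounts for lateness prefers the slower but more reliable route through Park. A real planner would likely need more than an average delay, such as missed connections or a cap on acceptable risk, so treat this as an illustration of the idea only.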

 
