Hadoop, NY Times and Open Source Libraries

nosql:

I guess everyone with some interest in Hadoop already knows the story of the NY Times converting more than 130 years' worth of articles (11 million articles in TIFF format) into PDFs using Hadoop and Amazon EC2 [1]. What I didn’t know, though, is that this wasn’t a one-time project: the NY Times continues to use Hadoop for other projects [2], and it has open sourced [3] the Map/Reduce Toolkit (MRToolkit) [4] for use with a not-so-well-known feature, Hadoop Streaming [5].

It takes care of the details of setting up and running Apache Hadoop jobs, and encapsulates most of the complexity of writing map and reduce steps. The toolkit, which is Ruby-based, provides the framework — you only have to supply the details of the map and reduce steps.
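To make "map and reduce steps" concrete, here is a minimal word-count sketch of the plain Hadoop Streaming contract that toolkits like this wrap: each step reads lines on stdin and writes tab-separated key/value lines on stdout, and the reducer sees its input sorted by key. This is not MRToolkit's API. It is written in Python for brevity, and the function names and the "map"/"reduce" command-line argument are assumptions made for illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Map step: emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reduce step: sum the counts for each key.

    Assumes the input is already sorted by key, as it is between the
    map and reduce phases of a streaming job.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: the script is called with "map" or "reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Saved under a name of your choosing, the same script can be tried locally by piping a text file through the map step, then `sort`, then the reduce step, which roughly mimics what happens between the two phases on a cluster.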

There is also another Ruby library for Hadoop Streaming, ☞ wukong, which simplifies the data interaction layer:

Treat your dataset like a

  • stream of lines when it’s efficient to process by lines
  • stream of field arrays when it’s efficient to deal directly with fields
  • stream of lightweight objects when it’s efficient to deal with objects
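The three views in the list above can be sketched as layered generators. This is only an illustration of the idea in Python, not wukong's actual API: the function names, the tab separator, and the `Record` layout are all assumptions.

```python
from collections import namedtuple

# Hypothetical record layout, used only for illustration.
Record = namedtuple("Record", ["record_id", "year", "title"])


def as_lines(stream):
    """View 1: a stream of lines with the trailing newline removed."""
    for raw in stream:
        yield raw.rstrip("\n")


def as_fields(stream, sep="\t"):
    """View 2: a stream of field arrays, one list of strings per line."""
    for line in as_lines(stream):
        yield line.split(sep)


def as_objects(stream, record_type=Record):
    """View 3: a stream of lightweight objects built from the fields."""
    for fields in as_fields(stream):
        yield record_type(*fields)
```

Each view builds on the one before it, so a job only pays for the parsing it actually needs: a line filter can stop at `as_lines`, while code that wants named attributes can go all the way to `as_objects`.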

Do you have any favorite library that you use with Hadoop? Is it in our NoSQL libraries list?

References




Copyright Prajwal Tuladhar.

Some Rights Reserved