Randy Zwitch • 7/31/2013

Hadoop Streaming with Amazon Elastic MapReduce, Python and mrjob

This article describes a way to classify hundreds of millions of URLs by content without visiting them. It explains the limitations of local processing, introduces the concept of embarrassingly parallel problems, and walks through running the task at scale on a Hadoop cluster using Python, the mrjob library, and Amazon Elastic MapReduce (EMR).

#Python #Hadoop Streaming #Amazon Elastic MapReduce