Hadoop Streaming with Amazon Elastic MapReduce, Python and mrjob
This article details a solution for classifying hundreds of millions of URLs by content without visiting them. It explains the limitations of local processing, introduces the concept of embarrassingly parallel problems, and provides a technical walkthrough using Python, the mrjob library, and Amazon Elastic MapReduce (EMR) to run the task on a Hadoop cluster for massive scalability.
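
As a rough illustration of why this task is embarrassingly parallel, here is a minimal sketch in the Hadoop Streaming style: a mapper that turns each URL into a tab-separated `category<TAB>1` record, and a reducer that sums counts over key-sorted input. It uses plain Python rather than mrjob, and the `CATEGORY_KEYWORDS` rules and the `classify`, `mapper`, and `reducer` names are hypothetical placeholders, not the original article's code.

```python
from itertools import groupby
from urllib.parse import urlsplit

# Hypothetical keyword rules, for illustration only; the article's own rules are not shown here.
CATEGORY_KEYWORDS = {
    "sports": ("football", "soccer", "nba"),
    "news": ("news", "politics"),
    "shopping": ("shop", "store", "cart"),
}


def classify(url):
    """Guess a category from the URL text alone, without fetching the page."""
    parts = urlsplit(url.strip().lower())
    text = parts.netloc + parts.path
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "unknown"


def mapper(lines):
    """Emit one 'category<TAB>1' record per non-empty input line."""
    for line in lines:
        if line.strip():
            yield f"{classify(line)}\t1"


def reducer(sorted_lines):
    """Sum the counts for each category; input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for category, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{category}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    sample_urls = [
        "http://example.com/football/scores",
        "http://news.example.com/politics/today",
        "http://example.org/about",
    ]
    # Locally, sorted() stands in for the shuffle-and-sort step Hadoop runs between map and reduce.
    for record in reducer(sorted(mapper(sample_urls))):
        print(record)
```

Because each URL is classified independently of every other URL, the mapper can run on any number of machines at once; only the final per-category sums need the records grouped by key.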