Robin Moffatt 12/16/2016

ETL Offload with Spark and Amazon EMR - Part 2 - Code development with Notebooks and Docker

This technical article details the code development phase for a Spark-based ETL offload project on Amazon EMR. It explains building a data pipeline using Jupyter Notebooks within a Docker container for local development, covering steps like loading S3 files, data deduplication, joining with reference data, and writing results back.
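
The deduplication and reference-data join steps named above can be sketched in plain Python. This is only an illustration of the logic, not the article's code: the original work uses Spark on Amazon EMR, whereas this sketch uses in-memory lists and dictionaries and leaves out the S3 load and write-back steps. The field names (`event_id`, `account_id`, `account_name`) and the sample records are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, Optional[str]]


def deduplicate(records: Iterable[Record], key: str) -> List[Record]:
    """Keep the first record seen for each distinct value of `key`."""
    seen = set()
    unique: List[Record] = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique


def join_reference(
    records: Iterable[Record],
    reference: Dict[str, Record],
    key: str,
    ref_field: str,
) -> List[Record]:
    """Left join: copy `ref_field` from the reference row matching `key`.

    Records with no matching reference row get None for `ref_field`.
    """
    enriched: List[Record] = []
    for rec in records:
        ref_row = reference.get(rec[key], {})
        enriched.append({**rec, ref_field: ref_row.get(ref_field)})
    return enriched


# Hypothetical sample data standing in for files loaded from S3.
events = [
    {"event_id": "e1", "account_id": "a1"},
    {"event_id": "e1", "account_id": "a1"},  # exact duplicate of the first row
    {"event_id": "e2", "account_id": "a2"},
    {"event_id": "e3", "account_id": "a9"},  # no matching reference row
]
accounts = {
    "a1": {"account_name": "Acme"},
    "a2": {"account_name": "Globex"},
}

result = join_reference(
    deduplicate(events, "event_id"), accounts, "account_id", "account_name"
)
for row in result:
    print(row)
```

In Spark the same two steps would normally be expressed as DataFrame operations rather than Python loops, so that the work is distributed across the cluster.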
