Data Engineering articles

6/2/2022 • EN

Data what??

A guide explaining key data engineering terms like data warehouses, data lakes, data mesh, and data pipelines, with definitions and comparisons.

Data Engineering Data Fabric Data Lake Data Mesh Data Warehouse

Rob Koch

1/27/2022 • EN

Use external Hive Metastore for Synapse Spark Pool

A guide to configuring an external Apache Hive metastore with Azure SQL for use in an Azure Synapse Analytics Spark Pool, and to troubleshooting common connection errors.

Azure SQL Azure Synapse Data Engineering Hive Metastore Spark

Benjamin Perkins

11/2/2021 • EN

Debezium and Friends – Conference Talks 2021

A recap of 2021 conference talks on Debezium and Change Data Capture (CDC), exploring patterns and integrations with tools like Kafka and Pinot.

Apache Kafka change data capture Data Engineering Debezium distributed systems

Gunnar Morling

11/2/2021 • EN

Debezium and Friends – Conference Talks 2021

A recap of 2021 conference talks on Debezium and Change Data Capture (CDC), exploring patterns and integrations with tools like Kafka and Infinispan.

Apache Kafka change data capture Data Engineering Debezium distributed systems

Gunnar Morling

7/11/2021 • EN

Data Fluent for PostgreSQL

Introducing Data Fluent, an open-source Python package for analyzing and understanding PostgreSQL database structure, row counts, and growth trends.

Data Engineering Database Analysis open source postgresql Python

Mark Litwintschik

5/15/2021 • EN

New MongoDB Aggregations book is out

Announcing the free release of 'Practical MongoDB Aggregations', a book with tips and examples for developers and data professionals.

Aggregations book Data Engineering database mongodb

Paul Done

2/21/2021 • EN

Feature Stores: A Hierarchy of Needs

Explores the concept of feature stores in machine learning, presenting a hierarchy of needs from basic access to full automation.

AWS SageMaker Data Engineering Feature Stores Machine Learning MLOps

Eugene Yan

2/2/2021 • EN

Performing a GROUP BY on data in bash

Using bash shell tools like kafkacat, jq, sort, and uniq to perform a GROUP BY-style analysis on data from a Kafka topic.

bash Data Engineering Jq Kafkacat Unix Pipelines

Robin Moffatt

10/25/2020 • EN

Data Discovery Platforms and Their Open Source Solutions

An analysis of data discovery platforms, their key features, and available open-source solutions to improve data findability in organizations.

Data Catalog Data Discovery Data Engineering Metadata Management open source

Eugene Yan

8/9/2020 • EN

Unpopular Opinion: Data Scientists Should be More End-to-End

Argues that data scientists should own the entire process from problem identification to solution deployment for greater impact and efficiency.

Data Engineering Datascience full stack Machinelearning MLOps

Eugene Yan

7/5/2020 • EN

My Notes From Spark+AI Summit 2020 (Application-Specific Talks)

Notes from Spark+AI Summit 2020 covering application-specific talks on ML frameworks, data engineering, feature stores, and data quality from companies like Airbnb and Netflix.

Data Engineering Feature Engineering Machine Learning production Spark

Eugene Yan