Testing Data Pipelines: What to Validate and When
Explains the importance of automated testing for data pipelines, covering schema validation, data quality checks, and regression testing.
Explains idempotent data pipelines, patterns like partition overwrite and MERGE, and how to prevent duplicate data during retries.
A guide to designing reliable, fault-tolerant data pipelines with architectural principles like idempotency, observability, and DAG-based workflows.
A guide to the core principles and systems thinking required for data engineering, beyond just learning specific tools.
A technical article on how visibility and communication, not just speed, are critical for engineering team success and stakeholder trust.
A developer's journey to understanding AI agents and the Model Context Protocol (MCP), moving beyond traditional data pipeline thinking.
A monthly roundup of curated links and articles focused on data engineering, Apache Kafka, and data platform technologies.
A guide comparing Apache Flink SQL, Kafka Connect, and Confluent Tableflow for moving data from Apache Kafka to Apache Iceberg tables.
Explains core data engineering concepts, comparing ETL and ELT data pipeline strategies and their use cases.
An introductory guide to data engineering, explaining its role, key concepts, and how it differs from data science in the modern data ecosystem.
Explores how DevOps principles like CI/CD, infrastructure as code, and monitoring are applied to data engineering for reliable, scalable data pipelines.
Explains batch processing fundamentals for data engineering, covering concepts, tools, and its ongoing relevance in data workflows.
Explores the importance of data quality and validation in data engineering, covering key dimensions and tools for reliable pipelines.
Explains streaming data fundamentals, how streaming systems work, their use cases, and challenges compared to batch processing.
Overview of a university-level Data Engineering course syllabus covering tools, pipelines, AI pair programming, and project-based learning for Fall 2024.
A tutorial on setting up and running PyFlink streaming data jobs on a Kubernetes cluster, including installation and deployment steps.
A tutorial on setting up and running PyFlink streaming data jobs on a Kubernetes cluster, including prerequisites and deployment steps.
Explores how Azure services like Data Factory, Databricks, and Machine Learning enable DataOps for streamlined, automated data pipelines.
Explores essential design patterns for building efficient and maintainable machine learning systems in production, focusing on data pipelines and best practices.
An overview of Apache Kafka, explaining its core concepts as a distributed event streaming platform for real-time data pipelines.