Introduction to Data Engineering Concepts | Data Lakehouse Architecture Explained
Explains the data lakehouse architecture, a unified approach combining data lake scalability with warehouse management features like ACID transactions.
A monthly roundup of curated links and articles on data engineering, Kafka, CDC, stream processing, and AI/ML topics.
A guide to building a data pipeline using DuckDB, covering data ingestion, transformation, and analytics with real-world environmental data.
A monthly roundup of interesting links and articles about data engineering, databases, streaming tech, and data infrastructure.
A comprehensive 2025 guide to Apache Iceberg, covering its architecture, ecosystem, and practical use for data lakehouse management.
Argues that RAG system failures stem from data engineering issues like fragmented data and governance, not from model or vector database choices.
Overview of Overture Maps Foundation's updated global, open geospatial datasets, their partners, and data refresh strategy.
Monthly roundup of news and resources in data streaming, stream processing, and the Apache Kafka ecosystem, curated by industry experts.
An overview of Apache Flink CDC, its declarative pipeline feature, and how it simplifies data integration from databases like MySQL to sinks like Elasticsearch.
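The declarative pipeline feature mentioned here is driven by a YAML definition rather than code. A rough sketch of the shape such a definition takes (hostnames, credentials, and table patterns below are placeholders; consult the Flink CDC docs for the exact options each connector supports):

```yaml
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: app_user
  password: "********"
  tables: app_db.\.*

sink:
  type: elasticsearch
  hosts: http://localhost:9200

pipeline:
  name: MySQL to Elasticsearch sync
  parallelism: 2
```

The pipeline runner turns this definition into a Flink job that snapshots the matched tables and then streams their change events to the sink.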
A profile of a Senior Analytics Engineer specializing in dbt, data mesh architecture, and applying library science principles to modern data teams.
Monthly roundup of news and developments in data streaming, stream processing, and the data ecosystem, featuring Apache Flink, Kafka, and open-source tools.
Explains how Parquet handles schema evolution, including adding/removing columns and changing data types, for data engineers.
An introduction to Apache Parquet, a columnar storage file format for efficient data processing and analytics.
Explains the hierarchical structure of Parquet files, detailing how pages, row groups, and columns optimize storage and query performance.
A practical guide to reading and writing Parquet files in Python using PyArrow and FastParquet libraries.
Explores using GitHub Actions for software development CI/CD and advanced data engineering tasks like ETL pipelines and data orchestration.
A former Debezium lead argues that Change Data Capture (CDC) is a feature within larger data platforms, not a standalone product.
Explores the core reasons for using Change Data Capture (CDC) to extract data from operational databases for analytics and other applications.
A comprehensive directory of Apache Iceberg resources, including tutorials, guides, and educational materials for data engineers and developers.
A technical guide on configuring Apache Flink to write data to Delta Lake tables stored on S3, including required JARs and configuration steps.