Big Data articles

1/30/2026 • EN

Microsoft's 2026 Global ML Building Footprints

Analysis of Microsoft's 2026 Global ML Building Footprints dataset, including technical setup and data exploration using DuckDB and QGIS.

Big Data Dataset DuckDB Geospatial Data Machine Learning

Mark Litwintschik

4/10/2025 • EN

PySpark 101: Introduction to Big Data with Spark

A beginner-friendly introduction to using PySpark for big data processing with Apache Spark, covering the fundamentals.

Apache Spark Big Data distributed computing PySpark Python

Matt Layman

3/9/2025 • EN

9 new books added to Big Book of R

Announces 9 new free and paid books added to the Big Book of R collection, covering data science, visualization, and package development.

Big Data Data Science Package Development R Programming Statistical Computing

Oscar Baruffa

1/20/2025 • EN

2025 Comprehensive Guide to Apache Iceberg

A comprehensive 2025 guide to Apache Iceberg, covering its architecture, ecosystem, and practical use for data lakehouse management.

Apache Iceberg Big Data Data Engineering Data Lakehouse Table Format

Alex Merced

10/21/2024 • EN

All About Parquet Part 03 - Parquet File Structure | Pages, Row Groups, and Columns

Explains the hierarchical structure of Parquet files, detailing how pages, row groups, and columns optimize storage and query performance.

Big Data Columnar Storage Data Engineering File Format Parquet

Alex Merced

10/21/2024 • EN

All About Parquet Part 10 - Performance Tuning and Best Practices with Parquet

Final guide in a series covering performance tuning and best practices for optimizing Apache Parquet files in big data workflows.

Big Data Data Compression Data Lake Parquet performance tuning

Alex Merced

10/21/2024 • EN

All About Parquet Part 09 - Parquet in Data Lake Architectures

Explores why Parquet is the ideal columnar file format for optimizing storage and query performance in modern data lake and lakehouse architectures.

Apache Iceberg Big Data Columnar Storage Data Lake Parquet

Alex Merced

10/21/2024 • EN

All About Parquet Part 01 - An Introduction

An introduction to Apache Parquet, a columnar storage file format for efficient data processing and analytics.

Big Data Columnar Storage Data Engineering Data Format Parquet

Alex Merced

4/4/2024 • EN

A Deep Intro to Apache Iceberg and Resources for Learning More

An introduction to Apache Iceberg, a table format for data lakehouses, explaining its architecture and providing learning resources.

Apache Iceberg Big Data Data Engineering Data Lakehouse Table Format

Alex Merced

2/13/2024 • EN

Datacast Episode 132: Big Data Engineering, Data Culture from First Principles, and Reimagined Metadata with Suresh Srinivas

Interview with Suresh Srinivas on his career in big data, founding Hortonworks, scaling Uber's data platform, and leading the OpenMetadata project.

Apache Hadoop Big Data Data Engineering metadata OpenMetadata

James Le

2/12/2024 • EN

Partitioning Practices in Apache Hive and Apache Iceberg

Compares partitioning techniques in Apache Hive and Apache Iceberg, highlighting Iceberg's advantages for query performance and data management.

Apache Hive Apache Iceberg Big Data Data Partitioning Query Optimization

Alex Merced

1/3/2024 • EN

1️⃣🐝🏎️🦆 (1BRC in SQL with DuckDB)

A technical guide to solving the One Billion Row Challenge (1BRC) using SQL and DuckDB, including data loading and aggregation.

Big Data data processing DuckDB performance optimization SQL

Robin Moffatt

1/1/2024 • EN

The One Billion Row Challenge

A Java programming challenge to process one billion rows of temperature data, focusing on performance optimization and modern Java features.

benchmarking Big Data concurrency Java performance optimization

Gunnar Morling

11/16/2023 • EN

Learning Apache Flink S01E06: The Flink JDBC Driver

Exploring the two JDBC driver options for connecting to Apache Flink: the new Flink JDBC driver and the Hive JDBC driver via the SQL Gateway.

Apache Flink Big Data Data Streaming JDBC SQL Gateway

Robin Moffatt

10/2/2023 • EN

Learning Apache Flink S01E02: What is Flink?

An introductory overview of Apache Flink, explaining its core concepts as a distributed stream processing framework, its history, and primary use cases.

Apache Flink Big Data Data Engineering distributed systems Stream Processing

Robin Moffatt

9/29/2023 • EN

Learning Apache Flink S01E01: Where Do I Start?

A developer's personal journey and structured plan for learning Apache Flink, a stream processing framework, starting from the basics.

Apache Flink Big Data Data Streaming distributed systems Stream Processing

Robin Moffatt

11/22/2022 • EN

Understanding Spark Configurations with Apache Iceberg

A guide to configuring Apache Spark for use with the Apache Iceberg table format, covering packages, flags, and programmatic setup.

Apache Iceberg Apache Spark Big Data Data Lake Spark Configurations

Alex Merced

9/14/2022 • EN

Stretching my Legs in the Data Engineering Ecosystem in 2022

A data engineer explores the evolution of the data ecosystem, comparing past practices with modern tools and trends in 2022.

Apache Kafka Big Data Data Engineering Data Warehousing Stream Processing

Robin Moffatt

11/24/2019 • EN

Data is Overrated*

Argues that raw data is overvalued without proper context and conversion into meaningful information and knowledge.

AI Big Data Data Information Knowledge

Niko Neugebauer

10/13/2018 • EN

Approximate Distinct Count

Explains the APPROX_COUNT_DISTINCT function for faster, memory-efficient distinct counts in SQL, comparing it to exact COUNT(DISTINCT).

algorithm Approximate Distinct Count Big Data data processing HyperLogLog

Niko Neugebauer

