Statistics articles

10/31/2019 • EN

The secular Bayesian: Using belief distributions without really believing

A data scientist's journey from dogmatic Bayesianism to a pragmatic, 'secular' use of Bayesian tools that does not require believing the model is literally true.

Bayesian Inference Data Science Machine Learning Modeling statistics

Ferenc Huszár

10/1/2019 • EN

Some things I don’t like about the Oxford-Munich Code of Conduct

A critique of the Oxford-Munich Code of Conduct for Data Scientists, focusing on its technical recommendations on sampling and data retention.

Code Of Conduct Data Science Ethics sampling statistics

Thomas Lumley

8/11/2019 • EN

NumPy Exercises Part 1

Explains the theory behind linear regression models, a fundamental machine learning algorithm for predicting continuous numerical values.

Linear Regression Machine Learning Numpy Python statistics

Stern Semasuka

6/20/2019 • EN

Updating Statistics on Secondary Replicas of the Availability Groups

A technical guide exploring workarounds to update SQL Server statistics on secondary replicas in Availability Groups, including scripts and methods.

Availability Groups Database Administration Secondary Replicas SQL Server statistics

Niko Neugebauer

6/16/2019 • EN

Analysing the mouse microbiome autism data

A statistical re-analysis of a published study on the mouse microbiome and autism, examining data and p-values from behavioral experiments.

Autism Research data analysis Microbiome R statistics

Thomas Lumley

6/13/2019 • EN

Logistic Regression from Bayes' Theorem

Explains the mathematical derivation of logistic regression from Bayes' theorem, connecting fundamental statistics to machine learning.

Bayes Theorem Logistic Regression Machine Learning Probability statistics

Will Kurt

6/11/2019 • EN

Confidence intervals: not a very strong property

A statistical analysis discussing the limitations of confidence intervals, using examples from small-area sampling to illustrate their weak properties.

Bayesian Inference Confidence Intervals data analysis sampling statistics

Thomas Lumley

4/30/2019 • EN

What does a Data Scientist really do?

A data scientist clarifies common misconceptions about the field, explaining that machine learning is only a small part of the job and advanced degrees aren't always required.

Career data analysis Data Science Machine Learning statistics

Eugene Yan

3/4/2019 • EN

Normal horizontiles

A technical analysis verifying a statistical calculation from an XKCD comic, involving normal distribution probabilities and R code.

Integration Normal Distribution Probability R Programming statistics

Thomas Lumley

3/1/2019 • EN

Displaying bus punctuality

A technical analysis of bus punctuality using Auckland Transport API data, with R code for data processing and visualization.

API data analysis R statistics Visualization

Thomas Lumley

1/31/2019 • EN

A Deeper look at Mean Squared Error

A technical exploration of Mean Squared Error, breaking it down into bias and variance to understand model performance and irreducible uncertainty.

Bias Variance Tradeoff Machine Learning Mean Squared Error Model Evaluation statistics

Will Kurt

1/29/2019 • EN

Half a dozen frequentist and Bayesian ways to measure the difference in means in two groups

A guide to six statistical methods (frequentist and Bayesian) for comparing group means, with R and Stan code examples.

Bayesian Inference data analysis Frequentist Inference R statistics

Andrew Heiss

1/11/2019 • EN

The Ihaka Lectures 3: Rise of the Machine Learners

Announcement for a lecture series on machine learning, covering topics like Weka, deep learning, algorithmic fairness, and sparse supervised learning.

Algorithmic Fairness Data Science Machine Learning statistics Supervised Learning

Thomas Lumley

12/5/2018 • EN

How to test any hypothesis with the infer package

A tutorial on using the infer package in R for hypothesis testing through simulation, following a modern statistical approach.

Hypothesis Testing Infer Package R statistics Tidyverse

Andrew Heiss

10/4/2018 • EN

The Kiwi PRNG

Analysis of a bug in New Zealand's official pseudo-random number generator used for electoral vote counting, based on the Wichmann-Hill algorithm.

algorithm bug Pseudorandom Number Generator statistics Wichmann Hill

Thomas Lumley

9/30/2018 • EN

Columnstore Indexes – part 126 (“Extracting Columnstore Statistics to Cloned Database”)

Explores SQL Server 2019's improved DBCC CLONEDATABASE command for automatically extracting Columnstore Index statistics into a cloned database.

Columnstore Indexes Database Cloning Dbcc Clonedatabase SQL Server statistics

Niko Neugebauer

9/13/2018 • EN

The Waiting Time Paradox, or, Why Is My Bus Always Late?

Explores the 'waiting time paradox' using probability, simulation, and real bus data to explain why the average wait for a bus can be longer than half the scheduled interval.

data analysis Inspection Paradox Probability simulation statistics

Jake VanderPlas