Statistics articles

8/14/2016 • EN

Simulations and modes of convergence

Discusses why simulation summaries should focus on quantiles and robust statistics rather than moments when evaluating asymptotic approximations.

Asymptotics Convergence Maximum Likelihood simulation statistics

Thomas Lumley

7/28/2016 • EN

One scoRe years

The author reflects on R's rise in programming language rankings and its unexpected adoption across diverse fields over 20 years.

data analysis programming languages R Software Rankings statistics

Thomas Lumley

6/4/2016 • EN

Computing the (simplest) sandwich estimator incrementally

Explains how to compute the Huber/White sandwich estimator incrementally in R's biglm package for large-scale linear regression.

Incremental Computation Linear Regression R Sandwich Estimator statistics

Thomas Lumley

3/20/2016 • EN

The conservative Bonferroni correction

Explores the surprising effectiveness and conservative nature of the Bonferroni correction for multiple hypothesis testing, even with many tests.

Bonferroni Correction Confidence Intervals Multiple Testing statistics Type I Error

Thomas Lumley

3/15/2016 • EN

Data science intro for math/phys background

A guide for academics with math/physics backgrounds transitioning into data science, covering skills, learning paths, and practical advice.

Data Science data visualization Machine Learning Python statistics

Piotr Migdał

1/20/2016 • EN

Is it that time of day?

A data analysis of a radio station's song rotation patterns using vector math and statistical methods to test anecdotal claims about repetitive playtimes.

data analysis data visualization statistics Time Series Vector Analysis

Thomas Lumley

1/13/2016 • EN

What does ‘design-consistent’ even mean?

Explores the statistical concept of 'design consistency' in survey sampling, comparing it to model consistency and discussing asymptotic theory.

Asymptotics Design Consistency Estimation Model Consistency statistics

Thomas Lumley

12/14/2015 • EN

A simple probability problem

Analyzing a classic probability problem involving dice rolls, its historical context with Newton and Pepys, and the mathematical intuition behind it.

Binomial Distribution data analysis mathematics Probability statistics

Thomas Lumley

9/22/2015 • EN

NZ Flag Referendum pseudorandom numbers

Analyzes the pseudorandom number generator defined in NZ Flag Referendum law, comparing it to the Wichmann-Hill algorithm and noting a potential flaw.

algorithm Legislation Pseudorandom Number Generator statistics Wichmann Hill

Thomas Lumley

9/14/2015 • EN

Good reasons for assuming a spherical cow

Explores valid reasons for using simplified assumptions like 'spherical cows' in statistical modeling and theoretical work.

Assumptions Computational Methods Modeling statistics Theory

Thomas Lumley

8/29/2015 • EN

Net Reclassification Index: surprisingly weird.

A technical critique of the Net Reclassification Index (NRI), a statistical measure for evaluating prediction model improvements, highlighting its surprising biases.

Biostatistics classification Net Reclassification Index prediction models statistics

Thomas Lumley

6/20/2015 • EN

A much-needed gap

A critique of using Shapiro-Wilk normality tests on large, complex survey data such as NHANES, explaining why doing so is statistically inappropriate.

data analysis Normality Testing Sampling Methodology Shapiro Wilk Test statistics

Thomas Lumley

5/20/2015 • EN

First Steps with Structural Equation Modeling

A guide to getting started with Structural Equation Modeling (SEM) in R using the lavaan package, based on a user group presentation.

data analysis Lavaan R statistics Structural Equation Modeling

Noam Ross

5/3/2015 • EN

What’s the right proof of the Continuous Mapping Theorem?

Explores different proofs of the Continuous Mapping Theorem in probability theory, discussing their merits and pedagogical value.

Asymptotics Continuous Mapping Theorem Convergence In Distribution Probability Theory statistics

Thomas Lumley

3/29/2015 • EN

Reading citations is easier than most people think

The article debunks common misinterpretations of the Dunning-Kruger effect by analyzing the original study's data and findings.

data analysis research methodology scientific studies statistics

Dan Luu

3/15/2015 • EN

An introduction to ggplot with Myfanwy Johnston

A tutorial introducing the ggplot2 package for data visualization in R, presented at a user group meeting.

data visualization Ggplot2 programming R statistics

Noam Ross

3/7/2015 • EN

What does measurability mean?

A philosophical and technical exploration of the practical meaning of measurability in mathematical statistics, questioning its necessity for real-world data analysis.

Asymptotic Theory Mathematical Proofs Measurability Probability Theory statistics

Thomas Lumley

1/15/2015 • EN

2014 Year in Review

The author reviews 2014: writing a data science book from scratch in Python, and preparing for and starting a software engineering job at Google.

algorithm Data Science Machine Learning Python statistics

Joel Grus

7/19/2014 • EN

Dixon's Q test for outlier identification

A technical guide to Dixon's Q test for identifying outliers in small datasets, including its method, application, and criticisms.

data analysis Dixons Q Test Outlier Detection Small Sample Sizes statistics

Sebastian Raschka

6/12/2014 • EN

Frequentism and Bayesianism III: Confidence, Credibility, and why Frequentism and Science do not Mix

Explores the critical difference between frequentist confidence intervals and Bayesian credible regions, arguing why frequentism often fails scientific inquiry.

Bayesian Inference Confidence Intervals Credible Regions Frequentist Inference statistics

Jake VanderPlas