Practical Guide to Evaluating and Testing Agent Skills
A guide to systematically evaluating and testing AI agent skills, covering how to define success criteria, build an evaluation harness, and improve skill performance.