Philipp Schmid • 10/15/2025

AI Agent Benchmark Compendium

This article is a compendium of more than 50 benchmarks for evaluating AI agents, organized into four categories: Function Calling & Tool Use, General Assistant & Reasoning, Coding & Software Engineering, and Computer Interactions. It provides descriptions and links to papers, GitHub repositories, and leaderboards for major benchmarks such as BFCL, ToolBench, and τ-Bench, and is intended as a technical reference for developers and researchers.

#AI Agents #Benchmarks #Function Calling