Practical Guide to Evaluating and Testing Agent Skills
A guide to systematically evaluating and testing AI agent skills, covering success criteria, building an evaluation harness, and improving skill performance.
Philipp Schmid is a Staff Engineer at Google DeepMind, building AI Developer Experience and DevRel initiatives. He specializes in LLMs, RLHF, and making advanced AI accessible to developers worldwide.
189 articles from this blog
A guide to writing effective AGENTS.md files for AI coding agents, based on research data and best practices.
Explains the difference between an AI agent's inner loop (verifying work within a task) and outer loop (learning across tasks).
Guide to using multimodal function calling with Gemini 3 and the Interactions API to build AI agents that can process and analyze images.
A guide to using the Gemini Deep Research API for complex research tasks, including polling and streaming methods with code examples.
Introduces the Agent Client Protocol (ACP), an open standard for unifying communication between AI coding agents and code editors.
A quick start guide for Google's Gemini Interactions API, covering setup, stateful conversations, and multimodal interactions.
Explains why MCP servers often fail and provides best practices for building effective MCP servers by treating them as AI agent interfaces, not REST API wrappers.
A technical guide to generating transparent PNG stickers with the Gemini API, using a chroma key green background and HSV color detection for clean background removal.
A guide to building AI agents using the Gemini Interactions API, covering core concepts and a step-by-step CLI implementation.
Introducing mcp-cli, a lightweight CLI tool for efficient, dynamic discovery and interaction with MCP servers, drastically reducing token usage for AI agents.
Explains the concept of an Agent Harness, a system for managing reliable, long-running AI agents, and its growing importance in AI development.
A software engineer's predictions for AI trends in 2026, covering generative UI, edge-based agents, smart homes, and the evolving role of engineers.
Explores advanced Context Engineering techniques for AI agents, focusing on combating Context Rot and improving multi-agent coordination.
Argues that senior engineers struggle with AI agent development because their ingrained deterministic habits clash with the probabilistic nature of agent engineering.
A step-by-step tutorial on building a functional AI agent using the Gemini 3 Pro model and Python, covering core concepts like tools, loops, and context.
Best practices and structural patterns for effectively prompting the Gemini 3 AI model, focusing on directness, logic, and clear instruction.
A tutorial on using the Gemini API's File Search feature for RAG in web development with JavaScript/TypeScript.
A tutorial on building an AI agent using Google's Gemini, n8n workflow automation, and deploying it on Google Cloud Run with a PostgreSQL database.
A comprehensive overview of over 50 modern AI agent benchmarks, categorized into function calling, reasoning, coding, and computer interaction tasks.