Wednesday, May 27, 2026

AI: Evals for Agentic Systems: Promptfoo by OpenAI

Macro Evals for Agentic Systems @OpenAI

When an agentic system fails, the problem is often larger than a single bad response. A handoff may happen too late, a specialist agent may miss the same signal across many runs, or a review process may trigger for the wrong class of cases. To improve the system, teams need to see recurring behavior across the whole population of traces.

This cookbook walks through a macro-eval workflow for a multi-agent system. We use a synthetic EV order workflow where specialist agents handle pricing, compliance, supply, factory routing, scheduling, and release decisions while market and operational conditions change.


Intro | Promptfoo

romptfoo is an open-source CLI and library for evaluating and red-teaming LLM apps.

With promptfoo, you can:

  • Build reliable prompts, models, and RAGs with benchmarks specific to your use-case
  • Secure your apps with automated red teaming and pentesting
  • Speed up evaluations with caching, concurrency, and live reloading
  • Score outputs automatically by defining metrics
  • Use as a CLI, library, or in CI/CD
  • Use OpenAI, Anthropic, Azure, Google, HuggingFace, open-source models like Llama, or integrate custom API providers for any LLM API

The goal: test-driven LLM development, not trial-and-error.





No comments:

Post a Comment