Friday, May 22, 2026

Evaluating AI Agents: benchmark vs eval vs rubric

 In AI agent development, benchmarks are standardized, public tests used to compare the general capabilities of different models, evaluations (evals) are the comprehensive processes used to test an agent's fitness for your specific use case, and rubrics are the specific scoring rules and criteria used to judge those outputs. [1, 2, 3]

Breaking Down the Terminology
1. Benchmarks
  • What they are: Standardized, often public datasets or environments (like SWE-bench or OSWorld).
  • Purpose: They give a general "SAT score" to compare how different base models rank against one another.
  • Agent Context: They check whether an agent can perform broad tasks. However, an agent that scores high on a benchmark can still fail in real production because it lacks context about your specific messy data or workflows. [1, 2]
2. Evals (Evaluations)
  • What they are: Your broader, customized testing strategy.
  • Purpose: They measure fitness for purpose. An evaluation framework uses a mix of metrics—such as automated code checks, tracing, user feedback, and custom datasets from your own production errors—to determine if the agent actually achieves its business goals. [1, 2, 3, 4, 5]
3. Rubrics
  • What they are: The fine-grained rules, taxonomies, and expectations used by evaluators to grade specific agent behaviors.
  • Purpose: They define how an agent should be graded. Instead of just checking for a simple "pass/fail," a rubric checks whether the agent followed prompt instructions, used the correct tools in the right sequence, and retrieved the proper internal data. [1, 2, 3, 4, 5]
How They Work Together
When building an agent, teams use benchmarks during initial model selection. Once development begins, they build a custom eval system to test real-world functionality. Finally, they use a detailed rubric to teach "LLM-as-a-judge" systems or human reviewers exactly what a successful agent trace looks like. [1, 2, 3, 4]

Google AI in Chrome: Gemini Nano

Google Chrome integrates its lightweight, on-device AI model, Gemini Nano, into the browser. Running directly on your device, it powers features like text summarization, content rewriting, and scam detection without sending your data to external servers. [1, 2, 3]
How it works & controversies:
  • Silent Downloads: Chrome will silently download a ~4GB weights.bin file into your user profile directory when it detects your device meets the hardware and storage requirements.
  • Privacy and Speed: Because the AI operates client-side, your data remains secure within your device, and it can function offline once downloaded.
  • Controversy: The background download occurs without explicit user consent, which has sparked privacy and storage concerns among users. [1, 2, 3, 4, 5]
How to Disable and Remove Gemini Nano:
If you want to free up the storage space or opt out of on-device AI, you can easily disable it in your browser settings:
  1. Click the three dots in the top-right corner of Chrome.
  2. Go to Settings.
  3. Select the System tab.
  4. Toggle On-device AI to the Off position. [1, 2, 3]
Disabling this feature stops the model from running and triggers Chrome to remove the model files from your computer. [1, 2]
You can read more about Google's Built-In AI APIs and features directly on the Google for Developers AI on Chrome documentation.

With built-in AI, your browser provides and manages foundation and expert models. In Chrome, that includes Gemini Nano.


SpaceX’s $2T Case, Nvidia’s Shock Selloff, America Turns on AI, Trump Pulls AI Order, Bond Crisis? - YouTube @all-in podcast

In the video, the inclusion of Gemini Nano in the Chrome browser is discussed between (9:20) and (12:28).

  • The Discovery: It is noted that approximately two weeks prior to the recording, Google quietly included the Gemini Nano model in the Chrome browser without explicitly notifying users (9:20 - 9:46).
  • Technical Details: The model, which is roughly 4 gigabytes in size, is installed locally on the computer and handles tasks such as proofreading, spelling, and autocomplete (9:32 - 9:41).
  • Discussion on Privacy: The panel discusses the user reaction to this, noting that many people were surprised or "shocked" by the background download. While some raised concerns regarding privacy, the hosts generally agree that Google is not necessarily acting with malicious intent, suggesting it was more of a "speed error" in communication rather than them being a "bad actor" in the space (11:22 - 12:28).

Google I/O 2026 Releases and Cerebras’ $95B IPO w/ Andrew Feldman | EP #256 - YouTube @Moonshots podcasts

The video provides an extensive recap of Google I/O 2026, highlighting a massive shift toward "agentic" AI across the Google ecosystem. The participants note that Google successfully "disrupted the disruptors," pivoting from being perceived as "cooked" to re-establishing leadership with full-stack AI integration (0:00 - 12:44).

Key Highlights from Google I/O 2026:

  • Gemini Omni & 3.5 Flash: Google introduced Gemini Omni for real-time multimodal interaction (video, text, audio) and Gemini 3.5 Flash, emphasizing superior throughput and speed (12:44 - 13:30, 18:41 - 20:00).
  • Agentic Operating Systems: The event showcased Anti-Gravity 2.0, a dedicated desktop application designed to orchestrate multiple AI agents in parallel to perform complex tasks like software development (33:00 - 34:26).
  • Gemini Spark: Positioned as an always-on personal agent, Spark handles background tasks like email drafting, RSVP tracking, and financial monitoring, utilizing dedicated virtual machines to save user time (39:50 - 41:30).
  • AI Search & Shopping: Google debuted an AI-powered search mode that changes the "shape of the rectangle" to provide interactive, agent-driven results, alongside a Universal Cart for cross-merchant shopping (48:30 - 55:10).
  • Science & Innovation: Gemini for Science was introduced to accelerate research, helping scientists generate hypotheses and simulate complex systems (1:15:00 - 1:18:43).

Note on Gemini Nano & Chrome: While the video focuses on the broader "agentic" transition at I/O, external context clarifies the role of Gemini Nano in the browser. Unlike the large, cloud-based Gemini models discussed in the video, Gemini Nano is an on-device model integrated directly into the Chrome browser (starting with Chrome 138). Its primary benefits include:

  • Privacy: Processes sensitive data locally, ensuring it never leaves the user's machine.
  • Performance: Eliminates network latency for tasks like text summarization, content rewriting, and language detection.
  • Security: Provides local, real-time scanning to detect phishing and scam sites.
  • Developer APIs: Enables web developers to build intelligent, chat-like features directly into their web applications using local compute.

AI "Thought Signatures"

 Thought Signatures  |  Gemini API  |  Google AI for Developers

Thought signatures are encrypted representations of the model's internal thought process and are used to preserve reasoning context across multi-step interactions. When using thinking models (such as the Gemini 3 and 2.5 series), the API may return a thoughtSignature field within the content parts of the response (e.g., text or functionCall parts).