Beyond AI Benchmarks
Golden data, custom criteria, and the competitive advantage hiding in your evaluation strategy - featuring Lake Merritt, an open-source platform putting AI quality control back in leadership's hands
Every board meeting about AI eventually seems to arrive at the same uncomfortable moment. After the presentations about efficiency gains and innovation potential, after the breathless vendor demos and the carefully rehearsed use cases, someone asks the question that stops everything cold: “But how do we know it actually works for us? For our specific needs, our standards, our risks?”
The silence that follows is expensive. Benchmarks prove competence in the abstract; your risks live in the specifics. Edge cases, specialized terminology, and unique constraints that define your work rarely appear in anyone else’s test suite. The gap between benchmark scores and your reality isn’t measured in percentage points; it’s measured in damaged client trust, regulatory scrutiny, and sleepless nights for the executives who signed off on the deployment.
This gap between promise and performance isn’t a technical glitch. It’s a governance challenge. And it reveals something profound about how we’ve been thinking about AI leadership entirely wrong.
The Blindspot in Every AI Playbook
Pick up any recent executive guide to AI transformation, whether from IBM, McKinsey, or the Big Four consultancies, and you’ll find sophisticated frameworks for governance, detailed roadmaps for implementation, and compelling visions of AI-powered futures. These books and reports get 90% of the story right. They correctly identify that leaders must move from being passive consumers of AI to active creators of AI value. They emphasize governance, skills development, and strategic alignment.
But they systematically omit the single most important mechanism for achieving these goals: how leaders translate their deep domain expertise, their understanding of what quality means in their specific context, into measurable, enforceable standards for AI systems.
This isn’t a minor oversight. It’s the difference between governance theater and actual control. Between hoping your AI behaves and knowing it will perform.
The authors of these guides aren’t ignorant. They simply focus on high-level strategy and treat evaluation as a technical implementation detail. But this reveals a fundamental misunderstanding of what evaluation actually is. It’s not quality assurance. It’s not testing. It’s the very act of encoding what your organization values into a form that can be measured, managed, and improved.
When a law firm defines what constitutes a properly researched legal memo, when an insurance company articulates what empathetic claim handling looks like, when a bank specifies acceptable risk thresholds, these aren’t technical specifications. They’re strategic decisions that define competitive advantage. And in the AI era, these decisions must be translated into what I call “evaluation-as-policy.”
The Non-Delegable Duty of Defining “Good”
Here’s what the playbooks miss: in an AI-transformed enterprise, defining what constitutes acceptable performance isn’t something leaders can delegate to their technical teams. It’s not something they can outsource to vendors. It’s a fundamental leadership responsibility as non-negotiable as setting strategy or managing risk.
Think about how you currently ensure quality in human work. You don’t just hire smart people and hope for the best. You provide clear expectations. You review work products. You give specific feedback. You know what good looks like because you’ve spent years developing that expertise.
The same expertise that allows you to recognize a well-crafted legal argument, a compelling marketing campaign, or a thorough risk assessment is exactly what’s needed to create meaningful AI evaluations. The only difference is that instead of reviewing work after the fact, you’re encoding your standards upfront in a form that can be systematically applied.
This is where the concept of “golden data” becomes critical. Golden data isn’t just training data or test data. It’s the carefully curated collection of examples that embody your organization’s definition of excellence. Each example is a concrete instantiation of your standards, your values, your risk tolerance.
Creating golden data isn’t a technical task, it’s a leadership function. When your general counsel reviews AI-generated legal summaries and annotates what’s acceptable and what’s not, she’s not doing QA. She’s encoding the firm’s legal standards into a strategic asset. When your head of customer service identifies model responses that perfectly capture your brand voice, he’s not just providing feedback. He’s building competitive advantage.
From Abstract Principles to Executable Standards
The challenge, of course, is that most leaders don’t know how to bridge the gap between their expertise and the technical requirements of AI evaluation. They can articulate what they want—“accurate legal citations,” “empathetic customer responses,” “comprehensive risk assessments”—but they don’t know how to make these concepts measurable and enforceable.
This is the murky void that exists in most organizations today. Everyone agrees that evaluation is important. Few understand how to actually do it. Even fewer realize that the solution doesn’t require technical expertise, it requires clear thinking about what matters to your business.
Let me make this concrete. Evaluation, at its core, follows a simple three-column pattern: input (what goes into the AI), output (what the AI produces), and expected output (what you wanted it to produce). This isn’t complicated. It’s exactly how you’d evaluate human work, just structured more systematically.
The power comes from how you assess the relationship between your system's actual output and the expected output. Sometimes you need exact matches—a legal citation must be precisely correct. Sometimes you need fuzzy matching—a customer service response should cover the right points even if the wording varies. And sometimes you need nuanced judgment—does this financial advice demonstrate appropriate fiduciary duty?
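To make the first two of these concrete for whoever implements your standards, here is a minimal sketch in Python; the function names and the 0.8 similarity threshold are illustrative choices, not a prescribed implementation.

```python
from difflib import SequenceMatcher

def exact_match(output: str, expected: str) -> bool:
    """Strict check: the output must match the expected text exactly
    (after trimming whitespace). Appropriate for legal citations,
    account numbers, or mandatory disclosure language."""
    return output.strip() == expected.strip()

def fuzzy_match(output: str, expected: str, threshold: float = 0.8) -> bool:
    """Lenient check: the output must be sufficiently similar to the
    expected text, tolerating wording differences. The threshold is a
    starting point you would tune against your own golden examples."""
    similarity = SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    return similarity >= threshold
```

Exact matching works when there is exactly one right answer; fuzzy matching tolerates phrasing differences while still anchoring to your golden example. The third case, nuanced judgment, calls for a different tool.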
This is where the concept of LLM-as-a-Judge becomes transformative. Instead of trying to codify every possible variant of acceptable output, you can articulate your standards in natural language—the same way you’d instruct a human employee—and use a language model to assess whether outputs meet those standards.
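As a hedged illustration, here is what a bare-bones LLM judge might look like using the OpenAI Python client; the rubric text and the model name are placeholders you would replace with your own standards and provider.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_RUBRIC = """You are evaluating a customer service response.
Score it 1-5 against these standards:
- Addresses the customer's actual question
- Uses our brand voice: warm, direct, no jargon
- Never promises a refund without citing the refund policy
Return only the score and one sentence of rationale."""

def judge(input_text: str, output_text: str, expected_output: str) -> str:
    """Ask a language model to grade the actual output against the
    expected output, using plain-language criteria a leader could write."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any capable model works
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": (
                f"Question: {input_text}\n"
                f"Ideal answer: {expected_output}\n"
                f"Actual answer: {output_text}"
            )},
        ],
    )
    return response.choices[0].message.content
```

The rubric is the strategic part; the code around it is boilerplate.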
If you can write a memo explaining what makes a good quarterly report, you can create evaluation criteria for AI-generated reports. If you can train a junior attorney on proper legal research, you can define standards for AI legal research. The skill you need isn’t programming. It’s the ability to articulate what you already know.
The Strategic Asset Nobody’s Talking About
Here’s what should keep executives up at night: while you’re treating evaluation as a technical afterthought, your competitors might be building it as a strategic asset. Because your evaluation criteria and golden datasets aren’t just test files. They’re the usable codification of your organizational knowledge, competitive insights, and strategic priorities.
Consider what goes into a sophisticated evaluation suite for a law firm’s AI systems. It contains examples of how to spot obscure jurisdictional issues that only experienced partners would catch. It embodies the firm’s approach to risk assessment that differentiates it from competitors. It captures the nuanced judgment calls that define the firm’s reputation.
This isn’t a generic capability that any firm could replicate. It’s proprietary intellectual property as valuable as any other strategic asset. Some evaluations—basic accuracy, general fairness—can and should be shared across industries. But your core evaluations, the ones that capture what makes your organization unique, are trade secrets.
The organizations that recognize this are doing something radical: they’re treating evaluation development as a C-suite responsibility. They’re running cross-functional workshops where legal, risk, product, and customer service leaders collaborate to define golden datasets. They’re version-controlling these assets like critical code. They’re measuring and reporting on evaluation coverage like any other strategic metric.
Making It Real: From Theory to Practice
At this point, you might be thinking, “This sounds important but impossibly complex.” Let me show you how wrong that assumption is. You can start meaningfully evaluating your AI systems this week with just a spreadsheet and clear thinking.
To see this principle in action, you can try it yourself in under two minutes using our open-source platform, Lake Merritt. Follow the first exercise in the Quick Start guide, a “60-Second Sanity Check.” You’ll simply create a spreadsheet with three columns: the input (the question you ask the AI), the output (the AI's actual response), and the expected_output (your definition of a perfect answer). When you run the evaluation, you’ll see how an “LLM-as-a-Judge” programmatically assesses the quality of the actual output against your ideal expected_output. Fiddle with it: change the content in the expected_output column and see how it impacts the evaluation scores. This simple, hands-on exercise will give you the concrete intuition needed to apply this process to your own business context.
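For illustration, a file as small as this is enough for that first sanity check (the rows below are invented examples; yours would come from a real use case):

```csv
input,output,expected_output
"What is our refund window?","Refunds are accepted within 30 days.","Refunds are accepted within 30 days of purchase, with a receipt, per our published policy."
"Summarize the termination clause in the Smith contract.","Either party may terminate with notice.","Either party may terminate on 60 days written notice; early-termination fees apply per Section 9."
```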
Begin with what I call a “10-row quick start.” Take ten representative examples from a real use case in your business. For each input, develop your own idea of what outputs you expect and why, and then have domain experts define their ideal outputs. Settle on an initial set of expected outputs. This is your initial golden dataset. Now run your AI system against these inputs and compare its outputs to your golden standard.
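For the team actually running that comparison, a sketch like the following is all it takes, assuming the golden dataset lives in a CSV like the one above and `ask_ai` is a placeholder for however you call your system:

```python
import csv
from difflib import SequenceMatcher

def ask_ai(prompt: str) -> str:
    """Placeholder: call your AI system here (API call, internal service, etc.)."""
    raise NotImplementedError

def run_quick_start(path: str = "golden_10_rows.csv", threshold: float = 0.8) -> None:
    """Run each golden input through the AI and compare the result to the
    expert-defined expected output. The similarity threshold is illustrative."""
    passes, total = 0, 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            actual = ask_ai(row["input"])
            score = SequenceMatcher(None, actual.lower(), row["expected_output"].lower()).ratio()
            passed = score >= threshold
            passes += passed
            total += 1
            print(f"{'PASS' if passed else 'FAIL'} ({score:.2f}): {row['input'][:60]}")
    print(f"{passes}/{total} rows met the golden standard")
```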
The results will be immediately illuminating. You’ll see patterns in where the AI struggles. You’ll identify edge cases you hadn’t considered. Most importantly, you’ll begin developing intuition for what kinds of standards are easy to meet and which require more sophistication.
As you develop confidence, you can scale this approach. The ten rows become a hundred, then a thousand. The simple comparisons evolve into sophisticated rubrics. The ad-hoc checks become systematic “evaluation packs”: version-controlled, repeatable test suites that can be run automatically before any AI system updates are deployed.
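Here is a hypothetical sketch of what such an automated gate might look like in a deployment pipeline; the pass-rate threshold and results format are illustrative, not a standard:

```python
import json
import sys

REQUIRED_PASS_RATE = 0.95  # illustrative; set per use case and risk appetite

def deployment_gate(results_path: str = "eval_results.json") -> None:
    """Block a release if the latest evaluation run falls below the
    pass rate leadership has signed off on."""
    with open(results_path, encoding="utf-8") as f:
        results = json.load(f)  # assumed format: list of {"passed": bool, ...}
    pass_rate = sum(r["passed"] for r in results) / len(results)
    if pass_rate < REQUIRED_PASS_RATE:
        sys.exit(f"Blocked: pass rate {pass_rate:.1%} is below {REQUIRED_PASS_RATE:.0%}")
    print(f"Cleared: pass rate {pass_rate:.1%}")

if __name__ == "__main__":
    deployment_gate()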
There’s an even more powerful approach that allows your leadership to encode their expertise more rapidly: learning from reality. This method allows your executives to shift from being authors to being editors, which is often a more efficient use of their time. Instead of trying to define perfect outputs upfront, have your key leaders and their most trusted senior experts (the same people who define your strategy) annotate actual AI outputs. They can mark what’s good, what’s problematic, and what’s unacceptable. These leadership-validated annotations then become core foundations for your evaluation system, ensuring it recognizes quality the same way you would.
To make this concrete: for a legal summary AI system, instead of asking your general counsel to write ten perfect legal summaries from scratch, you can present her with ten AI-generated summaries and have her annotate them, correcting a citation here, flagging a risk there. Those annotations, born from senior-level judgment, become the executable standards for your evaluation system.
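One hedged sketch of how such annotations could be captured and folded back into evaluation criteria; the field names and labels are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """A senior expert's verdict on one AI-generated output."""
    output_id: str
    verdict: str   # "good", "problematic", or "unacceptable"
    note: str      # e.g., "Cited a case that was later overturned"

def annotations_to_criteria(annotations: list[Annotation]) -> str:
    """Turn recurring expert objections into plain-language criteria
    that can be appended to an LLM-as-a-Judge rubric."""
    problems = [a.note for a in annotations if a.verdict != "good"]
    return "\n".join(f"- The output must NOT repeat this failure: {p}" for p in problems)
```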
This creates a virtuous cycle. Your AI systems generate outputs. Your experts review and annotate them. These annotations become evaluation criteria. The evaluations drive improvements. The improved systems generate better outputs. And the cycle continues, with each iteration encoding more of your organization’s expertise into measurable, manageable form.
The Agent Revolution Changes Everything
So far, I’ve focused on evaluating AI outputs, the text, analysis, or recommendations that AI systems produce. But the next generation of AI isn’t just generating content. It’s taking action. AI agents are making decisions, using tools, following processes, and interacting with other systems in complex workflows.
This fundamentally changes what evaluation means. It’s no longer sufficient to check if the final answer is correct. You need to evaluate the entire process. Did the agent use the right tools? Did it follow required procedures? Did it respect security boundaries? Did it escalate appropriately when uncertain?
Consider a legal research agent. The quality of its final memo matters, but so does its process. Did it search the right databases? Did it prioritize binding precedent appropriately? Did it verify that cited cases haven’t been overturned? These behavioral evaluations require a different approach, one that captures and analyzes the full trajectory of the agent’s actions.
This is where technical concepts like OpenTelemetry traces become essential. But don’t let the jargon intimidate you. A trace is simply a record of everything the agent did, every tool it called, every decision it made, every piece of data it accessed. Evaluating these traces means you can ensure not just that the agent reached the right conclusion, but that it got there the right way.
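Here is a minimal sketch of a behavioral check over such a trace, assuming the agent's spans have been exported to a JSON file and that span names like `tool.search_case_law` follow your own instrumentation conventions (they are hypothetical here):

```python
import json

REQUIRED_STEPS = [
    "tool.search_case_law",        # did it search the right databases?
    "tool.check_citation_status",  # did it verify cited cases weren't overturned?
]

def evaluate_trace(trace_path: str) -> list[str]:
    """Check an exported agent trace for required behaviors, not just
    the final answer. Returns the list of missing steps."""
    with open(trace_path, encoding="utf-8") as f:
        trace = json.load(f)  # assumed export format: {"spans": [{"name": ...}, ...]}
    seen = {span["name"] for span in trace.get("spans", [])}
    return [step for step in REQUIRED_STEPS if step not in seen]

if __name__ == "__main__":
    missing = evaluate_trace("legal_research_trace.json")
    print("Process check passed" if not missing else f"Missing required steps: {missing}")
```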
The implications are profound. In traditional software, you could separate business logic from implementation details. In agentic AI, the process IS the product. The way an agent conducts legal research, handles customer complaints, or analyzes risk isn’t just a means to an end—it’s a direct expression of your organizational values and standards.
Proof That This Works
These aren’t theoretical frameworks or academic exercises. Organizations are using these approaches today to solve real problems and prevent real failures.
Consider a challenge at the heart of AI governance: ensuring systems behave fairly and align with your company’s values. This isn't just a legal or regulatory checkbox; it's fundamental to brand safety, customer trust, and strategic alignment. A powerful example is the BBQ (Bias Benchmark for QA), a rigorous academic framework for detecting demographic bias. Using a tool like Lake Merritt, this top-tier public benchmark can be implemented as a reusable "evaluation pack" to systematically test your systems. To underscore its industry significance, BBQ was the sole fairness and bias benchmark OpenAI chose to use in its safety testing for GPT-5. This shows how you can move beyond theory to not just flag problems, but quantify them, track them over time, and ensure that fixes actually work.
This same approach of codifying standards applies to any area where deep, nuanced domain expertise is your competitive advantage. Rather than rely on generic public benchmarks like BBQ, however, the task is to develop your own measures that support and reflect your organization's priorities and imperatives. For instance, a financial services firm can move beyond generic compliance to evaluate its unique interpretation of "fiduciary duty." Such an evaluation might progress from basic, deterministic checks—like verifying required disclosures are present—to sophisticated, judgment-based assessments of whether advice truly serves a client’s best interests in a nuanced scenario.
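The deterministic end of that progression can be as simple as a string check; the disclosure phrases below are made-up examples, and the judgment-based assessments would layer on top via an LLM judge:

```python
REQUIRED_DISCLOSURES = [
    "past performance is not a guarantee of future results",
    "this is not personalized investment advice",
]

def missing_disclosures(advice_text: str) -> list[str]:
    """Return any required disclosure phrases absent from a piece of
    AI-generated advice. A deterministic check like this catches the
    easy failures before more nuanced evaluation begins."""
    text = advice_text.lower()
    return [phrase for phrase in REQUIRED_DISCLOSURES if phrase not in text]
```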
Crucially, these evaluations work because they are built by the domain experts who own the outcome, not by technicians. In the financial services scenario, this means the legal team defines disclosure requirements, compliance specifies risk scenarios, and customer advocates articulate what "client’s best interests" means in practice. But the principle is universal: for a marketing AI, the brand team would define what is "on-brand"; for a medical AI, clinicians would define a "safe diagnostic summary." The technical team's role is simply to implement these expert-defined standards as a systematic, repeatable process.
The Ecosystem of Evaluation
To demonstrate that these concepts aren’t just theory, I’ve built Lake Merritt, an open-source evaluation workbench that embodies these principles. I use Lake Merritt every day to evaluate my own AI apps and services, and have also utilized it effectively as part of Civics.com's professional consulting services, ensuring that my clients' AI products operate as expected. But let me be clear: Lake Merritt isn’t the point. The methodology is the point. Lake Merritt simply proves that the methodology works.
The platform does several things that matter. It provides a web interface simple enough that a lawyer or product manager can use it without training. It supports what I call the “Hold My Beer” workflow—where you can go from a vague idea about quality to a working evaluation in minutes. It treats evaluations as code, making them versionable, shareable, and systematic. It can evaluate not just outputs but entire agent workflows through OpenTelemetry trace analysis.
While I launched Lake Merritt this week because I think it’s valuable to have an easy-to-use evals tool that non-technical people can get started with, this software is just one option in a rich ecosystem of evaluation tools. Arize Phoenix provides powerful observability and monitoring capabilities. Galileo offers sophisticated analytics and agent debugging tools. Open-source projects like DeepEval and OpenAI Evals provide flexible frameworks for custom evaluations. LangWatch excels at specific use cases. Each serves different needs at different scales.
In the legal domain specifically, pioneers are emerging. Vals has published groundbreaking reports on legal AI evaluation. ScoreCard is working to standardize agent evaluations for legal use cases. Individuals like Ryan McDonough, a true global thought leader on AI and evals in law at KPMG, and newer voices like Anna Guo and her collaborators in Singapore, are openly sharing their learnings and pushing the field forward. Many, many others are starting to make strides as well.
This diversity is healthy and necessary. No single tool or approach will serve every need. What matters is that organizations develop the capability—through whatever tools make sense for them—to systematically evaluate their AI systems against their specific standards.
We’re now in the advanced planning stages of bringing this community together at an evaluation summit jointly hosted by Stanford and MIT. The goal isn’t to crown winning tools or approaches. It’s to share learnings, establish best practices, and accelerate the entire field’s development. To stay informed about that event, or if you have constructive and relevant work in the custom evaluations arena, please reach out here.
Your Path Forward
If you’ve read this far, you’re probably convinced that custom evaluation matters. The question is what to do about it. Let me give you a practical path forward that you can start this week.
First, identify your highest-risk AI use case. This is where evaluation matters most and where you’ll get immediate value from better oversight. Don’t try to boil the ocean. Pick one critical application and focus there.
Second, convene your domain experts. Bring together the people who truly understand what quality means for this use case. This isn’t a technical meeting, it’s a business meeting. The question on the table is simple: “What does good look like?”
Third, create your first golden dataset. Start small, even ten examples are enough to begin. For each example, capture the input and the ideal output. Have your experts explain why each output is ideal. These explanations become the seeds of your evaluation criteria.
Fourth, test your current AI system against this golden dataset. Don’t expect perfection. Expect illumination. You’ll immediately see patterns in where your system struggles and where it excels.
Fifth, iterate and expand. Add more examples. Refine your criteria. Develop more sophisticated evaluations. Move from manual checks to automated gates. Build evaluation into your deployment pipeline so that no AI update goes live without passing your standards.
This isn’t a technical project. It’s a governance initiative. It’s how you exercise real control over AI systems that are increasingly critical to your operations. It’s how you ensure that AI serves your strategic objectives rather than undermining them.
The Executive Imperative
We’re at an inflection point in how organizations create value with AI. The experimental phase is ending. The operational phase is beginning. And in this operational phase, the organizations that thrive won’t be those with the most sophisticated models or the largest datasets. They’ll be those that can most effectively translate their human expertise into AI capabilities.
This translation happens through evaluation. Not generic benchmarks or vendor-supplied metrics, but custom evaluations that embody your specific standards, values, and priorities. These evaluations aren’t a tax on innovation, they’re an accelerator for it. They allow you to move fast because you can move with confidence. They allow you to delegate to AI because you can verify performance. They allow you to differentiate because you can systematically improve what matters most to your business.
The choice facing every executive is stark. You can continue treating AI evaluation as a technical detail, hoping that your vendors and technical teams somehow divine what quality means for your organization. Or you can recognize that in the AI era, evaluation is the executive function, the mechanism through which leadership expertise shapes organizational outcomes.
Your AI strategy without custom evaluation isn’t a strategy. It’s expensive hope. And in a world where AI increasingly mediates critical business functions, hope is not a plan.
The boards that are asking “How do we know it works for us?” aren’t being paranoid. They’re being prescient. They understand that AI governance without custom evaluation is like financial governance without custom accounting standards, theoretically possible but practically meaningless.
The good news is that building evaluation capability doesn’t require massive investment or technical transformation. It requires clarity about what matters to your business and the discipline to measure it systematically. If you can articulate expectations to humans, you can create evaluations for AI. If you can recognize quality when you see it, you can encode that recognition into systematic assessment. Quite literally, that recognition just needs to be articulated in language to be usable as criteria in programmatic evals.
In the AI era, this isn’t optional. It’s existential. The organizations that master evaluation will shape AI to serve their purposes. Those that don’t will find themselves shaped by AI systems they don’t sufficiently control.
The question isn’t whether you’ll develop custom evaluation capabilities. It’s whether you’ll develop them before or after they become urgently necessary. Before or after your first AI crisis. Before or after your competitors use superior evaluation to deliver superior AI-powered services.
The time to start is now. Not because the technology demands it, but because leadership demands it. Because in a world where AI increasingly mediates how organizations create value, the ability to define and measure what “good” looks like isn’t just a technical capability.
It’s the executive function itself.

