Failure-First Testing: Stop Testing What Goes Right
2026-03-08 · 5 min read · Janaina Maia
When I built the first version of Urbix's AI, I tested it the way most teams test AI. I gave it questions it should be able to answer. It answered them. I felt good about this.
It took three weeks in production before a user found a failure I had completely missed.
The AI had been asked about setback requirements for a development type that our knowledge base didn't cover well. Instead of saying it didn't have that information, it blended adjacent rules and produced an answer that sounded authoritative and was completely wrong. The user, a professional planner, caught it before it caused real harm. But only barely.
I had tested for correct. I had never tested for confidently wrong.
Everyone Tests the Happy Path
This is the most common testing failure in AI product development. Teams build an evaluation set full of questions the AI should be able to answer, confirm that it answers them correctly, and ship.
That is testing the happy path. It tells you the AI can do what you designed it to do. It tells you almost nothing about what the AI does when it encounters edge cases, ambiguous questions, deliberate traps, or questions that fall just outside its knowledge.
Those are the cases that matter most in professional use. A planner using Urbix is not going to ask it easy questions with obvious answers. They ask the hard questions, the edge cases, the situations where they aren't sure what the rule is and need the AI to either know or admit it doesn't.
Testing only for correct means you are evaluating performance in the best case, not the typical case.
Build Trick-Question Test Suites
After the production failure I described, I rebuilt our test suite with a completely different philosophy. The new test suite has three categories, and only one of them tests for correct answers.
Category 1: Standard questions with known answers. Questions the agent is designed to handle, with correct answers verified by domain experts. These are the happy path questions. They still belong in the test suite, but they are not the primary evaluation.
Category 2: Questions the agent should refuse. Questions outside the agent's knowledge or scope. In Urbix, this includes questions about planning policy in jurisdictions we haven't built knowledge bases for, questions that require professional judgment rather than information retrieval, and questions about recent changes that haven't been ingested yet. The test here is not whether the AI gets the right answer. The test is whether it refuses to answer and does so gracefully.
Category 3: Trick questions and plausible confusions. Questions designed to surface the specific failure modes most likely to cause harm. These include questions where two similar-sounding rules have different applications, questions that blend two different jurisdictions, and questions where common professional assumptions conflict with specific policy text. These are the questions that broke us in production. Now they're in the test suite.
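One way to make the three categories concrete is to tag every test case with the behavior it expects. The sketch below is a minimal illustration, not Urbix's actual suite: the names `Category`, `SuiteCase`, and `count_by_category` are hypothetical, and the questions are placeholders that a domain expert would replace with real ones.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    STANDARD = "standard"          # known answer, verified by a domain expert
    OUT_OF_SCOPE = "out_of_scope"  # the agent should refuse
    TRICK = "trick"                # plausible confusion likely to cause harm


@dataclass(frozen=True)
class SuiteCase:
    question: str
    category: Category
    expected_answer: Optional[str] = None  # None when a refusal is expected

    @property
    def should_refuse(self) -> bool:
        # Only out-of-scope cases expect a refusal; trick questions
        # still have a correct answer the agent is expected to find.
        return self.category is Category.OUT_OF_SCOPE


# Placeholder cases for illustration only (assumption: real cases
# are written and verified by domain experts).
SUITE = [
    SuiteCase("<standard question>", Category.STANDARD, "<verified answer>"),
    SuiteCase("<question about an uncovered jurisdiction>", Category.OUT_OF_SCOPE),
    SuiteCase("<question blending two jurisdictions>", Category.TRICK, "<verified answer>"),
]


def count_by_category(suite):
    """Return how many cases the suite holds in each category."""
    return Counter(case.category for case in suite)
```

Keeping the expected behavior on the case itself means the grader never has to guess whether a refusal was the right outcome.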
Questions Your AI Should Refuse to Answer
This is harder to implement than it sounds. AI systems are trained to be helpful, which means they tend to attempt every question even when they shouldn't. Getting even a well-prompted AI to reliably refuse out-of-scope questions requires deliberate design.
For Urbix, we were explicit in the system prompts about scope. We gave the AI specific language for refusals: "this falls outside my knowledge base," "I recommend consulting the relevant council directly," "this question requires professional judgment beyond information retrieval."
Then we tested those refusals. We threw out-of-scope questions at the agent and checked whether it refused appropriately or attempted to answer anyway. We measured the refusal rate and the accuracy of refusal decisions. A refusal where the agent should have known the answer is a miss. A refusal where the agent correctly identified an out-of-scope question is a success. Both count.
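The refusal check described above can be sketched roughly as follows. This assumes refusals are detected by matching the refusal phrases from the system prompt, which is a simplification; a real setup might rely on a classifier or a structured output field instead. `REFUSAL_MARKERS`, `is_refusal`, and `score_refusals` are hypothetical names.

```python
# Assumption: the agent uses the refusal language given in its system prompt,
# so a substring match is enough to detect a refusal.
REFUSAL_MARKERS = (
    "falls outside my knowledge base",
    "consulting the relevant council",
    "requires professional judgment",
)


def is_refusal(response):
    """Return True if the response contains any known refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def score_refusals(results):
    """Tally refusal decisions.

    results: iterable of (should_refuse, response) pairs.
    """
    counts = {
        "correct_refusal": 0,  # out of scope, and the agent refused
        "missed_refusal": 0,   # out of scope, but the agent answered anyway
        "over_refusal": 0,     # answerable, but the agent refused
        "answered": 0,         # answerable, and the agent attempted an answer
    }
    for should_refuse, response in results:
        refused = is_refusal(response)
        if should_refuse and refused:
            counts["correct_refusal"] += 1
        elif should_refuse:
            counts["missed_refusal"] += 1
        elif refused:
            counts["over_refusal"] += 1
        else:
            counts["answered"] += 1
    return counts
```

Tracking over-refusals separately from missed refusals keeps an agent from scoring well simply by refusing everything.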
The Red Team Mindset
The most useful thing I did was ask a domain expert to try to break the AI.
Not test it. Break it. Give it the questions they would never normally ask a professional tool because the questions were too ambiguous, too edge-case, too likely to produce a wrong answer.
This exercise surfaced failure modes I hadn't imagined. Combinations of conditions that produced confident wrong answers. Question phrasings that confused the agent about which jurisdiction applied. Requests for information the agent had only in part.
Every failure in that red team session went into the test suite. Now they are part of every evaluation run.
How to Structure the Test Suite
A practical structure for AI product test suites:
- 40% standard questions with known correct answers
- 30% out-of-scope questions that should be refused
- 30% trick questions and plausible failure modes
Track three metrics: accuracy on answerable questions, refusal rate on out-of-scope questions, and false confidence rate on trick questions. That last one is the number that matters most. It tells you how often the AI is confidently wrong.
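A minimal sketch of how the three metrics might be computed from graded results. It assumes each result has already been graded by a human or an automated checker, and it defines a "false confidence" case as a trick question the agent answered (did not refuse) and got wrong; that definition, along with the names `Result` and `summarize`, is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    category: str  # "standard", "out_of_scope", or "trick"
    correct: bool  # graded against the expert-verified expectation
    refused: bool  # whether the agent declined to answer


def rate(numerator, denominator):
    """Safe ratio that returns 0.0 for an empty category."""
    return numerator / denominator if denominator else 0.0


def summarize(results):
    standard = [r for r in results if r.category == "standard"]
    out_of_scope = [r for r in results if r.category == "out_of_scope"]
    trick = [r for r in results if r.category == "trick"]
    # Confidently wrong: the agent gave an answer and the answer was wrong.
    confidently_wrong = [r for r in trick if not r.correct and not r.refused]
    return {
        "accuracy": rate(sum(r.correct for r in standard), len(standard)),
        "refusal_rate": rate(sum(r.refused for r in out_of_scope), len(out_of_scope)),
        "false_confidence_rate": rate(len(confidently_wrong), len(trick)),
    }
```

Reporting the three numbers separately matters: a single blended score would let strong happy-path accuracy hide a high false confidence rate.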
Testing Is Never Done
Every production failure goes back into the test suite. Every user correction gets evaluated for whether it represents a new failure mode we hadn't tested for. The test suite grows with the product.
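Feeding failures back into the suite can be as simple as appending a new case and skipping duplicates. The sketch below keeps the suite as an in-memory list of dictionaries; the function name `add_regression_case`, the field names, and the normalization rule are all hypothetical choices, not a description of how Urbix stores its tests.

```python
def normalize(question):
    """Lowercase and collapse whitespace so near-identical phrasings match."""
    return " ".join(question.lower().split())


def add_regression_case(suite, question, category, source):
    """Append a failure to the suite unless the same question is already there.

    suite: list of dicts with "question", "category", and "source" keys.
    Returns True if the case was added, False if it was a duplicate.
    """
    seen = {normalize(case["question"]) for case in suite}
    if normalize(question) in seen:
        return False
    suite.append({"question": question, "category": category, "source": source})
    return True
```

Recording the source (for example "production" or "red team") makes it possible to see later where the suite's hardest cases came from.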
This is not a comforting way to think about testing. It means the work is never finished. But it is the honest way to think about it. A professional tool operating in a complex domain will encounter failure modes you didn't anticipate. The only question is whether you discover them in testing or in production.
Test for wrong. It is the only way to build something you can actually trust.