Knowledge Curation: The Hard Part Isn't Getting Data In

2026-03-08 · 5 min read · Janaina Maia

When I started building the knowledge base for Urbix, my instinct was to get everything in.

Every planning scheme. Every council policy document. Every engineering standard I could find. The more data, the smarter the AI. That seemed logical.

It was wrong.

Six months in, I had an AI that could confidently discuss planning policy from three different states simultaneously, sometimes within a single answer, without distinguishing between them. It would blend a Queensland planning scheme with New South Wales legislation and present the result as applicable to a Brisbane project.

The data was all technically correct. The curation was a disaster.

The Hard Part Is Deciding What to Leave Out

Anyone can dump documents into a vector database. It takes an afternoon and a reasonable internet connection. The hard part is deciding what to leave out.

This skill doesn't get discussed much in AI product circles, probably because it doesn't feel like a technical problem. It isn't one. It is an editorial problem, and most AI builders are not thinking like editors.

An editor's job is to serve the reader by removing everything that doesn't serve them. An AI knowledge curator's job is the same: serve the user by removing everything the AI doesn't need to answer their questions accurately.

More is not better. More is usually worse.

Garbage In, Confident Garbage Out

The dangerous thing about poorly curated AI is that it doesn't give you garbage answers that look like garbage. It gives you garbage answers that look like confident, well-structured, professionally appropriate responses.

A planning AI drawing on mixed or outdated documents won't say it is confused. It will cite specific clauses from the wrong version of a policy, in fluent professional language, with no indication that anything is amiss.

This is worse than no answer. A user who gets no answer knows to look elsewhere. A user who gets a confident wrong answer may not question it until the consequences show up.

The quality of your AI's answers is determined more by what you put in the knowledge base than by what model you use or how clever your prompts are. This is not a popular thing to say in rooms full of engineers excited about model comparisons. It is true nonetheless.

How I Think About Knowledge Curation Now

I think about it in three passes.

Pass one: relevance. Does this document contain information that users will actually ask about? If the answer is maybe, the document doesn't go in. Ambiguous relevance creates ambiguous answers.

Pass two: accuracy and recency. Is this the current version? Has it been superseded? Are there known errors? Outdated documents are often worse than no documents because the AI will present old rules as current without any flag that they have changed.

Pass three: jurisdiction and scope. Who is this for? In urban planning, a New South Wales planning document is not relevant to a Queensland project. Mixing jurisdictions creates exactly the category error I described earlier. Each agent in Urbix has a tightly defined document scope.
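
The three passes can be sketched as a simple gate that a candidate document has to clear before ingestion. This is a minimal illustration in Python, not Urbix's actual code: the Candidate fields, the "yes"/"maybe"/"no" relevance labels, and the jurisdiction codes are all assumptions made for the example, and the relevance judgment itself is assumed to come from a human curator.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    """A document being considered for the knowledge base (hypothetical schema)."""
    title: str
    jurisdiction: str        # e.g. "QLD" or "NSW" (illustrative codes)
    relevance: str           # "yes", "maybe", or "no", assigned by a curator
    superseded: bool         # True if a newer version exists
    known_errors: bool = False


def passes_curation(doc: Candidate, target_jurisdiction: str) -> tuple[bool, str]:
    """Apply the three passes in order and return (accepted, reason)."""
    # Pass one: relevance. A "maybe" is treated the same as a "no".
    if doc.relevance != "yes":
        return False, "relevance"
    # Pass two: accuracy and recency.
    if doc.superseded or doc.known_errors:
        return False, "accuracy/recency"
    # Pass three: jurisdiction and scope.
    if doc.jurisdiction != target_jurisdiction:
        return False, "jurisdiction"
    return True, "accepted"
```

Returning the reason alongside the decision keeps a record of why each document was rejected, which is useful when the same document comes up for consideration again.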

The Library vs. The Briefing Room

I use this mental model when explaining knowledge curation to people building AI products.

A library stores everything. It serves people who know what they are looking for and can evaluate sources themselves. It is not curated; it is comprehensive.

A briefing room contains only what the person in the room needs to know right now. Every document in it was chosen because it belongs there.

Most AI knowledge bases are built like libraries. They should be built like briefing rooms.

Your user is not a researcher with unlimited time to evaluate contradictory sources. They are a professional with a specific question and a deadline. They need the right answer, not all the answers.

What to Actually Remove

In building Urbix, I removed more than I kept. Here is what I consistently cut:

Superseded policy versions. If there is a 2024 version, the 2019 version goes out. The AI will not reliably distinguish between them without explicit help, and that help is not worth the overhead.
Documents from outside the jurisdiction being served. Councils are different. Policies are different. Mixing them creates plausible-sounding errors.
Generic guidance documents that repeat what is in primary sources. These dilute the signal. The AI interprets them as additional authoritative sources when they are actually just commentary.
Marketing material and public communications from councils. These are written for general audiences, not professional use.
Draft and consultation documents, unless they are specifically relevant to current applications.
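
The cuts above can be expressed as a pruning step. This is a rough sketch under stated assumptions: the Doc fields and the "kind" labels are hypothetical, and the exception for drafts that are relevant to current applications is deliberately left out, since that call needs a human.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    """Hypothetical record for one document in the candidate pool."""
    policy_id: str      # groups versions of the same policy together
    year: int           # version year, used to find the newest version
    jurisdiction: str   # e.g. "QLD" or "NSW" (illustrative codes)
    kind: str           # "primary", "guidance", "marketing", or "draft"


# Kinds that are cut outright in this sketch (assumed labels).
EXCLUDED_KINDS = {"guidance", "marketing", "draft"}


def prune(docs, jurisdiction):
    """Keep only the newest in-jurisdiction primary version of each policy."""
    newest = {}
    for d in docs:
        # Cut out-of-jurisdiction documents and non-primary kinds.
        if d.jurisdiction != jurisdiction or d.kind in EXCLUDED_KINDS:
            continue
        # Cut superseded versions: keep only the latest year per policy.
        current = newest.get(d.policy_id)
        if current is None or d.year > current.year:
            newest[d.policy_id] = d
    return sorted(newest.values(), key=lambda d: d.policy_id)
```

With a 2019 and a 2024 version of the same Queensland scheme, a New South Wales document, and a council flyer in the pool, only the 2024 scheme survives a prune for "QLD".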

Curation Is Not a One-Time Event

Planning policy changes. Councils adopt new codes. State governments update legislation. An AI knowledge base that was accurate at launch will drift toward inaccuracy without ongoing curation.

I set a review cycle for every document in Urbix's knowledge base. Each one has a metadata record with the source, the version, the date of ingestion, and a scheduled review date. When councils update their policies, we remove the old version and ingest the new one.
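
A metadata record like the one described could look something like this. The four fields come from the description above; the class name, field names, and the due_for_review helper are assumptions for the sake of the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentRecord:
    """Metadata kept alongside each ingested document (hypothetical names)."""
    source: str         # where the document came from
    version: str        # version label as published by the source
    ingested_on: date   # date of ingestion
    review_on: date     # scheduled review date


def due_for_review(records, today):
    """Return the records whose scheduled review date has arrived or passed."""
    return [r for r in records if r.review_on <= today]
```

Running due_for_review on a schedule, with date.today() as the second argument, produces the list of documents someone needs to check against their source.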

This is not glamorous work. It is not the kind of thing that makes a good demo. It is the kind of thing that determines whether a professional can trust your product a year after launch.

The Editorial Mindset

If you are building an AI product in a specialized domain, you are an editor whether you think of yourself that way or not. Every document you include is an editorial decision. Every document you exclude is too.

The quality of those decisions is a larger factor in your product's reliability than most teams want to acknowledge, because it means the solution is not purely technical. It requires judgment, domain knowledge, and ongoing maintenance. Build like an editor. Your users will notice the difference, even if they can't name it.