Shipping RAG: The Quest for Quality

Companion to Part 1: What I Learned Building a RAG System on Real, Messy Data.

Where Part 1 Left Off

Part 1 was about getting PermitIQ to the point where it returned plausible answers across 60+ city municipal codes. Scraping. Chunking. Embedding. The data pipeline. The standard RAG (Retrieval-Augmented Generation) playbook with some changes I made along the way.

By the end of that article the system worked. I had deployed it, the URL resolved, and “build an ADU (Accessory Dwelling Unit) in Berkeley” came back with a sensible answer and cited sections.

That was the easy part.

This post is what happened after I started actually using it, what broke, what I changed, what I measured, and how much it cost me to find out.

Shortcomings of the Phase 1 Implementation

I had quietly assumed that if Oakland worked well, the other 59 cities would be in the same ballpark. They were not. The first time I ran a single question against every city in the system, more than a third of them returned garbage. That kicked off everything that follows.

I’ll walk through it in the order it happened: the audit that revealed the problem, the bugs I had to fix, the improvements I shipped, and the measurement framework I built so I could stop guessing.

The Quality Audit: One Question, 35 Cities

I wrote a short async script that hit /api/chat for every live city in the system with the same question:

“What permits do I need for a kitchen remodel?”

Then I dumped the answers into a single file and read all of them.

Fourteen of thirty-five cities returned garbage or nothing. Root causes, in roughly the order I uncovered them:

Empty Indexes: Several cities had been flipped to live status before their scrape job had actually run successfully.
Bot Protection: Some cities were behind bot-protection systems. The scraper got a 200 response with an “Access Denied, please complete the captcha” page in the body. The system happily indexed the rejection notice as if it were code.
Stale Seeds: City websites redesigned, URLs 404’d, and the embedding model dutifully encoded “Page Not Found” as a vector.

The takeaway: one doesn’t know RAG quality until it is tested systematically. A quality gate script should be a deployment step, not an afterthought.

The HNSW Filtered Query Bug

Most cities had not just thin data; they were returning zero results for every query. I assumed I had broken the retrieval pipeline. I had not. I had broken the index.

I had recently migrated the database from per-city schemas into one shared embeddings table with a city column. The advantage was one HNSW (Hierarchical Navigable Small World) index instead of 60. The bug was that the HNSW index now spanned all cities globally.

The query I was running looked like this:

SELECT ... FROM embeddings
WHERE city = 'houston'
ORDER BY embedding <=> $1::vector
LIMIT 12;

What HNSW actually does on that query: it walks the graph and returns the 12 globally nearest vectors. Then PostgreSQL applies the WHERE city = 'houston' filter to those 12 rows. The global nearest neighbors are dominated by Denver (178k rows) and Austin (104k rows), both of which have very dense embedding spaces. After the filter, one gets zero Houston rows. The user sees “No results found.”

The fix was a one-line GUC (Grand Unified Configuration parameter) introduced in pgvector 0.8.0:

SET hnsw.iterative_scan = relaxed_order;
SELECT ... FROM embeddings WHERE city = $1
ORDER BY embedding <=> $2 LIMIT 12;

iterative_scan = relaxed_order tells the planner to keep expanding the HNSW search until it accumulates enough rows that also satisfy the WHERE clause.

There was a second, smaller bug hiding behind the first: SET and SELECT need to run on the same database connection. My code was using two separate pool.query() calls, which were grabbing different pooled clients. The SET was effectively a no-op for the subsequent SELECT. Switching to a dedicated pool.connect() for the pair fixed it.

The System Prompt: Banning the Apology

Even cities with good data were producing answers like: “Unfortunately, the retrieved sections don’t specifically address…”

The model’s default is to hedge when uncertain. One has to explicitly ban the behavior.

“Do NOT open with disclaimers, apologies, or ‘unfortunately’. Lead directly with the answer.”
“First share everything the retrieved sections DO say, then add ONE brief closing note if something is genuinely missing.”

The texture of the answers changed immediately. Leading with substance matters.

“What Can I Ask?”: Coverage Before the First Question

A user lands on a thin-data city, asks a question, gets a bad answer, and loses trust in the entire system.

The fix: surface coverage information upfront. GET /api/coverage/[cityId] queries the database for chunk counts and a random sample of breadcrumb paths. A small LLM (Large Language Model) then summarizes these into 3 - 4 plain-English sentences. The honest framing of what the system doesn’t cover pre-empts the trust-killer.

Hybrid Search: BM25 + Dense Vectors + RRF

Pure vector search fails on exact-match queries like “Section 420.6” or “Title 17”. Semantic search is great for intent; it’s terrible for legal citations.

The fix is hybrid search:

Dense pass: HNSW cosine similarity.
Sparse pass: PostgreSQL full-text search (BM25 ranking).
Fusion: Reciprocal Rank Fusion (RRF), score = Σ 1/(rank + 60).

Rank fusion sidesteps the normalization problem between cosine distances and BM25 (Best Match 25) scores. I blend them at a 0.7/0.3 ratio - dense still dominates, but BM25 gets to “vote” for exact matches.

The database side was handled with a GIN (Generalized Inverted Index) built CONCURRENTLY to avoid table locks:

CREATE INDEX CONCURRENTLY embeddings_fts_idx
ON embeddings USING gin (to_tsvector('english', coalesce(text, '')));

Contextual Retrieval: Scaling Anthropic’s Technique

This was the single biggest quality lever in the post-launch period. The core idea (from Anthropic’s research) is that chunks in isolation lack context. A chunk saying “Maximum height is 18 feet” could be about fences, ADUs, or sheds. By prepending a context sentence to each chunk before embedding, one preserves its “place” in the legal hierarchy.

The Prompt Engineering

The “meat” of this technique is the prompt used to generate the context. It needs to be precise and descriptive. My enrichment prompt looks like this:

“Write a single sentence (max 80 words) that situates this chunk within the document. Include the section number or title, the topic, and the key requirement or condition it establishes. Output ONLY the sentence.”

This forces the model (Gemini 2.5 Flash Lite) to ignore the noise and focus on the legal identity of the chunk.

Engineering at Scale: 600k Chunks

Enriching a few chunks is easy. Enriching 600,000 chunks across 60 cities is a distributed systems problem.

Parallelism: I used a ThreadPoolExecutor with 15 concurrent workers. This hit the “sweet spot” where throughput was maximized without triggering the 429 rate limits of the Gemini API.
Checkpointing: Processing 600k chunks takes ~7 hours. If the script crashes at hour 6, one doesn’t want to start over. I implemented a pickle-based checkpointing system that saves progress every 500 chunks.
Cloud Run Jobs: To run this in production, I packaged the script into a Cloud Run Job. I sharded the work across 4 parallel tasks, each handling a subset of the cities. Total cost: ~$57 in Gemini Flash Lite calls plus pennies in compute.

Measuring the Lift

The results were immediate and measurable. On my Oakland test set, contextual retrieval provided a +10% lift in Faithfulness and a +5% lift in Context Precision.

The reason? When a user asks about “ADU height,” the embeddings for chunks enriched with “This section establishes height limits for Accessory Dwelling Units (ADUs)…” are now much closer to the query than raw text chunks that just say “Maximum height is 18 feet.”

RAGAS: Building a Real Evaluation Loop

Building a RAG system without evaluation is roughly equivalent to refactoring code without tests. It works until it doesn’t, and one has no way to know when “doesn’t” starts.

When I first shipped, my evaluation “loop” was me typing five questions into the UI. That doesn’t scale to 60 cities. I needed a systematic way to measure quality. I built an evaluator inspired by the RAGAS (RAG Assessment) framework, using the LLM-as-a-Judge pattern.

The Golden Dataset

I hand-curated a “Golden Dataset” of 26 questions that represent the real diversity of user intent in this domain:

Procedural: “How do I schedule a building inspection?” or “How do I get a demolition permit?”
Legal/Quantitative: “What is the maximum lot coverage allowed?” or “What are the height and setback requirements for a fence?”
Ambiguous/Multi-part: “What permits do I need for a kitchen remodel?” (requires building, electrical, and plumbing context).
Negative/Out-of-scope: “What is the best restaurant near city hall?” (Testing if the system correctly rejects non-permit questions).

Having a fixed set of questions is critical. It allows one to A/B test changes - like swapping an embedding model or tweaking a prompt - and see exactly how the numbers move.

The Metrics: Faithfulness & Context Precision

The evaluator measures two core metrics on a 0.0 to 1.0 scale:

Faithfulness (The Hallucination Guard) This measures if the answer is grounded only in the retrieved context. The judge (Gemini 2.5 Pro) extracts every factual claim from the answer and verifies it against the context.

def score_faithfulness(question, answer, contexts, judge):
    # Step 1: Extract claims
    claims_raw = _gen(judge, f"List each distinct factual claim in this answer: {answer}")
    claims = [c.strip() for c in claims_raw.splitlines() if c.strip()]
    
    # Step 2: Verify each claim against context
    supported = 0
    for claim in claims:
        verdict = _gen(judge, f"Context: {contexts}\nClaim: {claim}\nIs this supported? YES/NO")
        if verdict.upper().startswith("YES"):
            supported += 1
    
    return supported / len(claims)

Context Precision (The Retrieval Guard) This measures if your search is actually finding the right needles in the haystack. It uses a ranking metric to ensure the most relevant chunks are at the top of the list.

def score_context_precision(question, contexts, judge):
    relevance = []
    for ctx in contexts:
        verdict = _gen(judge, f"Question: {question}\nContext: {ctx}\nIs this relevant? YES/NO")
        relevance.append(1 if verdict.upper().startswith("YES") else 0)

    # Compute Average Precision over the ranked list
    total_relevant = sum(relevance)
    if total_relevant == 0: return 0.0

    score, running = 0.0, 0
    for k, rel in enumerate(relevance):
        if rel:
            running += 1
            score += running / (k + 1)
    return score / total_relevant

The “Thinking Token” Trap

I used Gemini 2.5 Pro as the judge. Initially, I set max_output_tokens=8 for the YES/NO judge call, assuming a one-word answer would be fast and cheap.

It wasn’t. Gemini Pro uses internal “thinking” tokens before producing output. Those tokens count against the limit. With a limit of 8, the thinking tokens consumed the entire budget, and the model returned an empty string. My parser saw an empty string, assumed “NO”, and my first eval run showed 0% quality across every city.

The fix: Bump the budget to max_output_tokens=256. One only pays for what is used, so the ceiling is free, and it gives the model room to “think” before it commits to a YES.

The 5-City Results

I ran the 26-question set against five representative cities (130 evals total).

City	Faithfulness	Context Precision
Oakland	0.347	0.513
Berkeley	0.417	0.423
San Francisco	0.487	0.622
Irvine	0.572	0.099
Denver	0.290	0.319
Average	0.423	0.395

The takeaway: Faithfulness is relatively stable (0.29 - 0.57), meaning the generator is behaving consistently. But Context Precision is the variable. Irvine (0.10) is a retrieval emergency. The scraper likely missed the breadcrumb structure, leaving the search blind.

RAGAS turned “it feels better” into a number one could track per deploy.

Claude -> Gemini in the Live App

Economics forced a migration from Claude Sonnet to Gemini 2.5 Flash. Cost dropped from $0.04 to $0.005 per chat turn - an 8x reduction.

However, Gemini surfaced Citation Stacking: citing 12 identical chunks for one rule. “Maximum height is 18 feet [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12].”

The fix:

Deduplicate chunks by exact text match before they reach the model.
Aggressive Prompting: “Never write [1][2][3]…[12]. That is noise.”

The Cost of Scaling: Transitioning to Local LLMs

While Gemini is 8x cheaper than Claude, a “hobby” project can still rack up a bill during a heavy evaluation run or a viral spike in traffic. If one is looking to cap spend, the next logical step is to bring the execution local.

Local Options

In 2026, one does not need a massive server farm to run high-quality models. There are two main paths:

Ollama (Development & Prototyping): The “Docker for LLMs.” It is the easiest way to run models like Llama 4 or Qwen locally. It handles the quantization (compressing the model) and provides a simple local API.
vLLM (Production & Throughput): If one wants to serve multiple users at once, vLLM is the gold standard. It uses “PagedAttention” to handle concurrent requests much more efficiently than a standard setup.

Hardware: VRAM is King

The cost of “free” local execution is the upfront hardware investment.

The Budget Build: An RTX 3060 12GB (~$250 used) can run 8B-parameter models comfortably.
The Sweet Spot: An RTX 5060 Ti 16GB (~$500) can handle 14B-20B models, which are often the “sweet spot” for reasoning tasks.
The Apple Alternative: A Mac M4 Pro with 64GB of RAM is the best value for running massive 70B models, as its unified memory allows the GPU to use the entire system RAM.

Do I still need Google Vertex AI?

Strictly speaking, no. One can run a fully local RAG stack:

LLM: Run Llama 3 via Ollama locally.
Embeddings: Use an open-source model like BGE-M3 or nomic-embed-text locally instead of Vertex AI.
Database: Keep using Neon (PostgreSQL) for your vector store. Neon’s free/hobby tier is generous, and one only pays for storage and compute when the DB is “awake.”

The trade-off is maintenance vs. cost. Vertex AI is a managed service - it is always there, it scales, and one does not have to worry about a local power bill or GPU cooling. But for a heavy user, a $1,500 PC pays for itself in roughly 6 months of API savings.

Hardware Budget (June 2026)

If one is ready to make the jump, here is a cost-effective hardware recipe for high-performance local RAG:

Component	Budget Choice	Cost
GPU	Used RTX 3060 12GB	$200
PC	Used Dell OptiPlex 7080 MT	$130
PSU	New/Used 550W PSU	$60
Misc	Power Adapter / Shipping	$20
TOTAL		$410

Where to shop (June 2026):

eBay: The most reliable source for used GPUs. Look for sellers with high ratings and original packaging if possible.
Back Market / VIPOutlet: Excellent for finding “base” business desktops like the OptiPlex with a warranty.
FB Marketplace: Best locally for deals on gaming PCs being sold without a GPU by users who just upgraded.

The ROI Verdict: For an upfront investment of ~$410, one can eliminate the ~$20/month recurring hosting and API bill. This setup pays for itself in roughly 20 months. More importantly, one gains “instant” response times and the freedom to run 1,000 evaluations a day without checking a credit balance.

What’s Next

Summer Vacation: Taking a well-earned break before the next phase.
DIY LLM/Postgres System: Building the actual hardware and migrating the entire stack to a local, air-gapped environment.
Temporal Versioning: When was this section last updated?
Entity Extraction: Turning ordinance numbers and fee amounts into metadata filters.
Continuous Eval in CI (Continuous Integration): Catching regressions at PR (Pull Request) time.

Musings: The Engineering is in the Wrappers

In Part 1, I noted that the LLM is rarely the bottleneck. Phase 2 proved it. Everything I shipped - HNSW fixes, hybrid search, contextual embeddings, RAGAS - was a data-or-systems problem.

There is a temptation to attribute AI quality to the model itself. But the model is the most stable component. The real engineering happens in the “wrappers”: the chunker, the retriever, the database, and the evaluation loop.

Phase 1 was about investigating if it could be done. Phase 2 was about investigating if it could be engineered.

About This Project

PermitIQ was built on my own time. Total spend: $100 - 150, mostly on embeddings and evaluation. Storage and serving costs remain negligible.

Thanks for reading.