Subodh Nijsure

Shipping RAG: The Quest for Quality

2026-06-03T00:00:00+00:00

Companion to Part 1: What I Learned Building a RAG System on Real, Messy Data.

Where Part 1 Left Off

Part 1 was about getting PermitIQ to the point where it returned plausible answers across 60+ city municipal codes. Scraping. Chunking. Embedding. The data pipeline. The standard RAG (Retrieval-Augmented Generation) playbook with some changes I made along the way.

By the end of that article the system worked. I had deployed it, the URL resolved, and “build an ADU (Accessory Dwelling Unit) in Berkeley” came back with a sensible answer and cited sections.

That was the easy part.

This post is about what happened after I started actually using it: what broke, what I changed, what I measured, and how much it cost me to find out.

Shortcomings of the Phase 1 Implementation

I had quietly assumed that if Oakland worked well, the other 59 cities would be in the same ballpark. They were not. The first time I ran a single question against every city in the system, more than a third of them returned garbage. That kicked off everything that follows.

I’ll walk through it in the order it happened: the audit that revealed the problem, the bugs I had to fix, the improvements I shipped, and the measurement framework I built so I could stop guessing.

The Quality Audit: One Question, 35 Cities

I wrote a short async script that hit /api/chat for every live city in the system with the same question:

“What permits do I need for a kitchen remodel?”

Then I dumped the answers into a single file and read all of them.

Fourteen of thirty-five cities returned garbage or nothing. Root causes, in roughly the order I uncovered them:

Empty Indexes: Several cities had been flipped to live status before their scrape job had actually run successfully.
Bot Protection: Some cities were behind bot-protection systems. The scraper got a 200 response with an “Access Denied, please complete the captcha” page in the body. The system happily indexed the rejection notice as if it were code.
Stale Seeds: City websites redesigned, URLs 404’d, and the embedding model dutifully encoded “Page Not Found” as a vector.

The takeaway: one doesn’t know RAG quality until it is tested systematically. A quality gate script should be a deployment step, not an afterthought.
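A quality gate of this kind can start very small. The following is a minimal sketch, not the audit script itself: the marker phrases, the 200-character floor, and the name looks_like_junk are assumptions chosen for illustration.

```python
# Hypothetical quality gate: flag scraped pages (or answers) that are empty
# or that look like an error page rather than municipal code.
JUNK_MARKERS = (  # assumed phrases; tune these per corpus
    "access denied",
    "complete the captcha",
    "page not found",
)


def looks_like_junk(text: str, min_chars: int = 200) -> bool:
    """Return True when text is too short or contains a known rejection phrase."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < min_chars:
        return True
    return any(marker in normalized for marker in JUNK_MARKERS)
```

Running a check like this before a city is flipped to live would catch all three failure modes above: empty indexes, indexed captcha pages, and indexed 404 pages.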

The HNSW Filtered Query Bug

Most cities did not just have thin data; they were returning zero results for every query. I assumed I had broken the retrieval pipeline. I had not. I had broken the index.

I had recently migrated the database from per-city schemas into one shared embeddings table with a city column. The advantage was one HNSW (Hierarchical Navigable Small World) index instead of 60. The bug was that the HNSW index now spanned all cities globally.

The query I was running looked like this:

SELECT ... FROM embeddings
WHERE city = 'houston'
ORDER BY embedding <=> $1::vector
LIMIT 12;

What HNSW actually does on that query: it walks the graph and collects a fixed-size set of globally nearest candidates (bounded by hnsw.ef_search, which defaults to 40). Only then does PostgreSQL apply the WHERE city = 'houston' filter to those candidates. The global nearest neighbors are dominated by Denver (178k rows) and Austin (104k rows), both of which have very dense embedding spaces. After the filter, zero Houston rows remain. The user sees “No results found.”

The fix was a one-line GUC (Grand Unified Configuration parameter) introduced in pgvector 0.8.0:

SET hnsw.iterative_scan = relaxed_order;
SELECT ... FROM embeddings WHERE city = $1
ORDER BY embedding <=> $2 LIMIT 12;

iterative_scan = relaxed_order tells pgvector to keep scanning the HNSW index until enough rows satisfy the WHERE clause or a scan limit is reached. The trade-off is that results may come back slightly out of distance order.
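The difference between filtering after a fixed-size scan and scanning iteratively can be shown with a toy model. This is a sketch only: it uses a brute-force sort over one-dimensional points rather than an HNSW graph, and the function names are hypothetical, not pgvector internals.

```python
# Toy model of post-filtering vs. iterative scanning. This is NOT pgvector's
# implementation: the "index" is a brute-force sort over 1-D points.
def nearest(rows, query, limit):
    """Return the `limit` rows closest to `query`, ignoring any filter."""
    return sorted(rows, key=lambda r: abs(r["x"] - query))[:limit]


def post_filter_search(rows, query, city, limit):
    """Fixed-size scan first, then filter: matching rows can all be lost."""
    return [r for r in nearest(rows, query, limit) if r["city"] == city]


def iterative_search(rows, query, city, limit):
    """Keep widening the scan until `limit` matching rows are found."""
    scan_size = limit
    while True:
        hits = [r for r in nearest(rows, query, scan_size) if r["city"] == city]
        if len(hits) >= limit or scan_size >= len(rows):
            return hits[:limit]
        scan_size *= 2


# 100 Denver rows sit closer to the query than the 5 Houston rows.
rows = [{"city": "denver", "x": i * 0.01} for i in range(100)]
rows += [{"city": "houston", "x": 5.0 + i} for i in range(5)]

print(len(post_filter_search(rows, 0.0, "houston", 3)))  # 0
print(len(iterative_search(rows, 0.0, "houston", 3)))    # 3
```

The first search returns nothing because every candidate in the fixed-size scan belongs to the larger city, which is the same shape as the Houston failure.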

There was a second, smaller bug hiding behind the first: SET and SELECT need to run on the same database connection. My code was using two separate pool.query() calls, which were grabbing different pooled clients. The SET was effectively a no-op for the subsequent SELECT. Switching to a dedicated pool.connect() for the pair fixed it.

The System Prompt: Banning the Apology

Even cities with good data were producing answers like: “Unfortunately, the retrieved sections don’t specifically address…”

The model’s default is to hedge when uncertain. One has to explicitly ban the behavior.

“Do NOT open with disclaimers, apologies, or ‘unfortunately’. Lead directly with the answer.”
“First share everything the retrieved sections DO say, then add ONE brief closing note if something is genuinely missing.”

The texture of the answers changed immediately. Leading with substance matters.

“What Can I Ask?”: Coverage Before the First Question

A user lands on a thin-data city, asks a question, gets a bad answer, and loses trust in the entire system.

The fix: surface coverage information upfront. GET /api/coverage/[cityId] queries the database for chunk counts and a random sample of breadcrumb paths. A small LLM (Large Language Model) then summarizes these into 3 to 4 plain-English sentences. Honestly framing what the system doesn’t cover pre-empts the trust-killer.

Hybrid Search: BM25 + Dense Vectors + RRF

Pure vector search fails on exact-match queries like “Section 420.6” or “Title 17”. Semantic search is great for intent; it’s terrible for legal citations.

The fix is hybrid search:

Dense pass: HNSW cosine similarity.
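Sparse pass: PostgreSQL full-text search as the BM25 (Best Match 25) style keyword signal. PostgreSQL's built-in ts_rank is not a true BM25 implementation, but only the rank order feeds the fusion step.
Fusion: Reciprocal Rank Fusion (RRF), score = Σ 1/(rank + 60).

Rank fusion sidesteps the normalization problem between cosine distances and keyword scores, because only rank positions are combined. I blend the two lists at a 0.7/0.3 ratio - dense still dominates, but the keyword pass gets to “vote” for exact matches.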
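A weighted form of RRF can be sketched as follows. How the 0.7/0.3 weights enter the formula is an assumption here (each list's reciprocal-rank term is multiplied by its weight), and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(dense_ids, sparse_ids, w_dense=0.7, w_sparse=0.3, k=60):
    """Fuse two ranked ID lists; ranks start at 1 and k=60 damps the top ranks."""
    scores = defaultdict(float)
    for weight, ranking in ((w_dense, dense_ids), (w_sparse, sparse_ids)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


fused = weighted_rrf(["a", "b", "c"], ["c", "d"])
print(fused)  # ['c', 'a', 'b', 'd']
```

Chunk "c" is only third in the dense list, but its top position in the keyword list lifts it to first place overall. That is the intended effect for an exact citation such as “Section 420.6”.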

The database side was handled with a GIN (Generalized Inverted Index) built CONCURRENTLY to avoid table locks:

CREATE INDEX CONCURRENTLY embeddings_fts_idx
ON embeddings USING gin (to_tsvector('english', coalesce(text, '')));

Contextual Retrieval: Scaling Anthropic’s Technique

This was the single biggest quality lever in the post-launch period. The core idea (from Anthropic’s research) is that chunks in isolation lack context. A chunk saying “Maximum height is 18 feet” could be about fences, ADUs, or sheds. By prepending a context sentence to each chunk before embedding, one preserves its “place” in the legal hierarchy.

The Prompt Engineering

The “meat” of this technique is the prompt used to generate the context. It needs to be precise and descriptive. My enrichment prompt looks like this:

“Write a single sentence (max 80 words) that situates this chunk within the document. Include the section number or title, the topic, and the key requirement or condition it establishes. Output ONLY the sentence.”

This forces the model (Gemini 2.5 Flash Lite) to ignore the noise and focus on the legal identity of the chunk.

Engineering at Scale: 600k Chunks

Enriching a few chunks is easy. Enriching 600,000 chunks across 60 cities is a distributed systems problem.

Parallelism: I used a ThreadPoolExecutor with 15 concurrent workers. This hit the “sweet spot” where throughput was maximized without triggering the 429 rate limits of the Gemini API.
Checkpointing: Processing 600k chunks takes ~7 hours. If the script crashes at hour 6, one doesn’t want to start over. I implemented a pickle-based checkpointing system that saves progress every 500 chunks.
Cloud Run Jobs: To run this in production, I packaged the script into a Cloud Run Job. I sharded the work across 4 parallel tasks, each handling a subset of the cities. Total cost: ~$57 in Gemini Flash Lite calls plus pennies in compute.
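The parallelism and checkpointing pattern above can be sketched in a few lines. This is a simplified illustration, not the production script: enrich_fn stands in for the Gemini call, and enrich_all, checkpoint_path, and save_every are hypothetical names.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def enrich_all(chunks, enrich_fn, checkpoint_path, workers=15, save_every=500):
    """Enrich {chunk_id: text} in parallel, resuming from a pickle checkpoint."""
    path = Path(checkpoint_path)
    # Resume: anything already in the checkpoint is skipped.
    done = pickle.loads(path.read_bytes()) if path.exists() else {}
    pending = {cid: text for cid, text in chunks.items() if cid not in done}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(enrich_fn, text): cid for cid, text in pending.items()}
        for count, future in enumerate(as_completed(futures), start=1):
            done[futures[future]] = future.result()
            if count % save_every == 0:
                path.write_bytes(pickle.dumps(done))  # periodic checkpoint

    path.write_bytes(pickle.dumps(done))  # final save
    return done
```

A second run with the same checkpoint file does no new work for chunks that already finished. The sketch submits every pending chunk up front, which is fine for a demo but would want batching at 600k chunks.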

Measuring the Lift

The results were immediate and measurable. On my Oakland test set, contextual retrieval provided a +10% lift in Faithfulness and a +5% lift in Context Precision.

The reason? When a user asks about “ADU height,” the embeddings for chunks enriched with “This section establishes height limits for Accessory Dwelling Units (ADUs)…” are now much closer to the query than raw text chunks that just say “Maximum height is 18 feet.”

RAGAS: Building a Real Evaluation Loop

Building a RAG system without evaluation is roughly equivalent to refactoring code without tests. It works until it doesn’t, and one has no way to know when “doesn’t” starts.

When I first shipped, my evaluation “loop” was me typing five questions into the UI. That doesn’t scale to 60 cities. I needed a systematic way to measure quality. I built an evaluator inspired by the RAGAS (RAG Assessment) framework, using the LLM-as-a-Judge pattern.

The Golden Dataset

I hand-curated a “Golden Dataset” of 26 questions that represent the real diversity of user intent in this domain:

Procedural: “How do I schedule a building inspection?” or “How do I get a demolition permit?”
Legal/Quantitative: “What is the maximum lot coverage allowed?” or “What are the height and setback requirements for a fence?”
Ambiguous/Multi-part: “What permits do I need for a kitchen remodel?” (requires building, electrical, and plumbing context).
Negative/Out-of-scope: “What is the best restaurant near city hall?” (Testing if the system correctly rejects non-permit questions).

Having a fixed set of questions is critical. It allows one to A/B test changes - like swapping an embedding model or tweaking a prompt - and see exactly how the numbers move.

The Metrics: Faithfulness & Context Precision

The evaluator measures two core metrics on a 0.0 to 1.0 scale:

Faithfulness (The Hallucination Guard): This measures whether the answer is grounded only in the retrieved context. The judge (Gemini 2.5 Pro) extracts every factual claim from the answer and verifies it against the context.

def score_faithfulness(question, answer, contexts, judge):
    # _gen(judge, prompt) sends a prompt to the judge model and returns its text reply.
    # `question` is currently unused by this metric.
    # Step 1: Extract claims
    claims_raw = _gen(judge, f"List each distinct factual claim in this answer: {answer}")
    claims = [c.strip() for c in claims_raw.splitlines() if c.strip()]
    if not claims:
        return 0.0  # nothing to verify; also avoids dividing by zero below

    # Step 2: Verify each claim against context
    supported = 0
    for claim in claims:
        verdict = _gen(judge, f"Context: {contexts}\nClaim: {claim}\nIs this supported? YES/NO")
        if verdict.strip().upper().startswith("YES"):
            supported += 1

    return supported / len(claims)

Context Precision (The Retrieval Guard): This measures whether the search is actually finding the right needles in the haystack. It uses a ranking metric to check that the most relevant chunks are at the top of the list.

def score_context_precision(question, contexts, judge):
    relevance = []
    for ctx in contexts:
        verdict = _gen(judge, f"Question: {question}\nContext: {ctx}\nIs this relevant? YES/NO")
        relevance.append(1 if verdict.upper().startswith("YES") else 0)

    # Compute Average Precision over the ranked list
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0

    score, running = 0.0, 0
    for k, rel in enumerate(relevance):
        if rel:
            running += 1
            score += running / (k + 1)
    return score / total_relevant
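To see why this metric rewards ranking and not just recall, it helps to run the Average Precision step on its own. The sketch below isolates that arithmetic from the judge calls; average_precision is a hypothetical standalone name, and the relevance flags are made-up inputs.

```python
def average_precision(relevance):
    """Average Precision over a ranked list of 0/1 relevance flags."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, running = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            running += 1
            score += running / k  # precision at this rank
    return score / total_relevant


print(round(average_precision([1, 1, 0, 0]), 3))  # 1.0
print(round(average_precision([0, 0, 1, 1]), 3))  # 0.417
```

Both lists contain the same two relevant chunks, but burying them at ranks 3 and 4 cuts the score by more than half.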

The “Thinking Token” Trap

I used Gemini 2.5 Pro as the judge. Initially, I set max_output_tokens=8 for the YES/NO judge call, assuming a one-word answer would be fast and cheap.

It wasn’t. Gemini Pro uses internal “thinking” tokens before producing output. Those tokens count against the limit. With a limit of 8, the thinking tokens consumed the entire budget, and the model returned an empty string. My parser saw an empty string, assumed “NO”, and my first eval run showed 0% quality across every city.

The fix: Bump the budget to max_output_tokens=256. One only pays for what is used, so the ceiling is free, and it gives the model room to “think” before it commits to a YES.
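A stricter verdict parser is a useful companion to the larger token budget, because it stops an empty reply from being silently scored as “NO”. A minimal sketch, with parse_verdict as a hypothetical name:

```python
from typing import Optional


def parse_verdict(raw: Optional[str]) -> Optional[bool]:
    """Map a judge reply to True/False, or None when there is no usable verdict."""
    text = (raw or "").strip().upper()
    if text.startswith("YES"):
        return True
    if text.startswith("NO"):
        return False
    return None  # empty or off-format reply: retry or exclude, do not score as NO
```

With this shape, a run where every verdict comes back as None shows up as a parsing problem rather than as 0% quality.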

The 5-City Results

I ran the 26-question set against five representative cities (130 evals total).

City	Faithfulness	Context Precision
Oakland	0.347	0.513
Berkeley	0.417	0.423
San Francisco	0.487	0.622
Irvine	0.572	0.099
Denver	0.290	0.319
Average	0.423	0.395

The takeaway: Faithfulness is relatively stable (0.29 to 0.57), meaning the generator is behaving consistently. Context Precision is the real variable, ranging from 0.10 to 0.62. Irvine (0.10) is a retrieval emergency. The scraper likely missed the breadcrumb structure, leaving the search blind.

RAGAS turned “it feels better” into a number one could track per deploy.

Claude -> Gemini in the Live App

Economics forced a migration from Claude Sonnet to Gemini 2.5 Flash. Cost dropped from $0.04 to $0.005 per chat turn - an 8x reduction.

However, Gemini surfaced Citation Stacking: citing 12 identical chunks for one rule. “Maximum height is 18 feet [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12].”

The fix:

Deduplicate chunks by exact text match before they reach the model.
Aggressive Prompting: “Never write [1][2][3]…[12]. That is noise.”
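The deduplication step can be sketched as below. It assumes each retrieved chunk is a dict with a "text" key, which is an assumption about the data shape; dedupe_chunks is a hypothetical name.

```python
def dedupe_chunks(chunks):
    """Drop chunks whose text exactly matches an earlier chunk, keeping rank order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk["text"] not in seen:
            seen.add(chunk["text"])
            unique.append(chunk)
    return unique
```

Keeping the first occurrence preserves the retriever's ranking, so the surviving copy is the one with the best score.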

The Cost of Scaling: Transitioning to Local LLMs

While Gemini is 8x cheaper than Claude, a “hobby” project can still rack up a bill during a heavy evaluation run or a viral spike in traffic. If one is looking to cap spend, the next logical step is to bring the execution local.

Local Options

In 2026, one does not need a massive server farm to run high-quality models. There are two main paths:

Ollama (Development & Prototyping): The “Docker for LLMs.” It is the easiest way to run models like Llama 4 or Qwen locally. It handles the quantization (compressing the model) and provides a simple local API.
vLLM (Production & Throughput): If one wants to serve multiple users at once, vLLM is the gold standard. It uses “PagedAttention” to handle concurrent requests much more efficiently than a standard setup.

Hardware: VRAM is King

The cost of “free” local execution is the upfront hardware investment.

The Budget Build: An RTX 3060 12GB (~$250 used) can run 8B-parameter models comfortably.
The Sweet Spot: An RTX 5060 Ti 16GB (~$500) can handle 14B-20B models, which are often the “sweet spot” for reasoning tasks.
The Apple Alternative: A Mac M4 Pro with 64GB of RAM is the best value for running massive 70B models, as its unified memory lets the GPU address most of the system RAM.

Do I still need Google Vertex AI?

Strictly speaking, no. One can run a fully local RAG stack:

LLM: Run Llama 3 via Ollama locally.
Embeddings: Use an open-source model like BGE-M3 or nomic-embed-text locally instead of Vertex AI.
Database: Keep using Neon (PostgreSQL) for your vector store. Neon’s free/hobby tier is generous, and one only pays for storage and compute when the DB is “awake.”

The trade-off is maintenance vs. cost. Vertex AI is a managed service - it is always there, it scales, and one does not have to worry about a local power bill or GPU cooling. But for a heavy user, a $1,500 PC pays for itself in roughly 6 months of API savings.

Hardware Budget (June 2026)

If one is ready to make the jump, here is a cost-effective hardware recipe for high-performance local RAG:

Component	Budget Choice	Cost
GPU	Used RTX 3060 12GB	$200
PC	Used Dell OptiPlex 7080 MT	$130
PSU	New/Used 550W PSU	$60
Misc	Power Adapter / Shipping	$20
TOTAL		$410

Where to shop (June 2026):

eBay: The most reliable source for used GPUs. Look for sellers with high ratings and original packaging if possible.
Back Market / VIPOutlet: Excellent for finding “base” business desktops like the OptiPlex with a warranty.
FB Marketplace: Best locally for deals on gaming PCs being sold without a GPU by users who just upgraded.

The ROI Verdict: For an upfront investment of ~$410, one can eliminate the ~$20/month recurring hosting and API bill. This setup pays for itself in roughly 20 months. More importantly, one gains “instant” response times and the freedom to run 1,000 evaluations a day without checking a credit balance.
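The break-even arithmetic behind that verdict is simple enough to check directly. This sketch uses the figures from the budget table and ignores electricity, which is a simplifying assumption; breakeven_months is a hypothetical helper.

```python
def breakeven_months(upfront_cost: float, monthly_savings: float) -> float:
    """Months until the avoided monthly bill adds up to the upfront hardware cost."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return upfront_cost / monthly_savings


parts = {"GPU": 200, "PC": 130, "PSU": 60, "Misc": 20}  # from the budget table
total = sum(parts.values())
print(total, round(breakeven_months(total, 20), 1))  # 410 20.5
```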

What’s Next

Summer Vacation: Taking a well-earned break before the next phase.
DIY LLM/Postgres System: Building the actual hardware and migrating the entire stack to a local, air-gapped environment.
Temporal Versioning: When was this section last updated?
Entity Extraction: Turning ordinance numbers and fee amounts into metadata filters.
Continuous Eval in CI (Continuous Integration): Catching regressions at PR (Pull Request) time.

Musings: The Engineering is in the Wrappers

In Part 1, I noted that the LLM is rarely the bottleneck. Phase 2 proved it. Everything I shipped - HNSW fixes, hybrid search, contextual embeddings, RAGAS - was a data-or-systems problem.

There is a temptation to attribute AI quality to the model itself. But the model is the most stable component. The real engineering happens in the “wrappers”: the chunker, the retriever, the database, and the evaluation loop.

Phase 1 was about investigating if it could be done. Phase 2 was about investigating if it could be engineered.

About This Project

PermitIQ was built on my own time. Total spend: $100 to $150, mostly on embeddings and evaluation. Storage and serving costs remain negligible.

Thanks for reading.

What I Learned Building a RAG System on Real, Messy Data

2026-05-17T00:00:00+00:00

Lessons from building a 60-city municipal code Q&A system.

Background

Over the past six weeks I have been itching to go deeper into the ML and AI space. I had already explored classical ML algorithms, tracing the evolution from ID3 decision trees through XGBoost, but I wanted to understand how RAG (Retrieval-Augmented Generation) systems actually work in practice, not just in blog posts and tutorials.

I wanted to build something real. Something with messy, inconsistent data. Something at a scale large enough (9 GB of data) that naive approaches would actually break. And I wanted to do it completely on my own equipment and personal accounts, for reasons I explain at the bottom of this article.

I picked municipal permit data as my domain. If you have ever tried to figure out whether you need a permit to add a deck, replace your windows, or build an ADU on your property, you know the problem: the information exists in public records but is buried inside thousands of pages of legal text scattered across city websites, PDF fee schedules, and legislative databases. I wanted to build a system that could answer plain English questions about building permits using the actual code as its source of truth.

The result is PermitIQ, a system covering 60+ US cities. This article is my honest account of how I built it.

How These Systems Are Typically Built

Before I get into what I built, it helps to understand the standard playbook for RAG systems, because I followed it pretty closely before deviating in a few important places.

A typical RAG pipeline looks like this:

Step 1: Collect documents. You gather your source material. In most tutorials this is a folder of PDFs or a Wikipedia dump. In production it is usually a web scraper, a database export, or an API.

Step 2: Chunk. You split documents into smaller pieces. Why? Because embedding models have token limits (usually 512 to 8192 tokens), and more importantly because a 50-page document embedded as one vector has diluted signal. You want each chunk to represent one coherent idea so that similarity search returns precise results.

Step 3: Embed. You run each chunk through an embedding model, which converts the text into a vector (an array of numbers, typically 768 or 1536 dimensions). Texts with similar meaning end up geometrically close in this vector space.

Step 4: Store. You save the vectors alongside the original text in a vector database. At query time you embed the user’s question, find the nearest vectors by cosine distance, and retrieve the corresponding text chunks.

Step 5: Generate. You pass the retrieved chunks to an LLM as context. The LLM reads the retrieved passages and writes an answer grounded in them.

The standard beginner stack for this is: LangChain or LlamaIndex for orchestration, OpenAI for embeddings, Pinecone or Chroma for vector storage, and GPT-4 for generation. It works fine for demos.

Where it breaks down in real production systems is in steps 1 and 2, which most tutorials treat as trivial. Getting clean, well-structured data is 90% of the work. I learned this the hard way.

My Architecture

Here is the complete data flow for PermitIQ:

Step 1: Scraping

Municipal data comes from three distinct source types, each requiring a different strategy.

Municode

Most US cities publish their municipal code on library.municode.com, a single-page Angular application. A plain HTTP request returns an empty shell. The content is loaded dynamically by JavaScript after the page initializes.

The obvious solution is to run a headless browser for every page. That works but is painfully slow. A 3,000-node city code would take hours.

My approach was to use Playwright once to load the page and intercept the internal API calls the Angular app makes. I captured the session cookies, CSRF tokens, client ID, product ID, and job ID, then switched to parallel httpx requests for all the actual content fetching.

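async def _load_page(self, code_slug: str):
    # NOTE: _seed_url is a hypothetical helper that maps a slug to the city's Municode URL.
    seed_url = self._seed_url(code_slug)
    context = await self.browser.new_context(user_agent=self._UA)
    page = await context.new_page()
    captured = {}

    async def on_response(response):
        # Capture the JSON payloads of the internal API calls the Angular app makes.
        if "/api/Clients/name" in response.url:
            captured["client"] = await response.json()
        elif "/api/Jobs/latest/" in response.url:
            captured["job"] = await response.json()

    page.on("response", on_response)
    await page.goto(seed_url, wait_until="networkidle")
    # Hand the captured IDs and session cookies to the parallel httpx fetchers.
    return captured, await context.cookies()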

This reduced a 3,500-node city from 3+ hours to under 10 minutes with 20 concurrent content fetches.

I also tracked the full parent heading chain for every node during BFS expansion of the table of contents, so that each scraped section knew its full breadcrumb path: “Planning Code > Title 17 > Chapter 17.102 > ADUs > Setback Requirements.” More on why this matters in the chunking section.

City Websites

City websites are inconsistent (one wonders why there is no uniform government standard for how this data is published, but that is a story for another day). Some are plain HTML. Some are JavaScript SPAs that return 403 to non-browser requests. Some have sitemaps; many do not.

I built a two-tier crawler: a fast httpx + BeautifulSoup BFS crawler for plain HTML sites, and a Playwright-based BFS crawler for sites that block bot traffic. A discovery system finds the right entry points by trying, in order:

Parse /sitemap.xml and filter to permit-relevant URLs
Google Custom Search API restricted to the city domain
Probe 15 common paths like /permits, /building, /planning-and-zoning

One subtle bug I spent time on: many city sites redirect lowercase URLs to Title-Case paths. /growth-and-development becomes /Growth-and-Development after the redirect. I had to make all URL boundary checks case-insensitive so that redirected URLs were still matched against the crawl boundary correctly.
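A case-insensitive boundary check might look like the sketch below. It uses only the standard library; within_boundary and the example.gov URLs are hypothetical, and the real crawler may compare URLs differently.

```python
from urllib.parse import urlsplit


def within_boundary(url: str, boundary: str) -> bool:
    """True when url is on the boundary's host and under its path, ignoring case."""
    u, b = urlsplit(url), urlsplit(boundary)
    # urlsplit lowercases hostnames already; paths need explicit lowercasing.
    if (u.hostname or "") != (b.hostname or ""):
        return False
    base = b.path.lower().rstrip("/")
    path = u.path.lower().rstrip("/")
    # Require a path-segment match so /growth-and-developments does not slip through.
    return path == base or path.startswith(base + "/")
```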

Legistar

Municode lags reality by months. Ordinances passed recently are in Legistar, the legislative tracking system many cities use. I bulk-fetch the last 5 years of ordinances and resolutions via the Legistar REST API, and also do recursive resolution: any ordinance number cited in code text gets fetched, and any new numbers found in those texts get fetched too.

Content Quality Gate

Not everything a BFS crawler finds is worth keeping. City websites are full of cookie consent pages, navigation stubs, and template pages. I added a filter before saving anything:

def _content_is_useful(markdown: str) -> bool:
    lines = [l.strip() for l in markdown.splitlines() if l.strip()]
    non_link = [l for l in lines if not l.startswith("[") and "](http" not in l]
    if len(" ".join(non_link)) < 200:
        return False
    link_ratio = sum(1 for l in lines if "](http" in l) / len(lines)
    if link_ratio > 0.8:
        return False
    return True

Garbage chunks poison retrieval quality more than missing chunks. A smaller high-quality corpus consistently beats a massive noisy one.

YAML Frontmatter

Every saved markdown file gets a YAML frontmatter block:

---
source_url: https://library.municode.com/ca/oakland/codes/code_of_ordinances?nodeId=ABC123
city: oakland
doc_type: municode_section
breadcrumb: Planning Code > Title 17 > Chapter 17.102 > ADUs
section_number: 17.102.130
fetched_at: 2026-05-15T07:19:00Z
content_sha256: 7f9a3b...
---

This metadata survives into the embedding pipeline and eventually into the vector database, making it possible to filter by city, doc type, or section number at query time.

Step 2: Chunking

This is where most tutorials skip over the hard part. Splitting text intelligently is what separates useful retrieval from garbage retrieval.

Why Naive Chunking Fails

A fee schedule split at 1,000 characters might cut a table in half. A zoning section split mid-sentence loses the conditional clause that changes the entire meaning. A definition split from its header is orphaned text with no context.

I initially used simple token-count splitting. The results were bad. Retrieved chunks would be missing the heading that identified what code section they came from. Fee tables would be half-rendered. Conditions would be split from the rules they modified.

Docling HybridChunker

I switched to IBM’s Docling library with its HybridChunker. Docling converts markdown to a structured document model that understands headings, paragraphs, tables, and lists as distinct elements. The chunker splits along structural boundaries rather than arbitrary character counts, targeting 1,200 tokens per chunk.

Each chunk comes with chunk.meta.headings, the heading chain above that chunk in the document. I prepend this as a breadcrumb to the chunk text before embedding:

headings = list(chunk.meta.headings) if chunk.meta.headings else []
breadcrumb = " > ".join(headings)
embed_text = f"{breadcrumb}\n\n{chunk.text}" if breadcrumb else chunk.text

The difference this makes is significant. The embedding model sees “Planning Code > Residential Zones > ADUs > Setback Requirements: The minimum rear setback shall be 4 feet” instead of just “The minimum rear setback shall be 4 feet.” That context is what allows a query about ADU setbacks to reliably find this chunk.

Table Handling

html2text mangles HTML tables. Tables are exactly where fee schedules and permit requirement matrices live. I pre-process every <table> element with BeautifulSoup before passing the HTML to html2text, converting each one to a pipe table:

def _table_to_pipe(table_tag) -> str:
    rows = table_tag.find_all("tr")
    md_rows = []
    for i, row in enumerate(rows):
        cells = [c.get_text(" ", strip=True) for c in row.find_all(["th", "td"])]
        md_rows.append("| " + " | ".join(cells) + " |")
        if i == 0:
            md_rows.append("| " + " | ".join("---" for _ in cells) + " |")
    return "\n".join(md_rows)

Step 3: Vectorizing

An embedding model converts text into a vector, an array of floating-point numbers in a high-dimensional space. Texts with similar meaning end up geometrically close. When I embed a user’s query, I can find the stored chunks whose vectors are nearest to the query vector. Those are the most semantically relevant passages.

Vertex AI text-embedding-005

I use Google’s text-embedding-005 model via Vertex AI. It produces 768-dimensional vectors, handles up to 2,048 tokens per input, and integrates natively with the rest of my GCP infrastructure.

inputs = [TextEmbeddingInput(text, "RETRIEVAL_DOCUMENT") for text in batch]
result = model.get_embeddings(inputs, output_dimensionality=768)
vectors = [e.values for e in result]

I process chunks in batches of 10 with exponential backoff for rate limits. At scale (60 cities, ~5,000 chunks each) this is around 300,000 chunks to embed, or roughly 30,000 batched requests. Rate limit handling is not optional at that volume.
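The batching and backoff pattern can be sketched generically. This is an illustration under assumptions, not the pipeline's actual retry code: with_backoff and batched are hypothetical helpers, and a real version would catch only the API's rate-limit exception.

```python
import random
import time


def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # in real code, catch only the API's rate-limit error
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def batched(items, size=10):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch would then be embedded through a call like with_backoff(lambda: embed(batch)), where embed is whatever function wraps the embedding request. The jitter term keeps parallel workers from retrying in lockstep.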

Neon pgvector

I store vectors in Neon, a serverless PostgreSQL service, using the pgvector extension:

CREATE TABLE embeddings (
    id          TEXT PRIMARY KEY,
    city        TEXT NOT NULL,
    text        TEXT NOT NULL,
    source_url  TEXT,
    breadcrumb  TEXT,
    section_num TEXT,
    embedding   vector(768)
);
CREATE INDEX ON embeddings USING ivfflat (embedding vector_cosine_ops);

At query time:

SELECT text, source_url, breadcrumb, section_num
FROM embeddings
WHERE city = $1
ORDER BY embedding <=> $2
LIMIT 8

The <=> operator is pgvector’s cosine distance. The IVFFlat index keeps queries fast even with hundreds of thousands of vectors.
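For reference, cosine distance is one minus cosine similarity. The following pure-Python sketch shows the quantity being ordered on; it is for illustration only and is not how pgvector computes it internally.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Because the measure depends only on direction, two vectors of different lengths pointing the same way have a distance of 0. The sketch assumes neither vector is all zeros.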

Step 4: Retrieval-Augmented Generation

RAG is the pattern that ties everything together. Instead of asking an LLM to answer from memory, where it will hallucinate or give outdated information, I:

Retrieve the most relevant passages from the vector database
Inject them into the LLM’s context as grounding material
Ask the LLM to answer only from those passages and cite its sources

const queryVector = await embed(userMessage)
const chunks = await queryChunks(queryVector, city, 8)

const context = chunks.map(c =>
  `[${c.section_num || c.breadcrumb}]\n${c.text}\n(Source: ${c.source_url})`
).join("\n\n---\n\n")

const systemPrompt = `You are a municipal code expert for ${city}.
Answer using ONLY the code sections provided below.
Cite section numbers. If the answer is not in the provided sections, say so.

${context}`

The key instruction is “answer using only the provided sections.” Without that constraint, Claude will fill gaps with plausible-sounding but potentially wrong information. With it, the model either gives a cited answer or honestly says it does not have the relevant section, which tells the user to check with the city directly rather than act on a hallucination.

Technologies: Initial Design vs. Final Implementation

Where I Started

My original stack was: Pinecone for vector storage, Voyage AI for embeddings, Cloud SQL (PostgreSQL on GCP) for relational data, and a simple static-seed scraper that required manually configured URLs per city. Markdown files lived on local disk. Chunking used generic token-count splits.

This is a reasonable and well-documented starting point. Pinecone is purpose-built for vector search. Voyage AI produces excellent embeddings. Cloud SQL is solid managed PostgreSQL.

Where I Ended Up

After hitting real-world data quality problems and scaling to 60 cities, the stack shifted significantly.

Embeddings: Voyage AI to Vertex AI text-embedding-005. I consolidated onto GCP to simplify billing and authentication. Vertex AI integrates natively with Cloud Run and IAM, eliminating a separate vendor. Performance on retrieval benchmarks is comparable.

Vector store: Pinecone to Neon pgvector. This was the biggest shift. I already needed relational storage for city metadata and job tracking. Consolidating onto one PostgreSQL instance with the pgvector extension eliminated Pinecone’s cost and reduced operational complexity. At my scale, the query performance is indistinguishable.

Storage: Local filesystem to Google Cloud Storage. Scaling to 60 cities with a scraper that runs on one machine and an embedder that runs on another made local storage impractical. GCS gives durable shared storage that both scripts can access.

Chunking: Token-count splits to Docling HybridChunker. This had the single largest impact on answer quality. Legal text has structure. Respecting that structure during chunking dramatically improves retrieval precision.

Scraper: Static seeds to auto-discovery. Adding sitemap parsing, Google Custom Search integration, and a Playwright fallback for sites that block plain HTTP made the system scalable. Adding a new city went from “manually configure 10 seed URLs” to “add one line to a config file.”

The LLM itself was never the bottleneck. Claude’s API is reliable and capable. The hard work was entirely in the data pipeline: getting the right text, structured correctly, chunked intelligently, with enough metadata to filter and cite accurately. This is the part that most RAG tutorials compress into three lines of code.

What I Learned

Building this taught me things I could not have gotten from reading papers or following tutorials.

Data quality is the entire game. I spent more time on the scraper and chunker than on everything else combined. The vector search and LLM layers are surprisingly forgiving when the input data is good. They are completely useless when it is bad.

Structure-aware chunking is not optional for legal or technical documents. Generic token splitting works fine for narrative text. For anything with tables, numbered sections, conditional clauses, and defined terms, you need a chunker that understands the document model.

Breadcrumbs matter more than I expected. A short section with no heading context is nearly impossible to retrieve reliably. Prepending the full heading path to each chunk before embedding was one of the highest-leverage changes I made.
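As a minimal sketch of the breadcrumb idea, assuming each chunk already carries the list of headings above it: the function name `with_breadcrumb`, the " > " separator, and the sample headings are illustrative assumptions, not the code PermitIQ runs.

```python
from typing import Sequence


def with_breadcrumb(heading_path: Sequence[str], chunk_text: str, sep: str = " > ") -> str:
    """Return the text to embed: the heading path on one line, then the chunk body."""
    # Drop empty headings so a missing level does not leave a dangling separator.
    crumbs = [h.strip() for h in heading_path if h and h.strip()]
    if not crumbs:
        return chunk_text
    return f"{sep.join(crumbs)}\n\n{chunk_text}"


# Made-up heading path and section text, for illustration only.
text_to_embed = with_breadcrumb(
    ["Title 1 Example Title", "Chapter 1.2 Example Chapter", "Section 1.2.3 Setbacks"],
    "The minimum setback is 4 feet.",
)
print(text_to_embed)
```

The short section on its own says nothing about which title or chapter it belongs to. With the path prepended, the embedded text carries that context, which is the effect described above.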

Rate limits at scale require real retry logic. Hitting 300,000 embedding API calls across 60 cities exposed every flaw in my backoff code. Handling the Retry-After header correctly and skipping failed batches gracefully rather than crashing took iteration.
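Here is a rough sketch of the retry shape I mean, under stated assumptions: `embed_batch` stands in for whatever client call produces embeddings, and `RateLimitError` with its `retry_after` attribute is a hypothetical wrapper around an HTTP 429 response. A real client will expose the header differently.

```python
import random
import time
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


class RateLimitError(Exception):
    """Hypothetical error raised by the embedding client on HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        # Seconds parsed from the Retry-After header, or None if it was absent.
        self.retry_after = retry_after


def embed_all(
    batches: Iterable[Sequence[str]],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[List[float]], List[int]]:
    """Embed every batch; return (vectors, indexes of batches that were skipped)."""
    vectors: List[List[float]] = []
    failed: List[int] = []
    for index, batch in enumerate(batches):
        for attempt in range(max_attempts):
            try:
                vectors.extend(embed_batch(batch))
                break  # success, move on to the next batch
            except RateLimitError as err:
                if err.retry_after is not None:
                    delay = err.retry_after  # the server said how long to wait
                else:
                    # Exponential backoff with jitter when no header is present.
                    delay = min(max_delay, base_delay * 2 ** attempt) * random.uniform(0.5, 1.0)
                if attempt < max_attempts - 1:
                    sleep(delay)
        else:
            # Every attempt was rate limited: record the batch and keep going.
            failed.append(index)
    return vectors, failed
```

Returning the failed batch indexes instead of raising lets a long run finish and leaves a short list to retry later. The `sleep` parameter is only there so the loop can be exercised without real waiting.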

Costs are lower than you might expect, but not zero. The bulk of my ~$100 GCP spend so far went to the initial Vertex AI embedding run across 60 cities, not to storage. The 9 GB of scraped markdown in Google Cloud Storage costs pennies a month. Neon’s Pro plan handles the pgvector database (roughly 1 GB of vectors once compressed) at around $19/month. Cloud Run for the app is negligible at hobby traffic levels. The catch is re-embedding: whenever a city updates its code and you re-scrape a large section, you pay for those embedding calls again. For 60 cities with codes that change constantly, that ongoing cost adds up if you run updates frequently.

What Is Next

The system works well but has clear headroom:

Temporal versioning. Municipal codes change constantly. Detecting when a section changes and re-embedding only the diff would keep the index current without full re-scrapes.
Entity extraction. Pulling ordinance numbers, zoning codes, permit types, and fee amounts into structured metadata would enable hybrid retrieval combining vector search with metadata filters.
Cross-city comparison. “How does Oakland’s ADU policy compare to Berkeley’s?” is architecturally straightforward but requires careful prompt design to avoid the model confusing the two cities’ rules.
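For the temporal versioning item, one simple way to find what changed is to keep a content hash per section and compare it on each scrape. This is a sketch of that idea only, not something the system does today; the section ids and the two dictionaries are hypothetical shapes.

```python
import hashlib
from typing import Dict, List, Tuple


def section_hash(text: str) -> str:
    """Hash a section after collapsing whitespace so reflowed text is not flagged."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_sections(
    stored_hashes: Dict[str, str], scraped: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (section ids to re-embed, section ids to delete from the index)."""
    to_embed = [
        section_id
        for section_id, text in scraped.items()
        if stored_hashes.get(section_id) != section_hash(text)
    ]
    to_delete = [section_id for section_id in stored_hashes if section_id not in scraped]
    return sorted(to_embed), sorted(to_delete)
```

Only the ids in the first list would need new embedding calls, which is where the cost saving would come from.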

The data pipeline is what makes this defensible. Anyone can call an LLM API. Clean, well-structured, properly chunked municipal code with breadcrumb-enriched embeddings for 60+ cities takes real engineering work to build and maintain.

Got Suggestions?

If you have built something similar, a RAG system on legal, regulatory, or government data, I would love to hear what worked for you. Specifically: different chunking strategies, alternative embedding models, retrieval approaches I have not tried, or ways you handled data quality at scale. Drop a comment or reach out directly. I am still learning and genuinely curious what approaches others have taken.

Musings

Here is something I kept thinking about while building this.

Oakland has roughly 440,000 residents. Berkeley has about 120,000. Richmond clocks in around 115,000. These are not megacities. And yet each one maintains thousands of pages of code specifying exactly how you may remodel your kitchen, what permits you need to replace your bathtub, and the precise rules governing whether you can install a hot tub in your backyard.

Someone wrote those pages. Someone updates them. Someone fields the phone calls when a confused homeowner cannot find the answer. And for decades, the only ways to access any of it were to call City Hall, hire a permit expediter, or wade through PDFs that look like they were formatted in 1997.

I find it equal parts absurd and endearing. There is something very human about a government body carefully documenting the rules for backyard hot tubs. I just think it should be easier to find.

About This Project

I built PermitIQ entirely on my own time, out of genuine curiosity. I am an engineer who likes to understand how things work by building them. Reading about RAG systems was not enough — I wanted to get my hands dirty, hit real problems, and figure out solutions. That is how I learn, and that is how I keep up with where technology is headed.

All the code, infrastructure, and data here are my own work, funded out of my own pocket. I deliberately chose to host this on my own GCP account rather than use any infrastructure from my employer. A misconfiguration in a personal side project should never be a vector for a security incident that could affect an entirely separate organization’s resources.

40 Years of Decision Trees: From ID3 to XGBoost

2026-03-01T00:00:00+00:00

Overview

This is an account of implementing and comparing four decades of decision tree algorithms, from Quinlan’s foundational ID3 (1986) through modern gradient boosting with XGBoost and LightGBM. I built ID3 and C4.5 from scratch and tested all four against identical UCI datasets.

The short version: XGBoost improved average accuracy by 7.43 percentage points over ID3 with 81% less overfitting. On complex real-world datasets the gap widens to roughly 65 to 70 percentage points.

Full code is on GitHub.

The 40-Year Evolution

Algorithm	Year	Avg Accuracy	vs ID3	Overfitting	Speed
ID3	1986	90.72%	Baseline	9.28%	1.0x
C4.5	1993	91.78%	+1.06%	8.22%	2.4x
XGBoost	2014	98.15%	+7.43%	1.74%	6.7x

On the Tic-Tac-Toe dataset, XGBoost achieves 98.26% accuracy vs ID3’s 76.74%, a gain of 21.53 percentage points, while reducing overfitting by 92%.

The Algorithms

ID3 (1986) — Information Gain

Quinlan’s original algorithm selects attributes that maximize information gain at each node using Shannon entropy.

gain(A) = I(p,n) - E(A)

Limitations: overfits training data, biased toward multi-valued attributes, handles only discrete attributes.
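In the formula above, I(p,n) is the entropy of the class labels at the node and E(A) is the weighted entropy that remains after splitting on attribute A. A minimal sketch of that computation on a toy table follows; the function names and the four sample rows are mine, not taken from the repository.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows: List[Dict[str, Hashable]], attribute: str, target: str) -> float:
    """gain(A) = I(p, n) - E(A) for one discrete attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    before = entropy([row[target] for row in rows])  # I(p, n)
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())  # E(A)
    return before - after


# Toy data: "a" separates the classes perfectly, "b" carries no information.
rows = [
    {"a": "x", "b": "u", "y": "yes"},
    {"a": "x", "b": "v", "y": "yes"},
    {"a": "z", "b": "u", "y": "no"},
    {"a": "z", "b": "v", "y": "no"},
]
best = max(["a", "b"], key=lambda attr: information_gain(rows, attr, "y"))
print(best)
```

ID3 picks the attribute with the highest gain at each node and recurses on each resulting group.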

C4.5 (1993) — Gain Ratio and Pruning

Key improvements over ID3:

Gain Ratio: normalizes information gain to reduce bias toward multi-valued attributes
Pessimistic Error Pruning: post-prunes trees to improve generalization
Continuous attributes: automatically finds optimal thresholds
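To make the gain ratio point concrete, here is a small self-contained sketch: gain ratio divides information gain by the entropy of the split itself, so an attribute with a unique value per row is penalized. The function names and toy values are illustrative, and the sketch leaves out pruning and threshold search.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(attribute_values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy(attribute_values)  # entropy of the partition sizes
    return gain / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
row_id = [1, 2, 3, 4]            # unique per row: maximal gain, but useless
binary = ["x", "x", "z", "z"]    # two values that match the classes

print(gain_ratio(row_id, labels), gain_ratio(binary, labels))
```

Both attributes have the same information gain of 1 bit here, so ID3 cannot tell them apart. Gain ratio scores the two-valued attribute higher because its split information is smaller.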

XGBoost (2014) — Gradient Boosting

Builds 100+ sequential trees, each correcting the errors of the trees before it. Uses a second-order Taylor approximation of the loss and L1/L2 regularization. This is what dominated Kaggle from 2015 to 2017, and it remains the industry standard for tabular data.
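The second-order approximation is easier to see in code. With g and h as the first and second derivatives of the loss for each row, the best value for a leaf is -G / (H + lambda), where G and H are the sums over the rows in that leaf. The sketch below follows the formulas in the XGBoost paper for the L2 term only; it ignores L1, and the function names are my own rather than anything from the library.

```python
from typing import Sequence, Tuple


def squared_error_grad_hess(y_true: float, y_pred: float) -> Tuple[float, float]:
    """Derivatives of 0.5 * (y_pred - y_true) ** 2 with respect to y_pred."""
    return y_pred - y_true, 1.0


def leaf_weight(grads: Sequence[float], hess: Sequence[float], reg_lambda: float = 1.0) -> float:
    """Optimal leaf value under the second-order approximation."""
    return -sum(grads) / (sum(hess) + reg_lambda)


def _score(grads: Sequence[float], hess: Sequence[float], reg_lambda: float) -> float:
    g, h = sum(grads), sum(hess)
    return g * g / (h + reg_lambda)


def split_gain(
    left_g: Sequence[float],
    left_h: Sequence[float],
    right_g: Sequence[float],
    right_h: Sequence[float],
    reg_lambda: float = 1.0,
    gamma: float = 0.0,
) -> float:
    """Loss reduction from splitting one leaf in two, minus the per-leaf penalty gamma."""
    parent = _score(list(left_g) + list(right_g), list(left_h) + list(right_h), reg_lambda)
    children = _score(left_g, left_h, reg_lambda) + _score(right_g, right_h, reg_lambda)
    return 0.5 * (children - parent) - gamma


# Toy regression targets with an initial prediction of 0 for every row.
targets = [1.0, 1.0, 3.0, 3.0]
pairs = [squared_error_grad_hess(y, 0.0) for y in targets]
grads = [g for g, _ in pairs]
hess = [h for _, h in pairs]

print(leaf_weight(grads, hess, reg_lambda=0.0))  # plain mean of the targets
print(leaf_weight(grads, hess, reg_lambda=1.0))  # shrunk toward zero by L2
print(split_gain(grads[:2], hess[:2], grads[2:], hess[2:], reg_lambda=0.0))
```

With lambda at zero and squared error, the leaf value is just the mean residual. Raising lambda shrinks it toward zero, which is one of the ways the regularization limits overfitting.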

LightGBM (2017) — Faster at Scale

Microsoft’s gradient boosting framework, built as a faster alternative to XGBoost: histogram-based splitting, leaf-wise tree growth, and Gradient-based One-Side Sampling (GOSS). Dramatically faster on large datasets.
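Of the three techniques, GOSS is the easiest to sketch in a few lines. The idea, as described in the LightGBM paper, is to keep every row with a large gradient, randomly sample from the rest, and up-weight the sampled rows by (1 - a) / b so the gradient totals stay roughly unbiased. The code below is a plain-Python illustration of that sampling step, not LightGBM's implementation; `top_rate` and `other_rate` play the roles of a and b.

```python
import random
from typing import List, Sequence, Tuple


def goss_sample(
    gradients: Sequence[float],
    top_rate: float = 0.2,
    other_rate: float = 0.1,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Return (row index, weight) pairs chosen by Gradient-based One-Side Sampling."""
    n = len(gradients)
    # Rank rows by gradient magnitude, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Sample uniformly from the small-gradient rows.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Amplify the sampled rows to compensate for the ones that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sampled]


# Made-up gradients: two rows are clearly under-fit, the rest are nearly done.
gradients = [0.1, -5.0, 0.2, 4.0, 0.05, 0.3, -0.1, 0.15, 0.25, 0.01]
picked = goss_sample(gradients)
print(picked)
```

With these rates, each tree would be fit on 30% of the rows instead of all of them, which is part of where the speedup on large datasets comes from.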

Results: Small Datasets

Tic-Tac-Toe Endgame (958 instances)

Algorithm	Test Accuracy	Overfitting
ID3	76.74%	23.26%
C4.5	79.17%	20.83%
XGBoost	98.26%	1.74%

Mushroom Classification (8,124 instances)

All algorithms hit 100% — ceiling effect. C4.5 produces a 13.8% smaller tree than ID3 through pruning.

Results: Large Datasets (Where It Really Matters)

Adult Income Dataset (48K examples)

Algorithm	Accuracy	Time
ID3	16.28%	0.1s
C4.5	81.46%	4.7s
XGBoost	87.47%	0.2s
LightGBM	87.07%	0.7s

ID3 essentially fails on large real-world data. XGBoost trains faster than C4.5 while scoring about 6 percentage points higher.

Forest Cover Type (581K examples)

XGBoost hits 84.53% vs C4.5’s 62.07%, a gain of about 22 percentage points on 7-class classification.

Decision Tree Visualizations

ID3

36 nodes, depth 6. Unpruned, perfectly fits training data.

C4.5

35 nodes, depth 7. Slightly more compact despite greater depth, thanks to pruning.

XGBoost Ensemble

100 sequential trees. Each corrects residual errors from the ensemble.

What I Learned

Dataset size is everything. On an easy, clean dataset like Mushroom, every algorithm hits the same 100% ceiling. On large real-world datasets, gradient boosting wins by 20 to 70 percentage points.

C4.5’s improvements are real but modest. Pruning and gain ratio make a measurable difference in generalization but not a dramatic one. The jump from a single tree to a boosted ensemble of trees is where performance really changes.

Implementing from scratch is worth it. Reading Quinlan’s 1986 paper and then implementing ID3 line-by-line gave me an intuition for how tree splits work that I could not have gotten from calling sklearn.tree.DecisionTreeClassifier.

XGBoost is still relevant in 2026. Despite deep learning dominating vision and language tasks, XGBoost and LightGBM remain the go-to for structured tabular data in production. The companies using them — Uber, Airbnb, Netflix, Microsoft — are not being lazy. They are using the right tool for the problem.

Running the Code

git clone https://github.com/snijsure/id3-decision-tree
cd id3-decision-tree

# Full 40-year evolution comparison
./run_experiment.sh evolution

# Modern comparison (all 4 algorithms)
./run_experiment.sh modern

# Large dataset tests
./run_experiment.sh large_dataset

This was the ML foundation that eventually led me to build PermitIQ — a RAG system on 60+ cities of municipal code data. Different problem space, but the same instinct: learn by building something real.