Generative AI took off and RAG showed up as the standard answer for “we have documents, we want to ask questions.” The pitch sounded miraculous. The implementation everyone described was the same one, over and over:
Vendors converged on it. Consulting decks converged on it. Conference talks converged on it.

Then the deployments started shipping, and the results were often disappointing.
And the team’s reflex, every time, was to pull more tools from the same toolbox:
The framing was always the same: “this is an IT problem. Better infrastructure, better tools, better models will fix it.”
I started looking at it myself, on real enterprise documents, with real domain experts in the room. My experience didn’t match that framing.
The work that actually made a real difference wasn’t infrastructural. It was engineering, plus understanding the business domain, plus a little of the underlying math. Not deep math. Just enough to see what an embedding actually measures, what a reranker actually does, why a particular trick helps in some cases and hurts in others. And then, the piece most teams skip: knowing the documents the system is supposed to answer questions on. Who reads them. What they contain. What vocabulary the experts use. What questions come up week after week.
Most companies aren’t Google. They’re not research labs either. They’re not running open-domain QA over the open web. They’re not training their own embedding models. They have a few core document types, a few dozen domain experts who already know the corpus inside out, and a recurring set of questions that need answers with citations and an audit trail. The right architecture for that context is not what vendor decks pitch and not what research papers chase. It’s an architecture that amplifies the experts and uses cheap, predictable retrieval where it can.
Most of the RAG systems I’ve seen in enterprise production are worse than a hundred-line Python script. The basics are broken, and stacking more on top doesn’t help. Embeddings are too fuzzy in meaning to pick the right passage, and parsing is sloppy enough that the LLM gets garbage in, garbage out.
When a system like that starts to break, the standard reflex is to add layers:
Each layer adds plausibility to the demo. None of them fixes the foundation: there is still no way to tell whether the retrieved passages are the right ones, and still no way to explain to a user why a particular page came back.
The script we’ll build in the first article fits in about a hundred lines and has no vector database, no framework, and no agents.
It takes a PDF and a question, parses, retrieves the top three pages by simple cosine similarity, sends them to an LLM with a Pydantic schema, and returns a structured answer with line citations and a highlighted source PDF.
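To make the retrieval step concrete, here is a minimal sketch of "top three pages by cosine similarity". It is not the script from Article 1. It assumes the pages are already extracted as plain strings, and it uses simple term-count vectors as a stand-in for whatever vectors the real script uses; the names tokenize, cosine, and top_pages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase alphanumeric tokens; a deliberately crude stand-in for real parsing.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_pages(pages: list[str], question: str, k: int = 3) -> list[tuple[int, float]]:
    # Score every page against the question; return (1-based page number, score).
    q = Counter(tokenize(question))
    scored = [(i + 1, cosine(Counter(tokenize(p)), q)) for i, p in enumerate(pages)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up pages, for illustration only.
pages = [
    "Introduction and scope of the policy.",
    "The coverage amount is 50000 per claim.",
    "Exclusions: flood damage is not covered.",
    "Contact details and signatures.",
]
ranked = top_pages(pages, "What is the coverage amount?")
print(ranked)
```

Every score is a number you can print next to its page number, which is what makes the result explainable to a user: the page came back because it shares these terms with the question.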
That script is more verifiable and more useful than many of the production systems I’ve seen up close. The gap between the two isn’t prompt engineering, and it isn’t a better retrieval algorithm. It comes from three habits the industry skips: knowing the documents, knowing what the experts already know, and not confusing RAG with machine learning.
This series wires those habits into a four-brick pipeline: document parsing, question parsing, retrieval, generation, with an optional PDF annotation step that hands the citation back to the reader.

In May 2020, Patrick Lewis and colleagues at Facebook AI Research coined the term in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al. 2020, arXiv preprint). Their introduction names the three failings the architecture was meant to fix, quoted directly:
Pre-trained models “cannot easily expand or revise their memory, can’t straightforwardly provide insight into their predictions, and may produce ‘hallucinations’.” The fix combined a generator (BART) with a dense vector index over Wikipedia, accessed at inference time. The architectural move that mattered: pull a passage from a corpus, hand it to the LLM, let it generate from that context rather than from training-time memory alone.
Those three failings map cleanly onto the three properties enterprise RAG fights for: corpus freshness, citations, grounded answers. The series is a direct continuation of that 2020 line of thinking, applied to enterprise constraints.
For most developers today, “RAG” has narrowed to mean one specific recipe: a vector store, embedding-similarity retrieval, and an LLM at the end.
The series will keep using the word, but in its broader original sense: information extraction, information search, and question answering over a corpus of documents.
The retrieval mechanism is a design choice the architecture admits, not part of the definition. Many of the enterprise pipelines this series defends do not use a vector store at all; some use it as one channel among several, never as the foundation. When the announcement says “RAG”, read it in that broader sense.
The popular framing of RAG (the LLM writes a fluid natural-language answer from retrieved context) under-describes what enterprises actually do with it.
The bulk of the work is information extraction: pulling specific values from documents, with the LLM acting as a structured reader rather than as a writer.
The LLM reads the retrieved passage, identifies the answer, and returns it in a typed schema with line citations. That is extraction, with some light reformatting and cleanup. It is not generation in the creative sense.
Where the LLM is allowed to compose new text in enterprise work, it does so over content the system has already extracted and validated. The series defends a sharp separation: phase one extracts the relevant information with citations, validates it, audits it. Phase two, on top of phase one’s typed output, may compose a longer narrative (a draft notice, a summary paragraph for a report).
Two phases, two LLM calls, two audit surfaces. The audit trail collapses when one LLM call mixes retrieval, extraction, and creative composition. The architecture refuses that conflation.
The 2020 paper picked Augmented over alternatives like Grounded or Conditioned. The word choice carries weight. In the 2020 framework, the generator is free to blend its parametric memory with the retrieved passages. The LLM keeps using what it learned during pre-training and consults the retrieval. Two memories, mixed. Augmented presupposes that something is already there; retrieval adds to it. Grounded would have meant the opposite: the generation rests on the retrieval, anchored to it, and the model is constrained not to stray from what was retrieved.
Enterprise production inverts that assumption. Every factual claim must be backed by a retrieved passage; the LLM’s parametric memory is exiled from the factual content of the answer and kept only for procedural use: grammar, schema-following, verbatim span extraction, arithmetic on cited values, deduction over retrieved facts. The shift from augmented to grounded is small lexically and large operationally. When the LLM rephrases a retrieved clause into a JSON field coverage_amount: 50000, the rephrasing follows English grammar and JSON syntax: that is procedural, and we keep it. When it fills a valid_until: "2027-12-31" field with a date that is not in the retrieved text, that is factual, and we block it.
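One way to picture the "block it" rule is a check that runs after the LLM call and before the answer reaches the user. The sketch below is an assumption about how such a guard could look, not the series' implementation: it treats a field as grounded only if its value appears verbatim in the retrieved passage. The function names and the example passage are hypothetical.

```python
import re


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so line breaks do not defeat the match.
    return re.sub(r"\s+", " ", text.lower()).strip()


def ungrounded_fields(fields: dict[str, str], passage: str) -> list[str]:
    # Return the names of fields whose value is not found verbatim in the passage.
    haystack = normalize(passage)
    return [name for name, value in fields.items()
            if normalize(str(value)) not in haystack]


# Made-up passage and LLM output, for illustration only.
passage = "The insurer covers losses up to a coverage amount of 50000 per claim."
answer = {"coverage_amount": "50000", "valid_until": "2027-12-31"}

blocked = ungrounded_fields(answer, passage)
print(blocked)
```

A verbatim match is the crudest possible test. It would wrongly block a value the source writes as "50,000", and it says nothing about arithmetic or deduction over cited values, which need their own checks.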
The series keeps the architecture from the 2020 paper and narrows what the LLM is allowed to do with its parametric memory.

A million-token window does not collapse the enterprise corpus to one prompt. The corpus is thousands to hundreds of thousands of documents, and finding the right one still has to happen before any LLM call. And a long-context answer drawn from a million-token blob cannot tell the user which page backs which claim. RAG with line-level citations does.
One objection comes up repeatedly when the series is pitched, and it pushes toward the broader name. Two scope claims complete the picture: what “Enterprise” really means as an architectural constraint, and which corpus shape the series handles.
RAG, in its strict sense, is retrieval-augmented question answering. The architecture the series defends covers more than that. Classification at ingestion, field extraction at scale, versioning, SQL aggregation, evaluation, security: several of these are not RAG in any standard sense. The SQL agent of Article 17 isn’t RAG at all; it is the point where retrieval ends and data systems take over. The follow-up volume adds translation, summarization, side-by-side comparison, redaction; these are also not RAG. “Document Intelligence” names the broader work; “RAG” is one of its modes, specifically the question-answering one.

The Enterprise qualifier is not a market segment. It is a constraint. The corpus is controlled, not the open web. The expert is in the loop, and the system amplifies what they already know. The audit trail is mandatory because every answer can be challenged. The dispatcher is deterministic because reproducibility matters. Open-domain assistants make different trade-offs. The series is for engineers building inside that constraint, and every architectural choice in it follows from it.
The series’ main case is a corpus of homogeneous, independent documents: a few thousand to a few hundred thousand PDFs of the same type. When the corpus mixes several types, the first move is to classify into groups (Article 15), then run a homogeneous pipeline per group. Each document is read on its own; the corpus index sits on top of all of them.
A case file (a credit application, a contract renewal, an insurance claim) is a small bundle of heterogeneous PDFs about one entity. The series stays on PDFs throughout, so a handful of small files can simply be concatenated and treated as a single document. This is where the table of contents pays off: several PDFs, each with its own TOC, read like one larger document with nested sections after concatenation, and the retrieval brick (Article 7) already knows how to navigate it. The follow-up volume builds proper case-file routing with per-document-type signals when the bundle gets too varied or too large to glue together.
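The concatenation idea can be sketched as a TOC merge. This is an illustration under stated assumptions, not the code from Article 7: each file's TOC is assumed to be a list of (level, title, page) entries, each file becomes a top-level section, its own entries are pushed one level down, and page numbers are shifted by the pages that precede it. TocEntry and merge_tocs are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TocEntry:
    level: int   # 1 = top-level section
    title: str
    page: int    # 1-based page number


def merge_tocs(files: list[tuple[str, int, list[TocEntry]]]) -> list[TocEntry]:
    # files: (file name, page count, that file's own TOC), in concatenation order.
    merged: list[TocEntry] = []
    offset = 0
    for name, page_count, toc in files:
        # The file itself becomes a top-level section of the combined document.
        merged.append(TocEntry(1, name, offset + 1))
        for entry in toc:
            # Nest the original entries one level deeper and shift their pages.
            merged.append(TocEntry(entry.level + 1, entry.title, entry.page + offset))
        offset += page_count
    return merged


# Made-up case file, for illustration only.
case_file = [
    ("application.pdf", 3, [TocEntry(1, "Applicant", 1), TocEntry(1, "Income", 2)]),
    ("contract.pdf", 5, [TocEntry(1, "Terms", 1), TocEntry(2, "Termination", 4)]),
]
merged = merge_tocs(case_file)
for entry in merged:
    print("  " * (entry.level - 1) + f"{entry.title} (p. {entry.page})")
```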
The harder shape is many case files, many document types per file: hundreds of cases, each with five to fifty heterogeneous documents inside. The orchestration on top of that exceeds what a single corpus index alone offers. The series names the case for honesty about scope and leaves the full treatment to the follow-up volume; the primitives built in Parts IV-V carry over.
If your archive is one of the homogeneous shapes, the series covers it end to end. If it is case-file shaped, expect this series to take you most of the way, and the follow-up volume to finish the job.
Enterprise Document Intelligence is a brick-by-brick series for engineers and data scientists building RAG on enterprise documents: contracts, technical reports, regulatory filings, where a wrong answer triggers a regulatory finding, a contract dispute, or a refund to a client. The series focuses on PDF as the document format, the dominant form for the documents enterprises actually want to query. Other formats (Word, Excel, PowerPoint, email) need their own parsing and structure logic and are covered by follow-up work.
The “amplify the expert” stance translates into concrete architectural choices the series defends, each tied to specific articles:
Behind those choices sit three positive principles that recur in every article. The work is pragmatic and expertise-driven: every choice gets judged on whether it builds on the accumulated knowledge of the people who already understand the documents. The architecture is pyramidal engineering: four named bricks (parsing, question parsing, retrieval, generation), each one a handful of named functions with explicit inputs and outputs, so a senior engineer can trace a request end-to-end in minutes. The data is relational at every brick: parsing produces tables, question parsing produces tables, retrieval queries them, generation writes a typed row back, never raw strings at any junction.

Three philosophical positions follow from the above and recur throughout: embeddings are not magic (Article 2), RAG is not machine learning (Article 3), evaluation is per-failure-mode, not aggregate (Article 20).
These positions come from building RAG in regulated industries: insurance, legal, financial services. They aren’t the only valid positions. They’re the ones that have held up in production where a wrong answer triggers a refund, a fine, or a lawsuit.
Build the minimal pipeline, watch where it cracks, reframe the discipline, then locate your own case before going further. Each article sets up the next, so the four can be read in one sitting before tools or frameworks enter the picture.

Parsing → question parsing → retrieval → generation. The four bricks that carry the rest of the series. What sets the architecture apart from generic RAG: every brick produces relational structured data (linked DataFrames, typed rows), never raw strings. The pipeline can be inspected, replayed, and audited at every junction.
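As a rough picture of "typed rows at every junction", the sketch below passes only keyed rows between the four bricks. It is an assumption-laden toy, not the series' code: the row types and function names are hypothetical, keyword matching stands in for retrieval, and generate returns the cited line verbatim where the real brick would call an LLM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineRow:          # parsing output: one row per line of the document
    file_id: str
    page: int
    line: int
    text: str


@dataclass(frozen=True)
class QuestionRow:      # question parsing output
    question_id: str
    text: str
    keywords: tuple


@dataclass(frozen=True)
class HitRow:           # retrieval output, linked to the other tables by keys
    question_id: str
    file_id: str
    page: int
    line: int


@dataclass(frozen=True)
class AnswerRow:        # generation output: a value plus its line citation
    question_id: str
    value: str
    file_id: str
    page: int
    line: int


def parse_document(file_id: str, pages: list[str]) -> list[LineRow]:
    return [LineRow(file_id, p, n, text)
            for p, page in enumerate(pages, 1)
            for n, text in enumerate(page.splitlines(), 1)]


def parse_question(question_id: str, text: str) -> QuestionRow:
    # Keep words longer than four characters as keywords (toy heuristic).
    words = tuple(w.strip("?.,").lower() for w in text.split() if len(w) > 4)
    return QuestionRow(question_id, text, words)


def retrieve(q: QuestionRow, lines: list[LineRow]) -> list[HitRow]:
    return [HitRow(q.question_id, r.file_id, r.page, r.line)
            for r in lines if any(k in r.text.lower() for k in q.keywords)]


def generate(q: QuestionRow, hits: list[HitRow], lines: list[LineRow]) -> Optional[AnswerRow]:
    # Stand-in for the LLM call: return the first cited line verbatim.
    if not hits:
        return None
    h = hits[0]
    text = next(r.text for r in lines
                if (r.file_id, r.page, r.line) == (h.file_id, h.page, h.line))
    return AnswerRow(q.question_id, text, h.file_id, h.page, h.line)


pages = ["Policy overview\nIssued by ACME", "Coverage amount: 50000\nDeductible: 500"]
lines = parse_document("doc1", pages)
question = parse_question("q1", "What is the coverage amount?")
hits = retrieve(question, lines)
answer = generate(question, hits, lines)
print(answer)
```

Because every intermediate result is a list of rows with shared keys, each junction can be dumped, diffed, and replayed without rerunning the bricks before it.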

parsing_method column tracks per-row provenance so adaptive parsing can mix fitz and Azure on the same document.
The whole pipeline assembled from Part II’s improvements, then extended. Article 1 ran the minimal pipeline end-to-end; Part II then improved each brick in isolation. Article 9 closes that loop: same kind of demo as Article 1, on the same paper, with every Part II improvement wired in together. Articles 10-12 then add specific complexity patterns: adaptive parsing (where generation tells parsing to escalate), cross-references, listing. Article 13 assembles every pattern into the orchestrator, wires the feedback loops that bound iteration, and is where the team’s accumulated wisdom lives.

Naive embedding search over thousands of documents fails. The same four bricks still apply, but each one needs a structural index in front of it. Article 14 sets the thesis with a minimal corpus pipeline run on five NIST PDFs, the kind of baseline that wastes four out of five LLM calls because nothing filtered the corpus first. Article 15 fixes the input side: a hierarchical cascade of questions populates a relational corpus_index, one row per document, columns for the searchable fields. Article 16 formalises the ontology that drives the cascade as five small tables hand-curated by the expert, and explains why a curated relational layer beats an LLM-extracted knowledge graph on every operational axis. Article 17 wires the query side: parse the question, filter the index, run the document-level pipeline only on the candidates the SQL agent returned.
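The "filter the index first" step can be pictured with an in-memory SQLite table. This is a sketch under assumptions, not the corpus_index of Article 15: the columns (doc_type, topic, year), the file names, and the rows are invented for illustration, the SQL is hand-written where the series uses a SQL agent, and answer_on_document is a placeholder for the document-level pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpus_index ("
    "file_id TEXT PRIMARY KEY, doc_type TEXT, topic TEXT, year INTEGER)"
)
# One row per document; all values here are made up.
conn.executemany(
    "INSERT INTO corpus_index VALUES (?, ?, ?, ?)",
    [
        ("doc_a.pdf", "guideline", "cryptography", 2020),
        ("doc_b.pdf", "guideline", "access control", 2019),
        ("doc_c.pdf", "report", "cryptography", 2015),
        ("doc_d.pdf", "guideline", "cryptography", 2012),
        ("doc_e.pdf", "report", "privacy", 2021),
    ],
)


def candidates(conn: sqlite3.Connection, doc_type: str, topic: str, min_year: int) -> list[str]:
    # Cheap, deterministic filter that runs before any LLM call.
    cursor = conn.execute(
        "SELECT file_id FROM corpus_index "
        "WHERE doc_type = ? AND topic = ? AND year >= ? ORDER BY file_id",
        (doc_type, topic, min_year),
    )
    return [row[0] for row in cursor]


def answer_on_document(file_id: str, question: str) -> str:
    # Placeholder for the single-document pipeline (parse, retrieve, generate).
    return f"(would run the document-level pipeline on {file_id})"


question = "Which key lengths does the guideline recommend?"
selected = candidates(conn, "guideline", "cryptography", 2015)
results = {file_id: answer_on_document(file_id, question) for file_id in selected}
print(results)
```

In this toy corpus the filter keeps one document out of five, so the expensive pipeline runs once instead of five times.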

corpus_qa_baseline run on five NIST PDFs that shows where the waste is.
corpus_index per document, with two execution paths (regex on filename, single-doc pipeline otherwise) and nomenclature normalisation (raw extraction → canonical entity). Real runs on 24 NIST PDFs and 30 arXiv papers.
corpus_index close the architecture.
The system is built. Now run it for years. The code architecture that lets several developers work in parallel, the storage layer that holds the replayable artifacts, per-failure-mode evaluation against a curated dataset (no aggregate-accuracy mirages), cost and latency measured as SQL aggregations on the same storage, and the security envelope wrapping all of it. RAG-specific concerns that generic ML-ops and generic security guides don’t address.
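A small illustration of why an aggregate score can mislead. The results below are synthetic, invented only to show the arithmetic, and the question-type labels are hypothetical. The grouping is done with the standard library here; on a real results table the same computation is a group-by on the question-type column.

```python
from collections import defaultdict

# Synthetic evaluation rows: 8 single-value questions, 2 listing questions.
results = (
    [{"question_type": "single_value", "correct": True}] * 8
    + [{"question_type": "listing", "correct": True}] * 1
    + [{"question_type": "listing", "correct": False}] * 1
)


def accuracy_by_type(rows: list[dict]) -> dict[str, float]:
    # Accumulate [number correct, number total] per question type.
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for row in rows:
        totals[row["question_type"]][0] += int(row["correct"])
        totals[row["question_type"]][1] += 1
    return {qtype: correct / total for qtype, (correct, total) in totals.items()}


aggregate = sum(row["correct"] for row in results) / len(results)
per_type = accuracy_by_type(results)

print(f"aggregate accuracy: {aggregate:.2f}")
for qtype, acc in per_type.items():
    print(f"{qtype}: {acc:.2f}")
```

On this synthetic data the aggregate is 0.90 while listing questions sit at 0.50: the headline number hides the one question type that fails half the time.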

file_id, question_id). Long format for storage, wide views for output. The llm_raw_json column and the query_log table are what evaluation, cost, and audit all read from.
pandas.groupby on a results table joined from Article 19’s storage. Aggregate metrics lie; per-question-type metrics tell the truth.
Each one is a cross-cutting practical concern that touches several main articles but doesn’t belong inside any single part.
line_df, separate table_df, columnar with named and typed columns, columnar but heterogeneous), a per-table diagnostic on five orthogonal axes, and a handful of idempotent operations that move tables between levels. Most tables stay at the simplest level; only the few that need it pay the cost of escalation.
Each article stands on its own. Each builds on the previous ones in a way that should feel natural: the same minimal pipeline from Article 1 grows into the architecture of Articles 18-19 and the security envelope of Article 22, with every addition motivated by a specific failure observed earlier.
Engineers building RAG systems on enterprise documents. Legal, insurance, financial services, regulated industries broadly, anywhere the cost of a wrong answer is measurable. If you’ve shipped a RAG system that worked on demos and broke on real users, this series is for you.
Data scientists who feel that ML intuitions don’t quite map to RAG. They don’t. The series makes the difference clear and actionable.
Tech leads making architectural decisions. When to use a vector database. When not to. When agentic patterns are worth their cost. When they aren’t. When to invest in deeper parsing. The series is opinionated on these calls and explains the reasoning.
Teams without internal experts on the documents. The series assumes you have, or can get to, the people who already know your corpus:
Almost every architectural choice in the series amplifies that expertise. If you’re building open-domain QA on a corpus nobody internal understands, the choices here will not transfer. There are settings where general-purpose retrieval and autonomous agents make more sense; this series is not about those.
Researchers on the frontier. This series is about production engineering, not novel methods. It cites recent research where relevant but doesn’t try to advance it.
Anyone looking for a magic framework. The series is the opposite. It’s about understanding what’s underneath the frameworks well enough to make deliberate choices. Sometimes that means using a framework. Often it means writing a hundred lines of plain code that work better than what the framework gave you.
The series focuses on RAG over PDF documents: search and generation for question answering. It doesn’t cover other document formats (Word, Excel, PowerPoint, email), side-by-side document comparison, structured data alongside documents (databases), translation pipelines, large-scale summarization, document generation, or autonomous agents on documents.
These are real enterprise needs. They’re left out because they’re operationally different from RAG-on-PDF. Mixing them in produces the confused architectures the series is trying to help readers avoid.
A follow-up volume, planned for after this one closes, picks up each on its own terms: other document formats (Word / Excel / PowerPoint / email), side-by-side comparison, translation, summarization, structured data alongside documents, document generation. Same engineering discipline, applied to different problem shapes.
The articles will publish daily, in order, starting with Article 1: A Minimal RAG, From PDF to Highlighted Answer.
It builds the entire pipeline in about a hundred lines of Python. It sets up every article that follows by surfacing the questions a working minimal version naturally raises.
If you’re building RAG in production and you think the industry’s defaults are wrong, this series is what to do about it.