Enterprise Document Intelligence: A Series on Building RAG Brick by Brick, from Minimal to Corpus scale

Articles published in the series:

Baseline Enterprise RAG, from PDF to highlighted answer. The pipeline that consumes the tables Azure Layout fills.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. How retrieval matches the cell text and figure OCR this engine recovers.
Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. Re-scoring the chunks built from these rows.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why parsing is engineering, not model training.
From regex to vision models: which RAG technique fits which problem. Which parsing technique fits which document.

Generative AI took off, and RAG showed up as the standard answer for “we have documents, we want to ask questions.” The pitch sounded miraculous. The implementation everyone described was the same one, over and over:

chunk the documents,
push the chunks into a vector store,
embed the question,
retrieve top-k by cosine similarity, optionally rerank,
send the hits to an LLM.

Vendors converged on it. Consulting decks converged on it. Conference talks converged on it.

*The RAG recipe everyone described: chunk, vector store, top-k cosine, optional rerank, LLM – Image by author*

Then the deployments started shipping, and the results were often disappointing.

Users didn’t trust the answers.
Citations were vague or missing.
Retrieved passages were beside the point as often as they were useful.

And the team’s reflex, every time, was to pull more tools from the same toolbox:

a stronger model,
a longer context window,
a better reranker,
more MLOps for the production side.

The framing was always the same: “this is an IT problem. Better infrastructure, better tools, better models will fix it.”

I started looking at it myself, on real enterprise documents, with real domain experts in the room. My experience didn’t match that framing.

The work that made a real difference wasn’t infrastructural. It was engineering, plus understanding the business domain, plus a little of the underlying math. Not deep math. Just enough to see what an embedding actually measures, what a reranker actually does, why a particular trick helps in some cases and hurts in others. And then, the piece most teams skip: knowing the documents the system is supposed to answer questions on. Who reads them. What they contain. What vocabulary the experts use. What questions come up week after week.

Most companies aren’t Google. They’re not research labs either. They’re not running open-domain QA over the open web. They’re not training their own embedding models. They have a few core document types, a few dozen domain experts who already know the corpus inside out, and a recurring set of questions that need answers with citations and an audit trail. The right architecture for that context is not what vendor decks pitch and not what research papers chase. It’s an architecture that amplifies the experts and uses cheap, predictable retrieval where it can.

Most of the RAG systems I’ve seen in enterprise production are worse than a hundred-line Python script. The basics are broken, and stacking more on top doesn’t help. Embeddings are too fuzzy in meaning to pick the right passage, and parsing is sloppy enough that the LLM gets garbage in, garbage out.

When a system like that starts to break, the standard reflex is to add layers:

a re-ranker,
a fine-tuned embedding model nobody can tell is helping,
a query-rewriter agent,
a grader agent,
an orchestrator framework that turns every question into ten LLM calls.

Each layer adds plausibility to the demo. None of them fixes the foundation: there is still no way to tell whether the retrieved passages are the right ones, and still no way to explain to a user why a particular page came back.

The script we’ll build in the first article fits in about a hundred lines and has no vector database, no framework, and no agents.

It takes a PDF and a question, parses, retrieves the top three pages by simple cosine similarity, sends them to an LLM with a Pydantic schema, and returns a structured answer with line citations and a highlighted source PDF.
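To make the retrieval step concrete, here is a minimal sketch of ranking pages by cosine similarity. It is not the script from Article 1: the bag-of-words vector is a toy stand-in for a real embedding, and the names `bag_of_words`, `cosine`, `top_pages` and the sample pages are hypothetical, chosen for illustration only.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_pages(pages: list[str], question: str, k: int = 3) -> list[tuple[int, float]]:
    """Return (page_number, score) for the k most similar pages, 1-indexed."""
    q = bag_of_words(question)
    scored = [(i + 1, cosine(bag_of_words(p), q)) for i, p in enumerate(pages)]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]


# Hypothetical page texts, one string per page.
pages = [
    "The deductible for water damage is 500 euros per claim.",
    "This policy is effective from the first of January.",
    "Claims must be reported within five working days.",
    "The insurer may terminate the policy with written notice.",
]
hits = top_pages(pages, "What is the deductible for water damage?")
```

Swapping in a real embedding model changes how the vectors are built and compared, not the shape of the ranking step.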

That script is more verifiable and more useful than many of the production systems I’ve seen up close. The gap between the two isn’t prompt engineering, and it isn’t a better retrieval algorithm. It comes from three habits the industry skips: knowing the documents, knowing what the experts already know, and not confusing RAG with machine learning.

This series wires those habits into a four-brick pipeline: document parsing, question parsing, retrieval, generation, with an optional PDF annotation step that hands the citation back to the reader.

*Four bricks plus PDF annotation, with the data named on every arrow – Image by author*


1. How RAG is used in enterprise

1.1 The 2020 paper: retrieval as context

In May 2020, Patrick Lewis and colleagues at Facebook AI Research coined the term in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al. 2020, arXiv preprint). The paper’s introduction names the three failings the architecture was meant to fix, quoted directly:

Pre-trained models “cannot easily expand or revise their memory, can’t straightforwardly provide insight into their predictions, and may produce ‘hallucinations’.” The fix combined a generator (BART) with a dense vector index over Wikipedia, accessed at inference time. The architectural move that mattered: pull a passage from a corpus, hand it to the LLM, let it generate from that context rather than from training-time memory alone.

Those three failings map cleanly onto the three properties enterprise RAG fights for: corpus freshness, citations, grounded answers. The series is a direct continuation of that 2020 line of thinking, applied to enterprise constraints.

1.2 What “RAG” means in this series

For most developers today, “RAG” has narrowed to mean one specific recipe: a vector store, embedding-similarity retrieval, and an LLM at the end.

The series will keep using the word, but in its broader original sense: information extraction, information search, and question answering over a corpus of documents.

The retrieval mechanism is a design choice the architecture admits, not part of the definition. Many of the enterprise pipelines this series defends do not use a vector store at all; some use it as one channel among several, never as the foundation. When the announcement says “RAG”, read it in that broader sense.

1.3 Extraction comes first in enterprise

The popular framing of RAG (the LLM writes a fluid natural-language answer from retrieved context) under-describes what enterprises actually do with it.

The bulk of the work is information extraction: pulling specific values from documents, with the LLM acting as a structured reader rather than as a writer.

An underwriter needs a coverage amount, a deductible, an effective date.
A compliance officer needs the list of clauses that survive termination.
A paralegal needs the named parties of a contract.

The LLM reads the retrieved passage, identifies the answer, and returns it in a typed schema with line citations. That is extraction, with some light reformatting and cleanup. It is not generation in the creative sense.

Where the LLM is allowed to compose new text in enterprise work, it does so over content the system has already extracted and validated. The series defends a sharp separation: phase one extracts the relevant information with citations, validates it, audits it. Phase two, on top of phase one’s typed output, may compose a longer narrative (a draft notice, a summary paragraph for a report).

Two phases, two LLM calls, two audit surfaces. The audit trail collapses when one LLM call mixes retrieval, extraction, and creative composition. The architecture refuses that conflation.
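The two-phase separation can be sketched in a few lines. This is an illustration under stated assumptions, not the series’ implementation: the `ExtractedField` record, the `audit` and `compose_summary` functions, and the sample values are all hypothetical, standard-library dataclasses stand in for a Pydantic schema, and a plain template stands in for the second LLM call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    """Phase-one output: one typed value plus the citation that backs it."""
    name: str
    value: str
    page: int
    line: int


def audit(fields: list[ExtractedField]) -> list[ExtractedField]:
    """Keep only fields that carry a usable citation (page and line >= 1)."""
    return [f for f in fields if f.page >= 1 and f.line >= 1]


def compose_summary(fields: list[ExtractedField]) -> str:
    """Phase two: compose text from validated rows only, never from raw passages.

    A plain template stands in for the second LLM call.
    """
    parts = [f"{f.name} = {f.value} (p.{f.page}, l.{f.line})" for f in fields]
    return "Extracted terms: " + "; ".join(parts) + "."


# Hypothetical phase-one output; the second field has no citation.
phase_one = [
    ExtractedField("deductible", "500 EUR", page=3, line=12),
    ExtractedField("effective_date", "2025-01-01", page=0, line=0),
]
validated = audit(phase_one)
summary = compose_summary(validated)
```

Because phase two only ever sees `validated`, anything it writes can be traced back to a cited row, which is the point of keeping the two audit surfaces apart.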

1.4 The shift from augmented to grounded

The 2020 paper picked Augmented over alternatives like Grounded or Conditioned. The word choice carries weight. In the 2020 framework, the generator is free to blend its parametric memory with the retrieved passages. The LLM keeps using what it learned during pre-training and consults the retrieval. Two memories, mixed. Augmented presupposes that something is already there; retrieval adds to it. Grounded would have meant the opposite: the generation rests on the retrieval, anchored to it, and the model is constrained not to stray from what was retrieved.

Enterprise production inverts that assumption. Every factual claim must be backed by a retrieved passage; the LLM’s parametric memory is exiled from the factual content of the answer and kept only for procedural use: grammar, schema-following, verbatim span extraction, arithmetic on cited values, deduction over retrieved facts. The shift from augmented to grounded is small lexically and large operationally. When the LLM rephrases a retrieved clause into a JSON field coverage_amount: 50000, the rephrasing follows English grammar and JSON syntax: that is procedural, and we keep it. When it fills a valid_until: "2027-12-31" field with a date that is not in the retrieved text, that is factual, and we block it.
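One way to picture the factual-versus-procedural rule is a check that every returned value occurs in the retrieved text. The sketch below is a deliberately naive version of that idea, assuming string fields and exact matching after light normalization; `normalize`, `ungrounded_fields`, and the sample passage are hypothetical names and data, not part of the series’ code.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop comma thousands separators so '50,000' matches '50000'."""
    return re.sub(r"(?<=\d),(?=\d{3}\b)", "", text.lower())


def ungrounded_fields(answer: dict[str, str], passage: str) -> list[str]:
    """Return the names of fields whose value does not occur in the passage."""
    haystack = normalize(passage)
    return [name for name, value in answer.items()
            if normalize(str(value)) not in haystack]


# Hypothetical retrieved passage and LLM output.
passage = "The coverage amount is 50,000 EUR. Claims are subject to a deductible."
answer = {"coverage_amount": "50000", "valid_until": "2027-12-31"}

blocked = ungrounded_fields(answer, passage)
```

Here `coverage_amount` passes because the reformatting is procedural, while `valid_until` is flagged because the date appears nowhere in the passage. A production check would need more than substring matching (date formats, units, paraphrase), so treat this as the shape of the rule rather than a complete guard.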

The series keeps the architecture from the 2020 paper and narrows what the LLM is allowed to do with its parametric memory.

*Academic RAG blends stored and retrieved knowledge; enterprise grounds answers in retrieval only – Image by author*

1.5 Long context isn’t a substitute

A million-token window does not collapse the enterprise corpus to one prompt. The corpus is thousands to hundreds of thousands of documents, and finding the right one still has to happen before any LLM call. And a long-context answer drawn from a million-token blob cannot tell the user which page backs which claim. RAG with line-level citations does.

2. Why “Enterprise Document Intelligence” rather than “Enterprise RAG”

One objection comes up repeatedly when the series is pitched, and it pushes toward the broader name. Two scope claims complete the picture: what “Enterprise” really means as an architectural constraint, and which corpus shape the series handles.

2.1 RAG names one mode of the work, not all of it

RAG, in its strict sense, is retrieval-augmented question answering. The architecture the series defends covers more than that. Classification at ingestion, field extraction at scale, versioning, SQL aggregation, evaluation, security: several of these are not RAG in any standard sense. The SQL agent of Article 17 isn’t RAG at all; it is the point where retrieval ends and data systems take over. The follow-up volume adds translation, summarization, side-by-side comparison, redaction; these are also not RAG. “Document Intelligence” names the broader work; “RAG” is one of its modes, specifically the question-answering one.

*Volume 1: deep on RAG-QA over PDFs. Volume 2: other tasks, same discipline – Image by author*

2.2 “Enterprise” as an architectural constraint

The Enterprise qualifier is not a market segment. It is a constraint. The corpus is controlled, not the open web. The expert is in the loop, and the system amplifies what they already know. The audit trail is mandatory because every answer can be challenged. The dispatcher is deterministic because reproducibility matters. Open-domain assistants make different trade-offs. The series is for engineers building inside that constraint, and every architectural choice in it follows from it.

2.3 The shape of the corpus the series handles

The series’ main case is a corpus of homogeneous, independent documents: a few thousand to a few hundred thousand PDFs of the same type. When the corpus mixes several types, the first move is to classify into groups (Article 15), then run a homogeneous pipeline per group. Each document is read on its own; the corpus index sits on top of all of them.

A case file (a credit application, a contract renewal, an insurance claim) is a small bundle of heterogeneous PDFs about one entity. The series stays on PDFs throughout, so a handful of small files can simply be concatenated and treated as a single document. This is where the table of contents pays off: several PDFs, each with its own TOC, read like one larger document with nested sections after concatenation, and the retrieval brick (Article 7) already knows how to navigate it. The follow-up volume builds proper case-file routing with per-document-type signals when the bundle gets too varied or too large to glue together.

The harder shape is many case files, many document types per file: hundreds of cases, each with five to fifty heterogeneous documents inside. The orchestration on top of that exceeds what a single corpus index alone offers. The series names the case for honesty about scope and leaves the full treatment to the follow-up volume; the primitives built in Parts IV-V carry over.

If your archive is one of the homogeneous shapes, the series covers it end to end. If it is case-file shaped, expect this series to take you most of the way, and the follow-up volume to finish the job.

3. What this series is

Enterprise Document Intelligence is a brick-by-brick series for engineers and data scientists building RAG on enterprise documents: contracts, technical reports, regulatory filings, where a wrong answer triggers a regulatory finding, a contract dispute, or a refund to a client. The series focuses on PDF as the document format, the dominant form for the documents enterprises actually want to query. Other formats (Word, Excel, PowerPoint, email) need their own parsing and structure logic and are covered by follow-up work.

The “amplify the expert” stance translates into concrete architectural choices the series defends, each tied to specific articles:

Deterministic dispatchers over autonomous agents. Experts can audit a deterministic flow. They cannot audit an agent that decides on its own which tool to call, which sub-question to issue, and when to stop. The agent saves engineering effort on the demo and pays it back during incidents that can’t be reproduced because the routing was non-deterministic. The series defends a dispatched architecture where every routing decision is explicit, logged, and inspectable. Article 13 builds it.
Vector stores are a fallback, not a foundation. Experts already know the keywords. The vector store earns its place when keyword retrieval fails: paraphrase, cross-language, polysemy, “vehicle parked at night” matching “car overnight.” It shouldn’t be where retrieval starts. On most enterprise corpora, structure-first retrieval (TOC, classification, expert keywords) outperforms cosine similarity. Articles 2 and 7 develop the case.
Expert dictionaries beat better embedding models. Domain vocabulary is the single most valuable artifact in the system. The synonyms, the disambiguations, the cross-product equivalences (“franchise = deductible”, “ShieldPro Elite = top-tier homeowners plan”) cannot be recovered by an IDF formula or by embedding similarity; they have to be elicited from the people who use the vocabulary every day. Article 6 makes the dictionary the central object of question parsing.
Rerankers are mostly redundant in enterprise RAG. They are worth their cost on one narrow shape (large generic candidate pool, no curated pipeline upstream). The architectural moves the series defends (expert vocabulary, structure-aware retrieval, classify-before-retrieve) make them redundant on the questions that matter. Article 2 bis runs the empirical test.
Refuse the “connect everything to a vector store” pattern. That pattern is optimized for the hyperscaler’s business model, not the customer’s accuracy. Classify before indexing. Filter before retrieving. Aggregate with SQL when the question is statistical. RAG handles content lookup; SQL handles counting; the corpus index sits in between. Articles 14-17 make this the core of the corpus-scale architecture.
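The “dictionary first, vector store as fallback” ordering can be sketched as follows. This is a minimal illustration, assuming a tiny hand-written dictionary and substring matching on lines; `EXPERT_DICTIONARY`, `canonical_terms`, `surface_forms`, `retrieve`, and the stubbed fallback are hypothetical, and the fallback merely stands in for an embedding search.

```python
import re
from typing import Callable

# Hypothetical expert dictionary: surface term -> canonical term.
EXPERT_DICTIONARY = {"franchise": "deductible", "excess": "deductible"}


def canonical_terms(question: str, dictionary: dict[str, str]) -> set[str]:
    """Map question words onto the expert vocabulary; ignore everything else."""
    vocabulary = set(dictionary) | set(dictionary.values())
    words = re.findall(r"[a-z]+", question.lower())
    return {dictionary.get(w, w) for w in words if w in vocabulary}


def surface_forms(canonical: str, dictionary: dict[str, str]) -> set[str]:
    """All spellings the experts use for one canonical term."""
    return {canonical} | {s for s, c in dictionary.items() if c == canonical}


def retrieve(question: str, lines: list[str], dictionary: dict[str, str],
             fallback: Callable[[str, list[str]], list[int]]) -> tuple[str, list[int]]:
    """Keyword channel first; call the fallback only when it finds nothing."""
    terms = canonical_terms(question, dictionary)
    forms = set().union(*(surface_forms(t, dictionary) for t in terms))
    hits = [i for i, line in enumerate(lines)
            if any(f in line.lower() for f in forms)]
    if hits:
        return "keyword", hits
    return "fallback", fallback(question, lines)


lines = ["The deductible is 500 EUR per claim.",
         "Vehicles parked overnight must be locked."]
stub_embedding_search = lambda question, lines: [1]  # stand-in, not a real search

by_keyword = retrieve("What is the franchise?", lines, EXPERT_DICTIONARY, stub_embedding_search)
by_fallback = retrieve("Is my car covered at night?", lines, EXPERT_DICTIONARY, stub_embedding_search)
```

The returned channel name (`"keyword"` or `"fallback"`) is what makes the routing decision explicit and loggable, in the spirit of the deterministic dispatcher described above.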

Behind those choices sit three positive principles that recur in every article. The work is pragmatic and expertise-driven: every choice gets judged on whether it builds on the accumulated knowledge of the people who already understand the documents. The architecture is pyramidal engineering: four named bricks (parsing, question parsing, retrieval, generation), each one a handful of named functions with explicit inputs and outputs, so a senior engineer can trace a request end-to-end in minutes. The data is relational at every brick: parsing produces tables, question parsing produces tables, retrieval queries them, generation writes a typed row back, never raw strings at any junction.

*One PDF in, eight linked tables out. Every later brick reads from these – Image by author*

Three philosophical positions follow from the above and recur throughout: embeddings are not magic (Article 2), RAG is not machine learning (Article 3), evaluation is per-failure-mode, not aggregate (Article 20).

These positions come from building RAG in regulated industries: insurance, legal, financial services. They aren’t the only valid positions. They’re the ones that have held up in production where a wrong answer triggers a refund, a fine, or a lawsuit.

4. What’s in the series

Part I: What works, what breaks

Build the minimal pipeline, watch where it cracks, reframe the discipline, then locate your own case before going further. Each article sets up the next, so the four can be read in one sitting before tools or frameworks enter the picture.

*The 5×5 case grid from Article 4. Place your problem before picking a technique – Image by author*

Article 1: A Minimal RAG, From PDF to Highlighted Answer. The whole pipeline in ~100 lines. PDF in, structured JSON out, source lines highlighted on the PDF.

Article 2: Embeddings Aren’t Magic. The predictable failure modes of RAG retrieval: negation, exact values, internal acronyms, topical proximity. Where the minimal version starts to break.
Article 2 bis (companion): Rerankers Aren’t Magic Either. Cross-encoder rerankers fix the literal-token traps embeddings collapse, but share the same structural failure modes (negation, exact identifiers, listing, out-of-domain vocabulary). The editorial position: fallback for narrow cases, not a primary stage.
Article 3: RAG Is Not Machine Learning. The misconception that costs RAG projects the most. RAG is search plus a generation layer, not a model to train.
Article 4: Which RAG Technique Fits Which Problem. Diagnostic step before any technical choice. Position your problem on the 5×5 grid (document complexity × question control), then pick the simplest technique that works.

Part II: The four bricks

Parsing → question parsing → retrieval → generation. The four bricks that carry the rest of the series. What sets the architecture apart from generic RAG: every brick produces relational structured data (linked DataFrames, typed rows), never raw strings. The pipeline can be inspected, replayed, and audited at every junction.

*Brick 2 mirrors brick 1: one question, one row, satellite tables for keywords and scope – Image by author*

Article 5: The Rich Output of a Good RAG Parser. Brick 1: lines, tables, images, columns, TOC, cross-references. Everything lost at parsing cannot be recovered downstream.

Article 5 bis (companion): When PyMuPDF Can’t See the Table. Parsing with Azure Document Intelligence. Same eight DataFrames, second engine. Azure adds native table cells, OCR text inside figures, deterministic captions, and a TOC reconstructed from paragraph roles when no native bookmarks exist. The parsing_method column tracks per-row provenance so adaptive parsing can mix fitz and Azure on the same document.
Article 6: Question Parsing in RAG. Structure Before You Search. Brick 2: a question is an unstructured input parsed into a relational set of tables, symmetric to document parsing.
Article 7: Why Embeddings Come Last in Production RAG Retrieval. Brick 3: retrieval is filtering structured DataFrames, not searching free text. Embeddings are the fallback, not the default.
Article 8: Generation as Controlled Execution. Brick 4: typed input (passages plus question), typed output (Pydantic). The schema is the contract; one prompt template per answer shape.

Part III: Pipelines on a single document

The whole pipeline assembled from Part II’s improvements, then extended. Article 1 ran the minimal pipeline end-to-end; Part II then improved each brick in isolation. Article 9 closes that loop: same kind of demo as Article 1, on the same paper, with every Part II improvement wired in together. Articles 10-12 then add specific complexity patterns: adaptive parsing (where generation tells parsing to escalate), cross-references, listing. Article 13 assembles every pattern into the orchestrator, wires the feedback loops that bound iteration, and is where the team’s accumulated wisdom lives.

Article 9: The Full Pipeline, End-to-End, Putting Part II Together. Article 1 ran the pipeline minimally. Part II said how each brick can do better, in isolation. This article runs the same kind of demo as Article 1, on the same Transformer paper, with every Part II improvement wired in: richer parsing, expert-keyword question parsing with typo handling, retrieval methods combined (TOC plus keyword plus embedding with score fusion and an optional LLM arbiter), structured generation with the full schema. The gap between minimal and integrated, shown end-to-end on the same questions.
Article 10: Adaptive PDF Parsing. Cheap parsing first; advanced parsing only where the question demands it. Adaptive escalation driven by generation feedback.
Article 11: How RAG Handles Cross-References in Contracts and Standards. The real challenge of “complex” documents isn’t length, it’s interconnection. Two-hop retrieval that follows references.
Article 12: When RAG Has to Find All the Answers: Listing Questions. “What are all the X?” The answer isn’t in one passage, it’s distributed. Sweep, not top-k, with explicit completeness signals.
Article 13: From One RAG Pipeline to Many: The Composite Pipeline Pattern. Assembling every pattern into a single working system. The orchestrator and dispatcher are the team’s accumulated wisdom in code; bounded feedback loops, drift detection, and the full audit trail live here too.
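As a rough picture of the two-hop idea from Article 11, the sketch below follows explicit section references from a first hit to the sections it points at. The regular expression, the `two_hop` function, and the sample sections are assumptions made for illustration; real contracts use many more reference styles than “Section 7.2”.

```python
import re

# Assumed reference style: "Section 7.2", "section 12", "Section 3.1.4".
SECTION_REF = re.compile(r"[Ss]ection (\d+(?:\.\d+)*)")


def two_hop(section_id: str, sections: dict[str, str]) -> dict[str, str]:
    """Return the starting section plus every section it explicitly cites."""
    first = sections[section_id]
    referenced = [ref for ref in SECTION_REF.findall(first)
                  if ref in sections and ref != section_id]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    ordered = [section_id, *dict.fromkeys(referenced)]
    return {sid: sections[sid] for sid in ordered}


# Hypothetical parsed document: section number -> section text.
sections = {
    "4.1": "The deductible is set out in Section 7.2.",
    "7.2": "The deductible is 500 EUR per claim.",
    "9": "This agreement is governed by the law of the seat of the insurer.",
}
context = two_hop("4.1", sections)
```

The first hop alone would hand the LLM a sentence that names the deductible without stating it; the second hop brings in the section that carries the value.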

Part IV: From one document to a whole archive

Naive embedding search over thousands of documents fails. The same four bricks still apply, but each one needs a structural index in front of it. Article 14 sets the thesis with a minimal corpus pipeline run on five NIST PDFs, the kind of baseline that wastes four out of five LLM calls because nothing filtered the corpus first. Article 15 fixes the input side: a hierarchical cascade of questions populates a relational corpus_index, one row per document, columns for the searchable fields. Article 16 formalises the ontology that drives the cascade as five small tables hand-curated by the expert, and explains why a curated relational layer beats an LLM-extracted knowledge graph on every operational axis. Article 17 wires the query side: parse the question, filter the index, run the document-level pipeline only on the candidates the SQL agent returned.

Article 14: Your RAG works on one PDF. Now make it work on ten thousand. Part IV thesis. Five failure modes of naive vector RAG at scale, the mirror principle (4 bricks for one doc → 4 bricks for the corpus), a minimal corpus_qa_baseline run on five NIST PDFs that shows where the waste is.
Article 15: From a folder of PDFs to a queryable RAG corpus, one question at a time. Brick 1 supercharged. A hierarchical cascade of questions populates the corpus_index per document, with two execution paths (regex on filename, single-doc pipeline otherwise) and nomenclature normalisation (raw extraction → canonical entity). Real runs on 24 NIST PDFs and 30 arXiv papers.
Article 16: Why your enterprise RAG needs an ontology, not a knowledge graph. The keystone. The expert’s knowledge codified as five relational tables (cascade rules, concept keywords, concept relations, concept-to-doctype routing, nomenclature). Wins on auditability, cost, maintenance, freshness, ownership. Three sectors (NIST cybersecurity, arXiv NLP/IR, fictional insurance broker) prove the pattern transfers. Anti-GraphRAG is the consequence, not the slogan.
Article 17: How RAG answers a question across a corpus: SQL filter first, retrieval second. Bricks 2-3-4 supercharged. The orchestrator detects intent (column / docs / hybrid), runs the SQL agent or filter-then-retrieve, dispatches generation. Three real runs on the NIST corpus_index close the architecture.
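The “SQL filter first, retrieval second” ordering can be illustrated with an in-memory SQLite table. Only the table name `corpus_index` and the idea of one row per document come from the series; the columns `doc_type` and `year`, the sample rows, and the two helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpus_index (file_id TEXT PRIMARY KEY, doc_type TEXT, year INTEGER)"
)
# Hypothetical rows: one per document.
conn.executemany(
    "INSERT INTO corpus_index VALUES (?, ?, ?)",
    [("a1", "policy", 2021), ("a2", "policy", 2023),
     ("a3", "claim", 2023), ("a4", "policy", 2024)],
)


def count_documents(conn: sqlite3.Connection, doc_type: str) -> int:
    """Statistical question: answered by SQL alone, no LLM call."""
    row = conn.execute(
        "SELECT COUNT(*) FROM corpus_index WHERE doc_type = ?", (doc_type,)
    ).fetchone()
    return row[0]


def candidate_files(conn: sqlite3.Connection, doc_type: str, min_year: int) -> list[str]:
    """Content question: narrow the corpus before any document-level pipeline runs."""
    rows = conn.execute(
        "SELECT file_id FROM corpus_index WHERE doc_type = ? AND year >= ? ORDER BY file_id",
        (doc_type, min_year),
    )
    return [r[0] for r in rows]


n_policies = count_documents(conn, "policy")
candidates = candidate_files(conn, "policy", 2023)
```

In this toy corpus the document-level pipeline would run on two files instead of four; the counting question never reaches an LLM at all.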

Part V: Operating in production

The system is built. Now run it for years. The code architecture that lets several developers work in parallel, the storage layer that holds the replayable artifacts, per-failure-mode evaluation against a curated dataset (no aggregate-accuracy mirages), cost and latency measured as SQL aggregations on the same storage, and the security envelope wrapping all of it. RAG-specific concerns that generic ML-ops and generic security guides don’t address.

*The four-layer package layout that survives years of evolution. Article 18 draws the function map – Image by author*

Article 18: Code Architecture for Enterprise RAG: Four Layers and a Function Map. The package layout that survives years of evolution. Four layers (core, storage, annotation, pipeline) with unidirectional dependencies, one method per script, and the function map that anchors every brick to its dispatcher and sub-functions.
Article 19: Storage for Enterprise RAG: One Base for Everything You Measure. Around thirty relational tables in five sub-schemas, anchored on two hash-based identifiers (file_id, question_id). Long format for storage, wide views for output. The llm_raw_json column and the query_log table are what evaluation, cost, and audit all read from.
Article 20: Evaluating Enterprise RAG: Measure the Process, Not the Model. Per-failure-mode evaluation as a pandas groupby on a results table joined from Article 19’s storage. Aggregate metrics lie; per-question-type metrics tell the truth.
Article 21: Cost and Latency in Enterprise RAG: Measuring from the Storage. Same source tables as Article 20, different aggregations. Tokens, latency, alerts, versioning. Self-hosted Ollama tier-1 benchmark on the broker domain.
Article 22: Security and Compliance for Enterprise RAG. Closing chapter. Prompt injection through documents, tenant isolation, GDPR on derived data, audit trail, document-level access control, self-hosted confidentiality boundary: the enterprise-specific layer generic security guides don’t address.
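To see why an aggregate score can mislead, here is a small sketch of per-failure-mode accuracy. Plain Python stands in for the pandas groupby mentioned in Article 20’s summary; the `failure_mode` and `correct` columns and the six rows are made up for illustration and are not results from the series.

```python
from collections import defaultdict

# Hypothetical results table: one row per evaluated question.
results = [
    {"question_id": "q1", "failure_mode": "lookup", "correct": True},
    {"question_id": "q2", "failure_mode": "lookup", "correct": True},
    {"question_id": "q3", "failure_mode": "lookup", "correct": True},
    {"question_id": "q4", "failure_mode": "lookup", "correct": True},
    {"question_id": "q5", "failure_mode": "negation", "correct": False},
    {"question_id": "q6", "failure_mode": "negation", "correct": False},
]


def accuracy_by(rows: list[dict], key: str) -> dict[str, float]:
    """Group rows by `key` and return the share of correct answers per group."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for row in rows:
        totals[row[key]][0] += int(row["correct"])
        totals[row[key]][1] += 1
    return {group: right / n for group, (right, n) in totals.items()}


overall = sum(r["correct"] for r in results) / len(results)
per_mode = accuracy_by(results, "failure_mode")
```

On this toy table the overall figure is about 0.67, which hides the fact that every negation question failed while every lookup question passed.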

Bonus articles

Each one is a cross-cutting practical concern that touches several main articles but doesn’t belong inside any single part.

B01: Spelling Variants in RAG. Why Spell-Check Alone Isn’t Enough. Forty years of classical spell-correction (Levenshtein, BK-tree, Soundex, SymSpell) handles most single-word typos. Embeddings and LLMs absorb the rest. The practical split for enterprise RAG: spell-correct the question against the corpus vocabulary at parse time; for documents, clean the canonical references once, leave volume noisy and design retrieval around the noise.
B02: FAQ as RAG. When You Get to Design the Corpus. A controlled-corpus counterpoint to the rest of the series. Standard RAG assumes you inherit a chaotic corpus; FAQ flips it. Parsing becomes trivial, retrieval doubles as a cache, and few-shot prompting itself becomes a retrieval problem. Closes with the feedback loop that turns the FAQ into a living corpus driven by the question stream.
B03: When the RAG Says “I Don’t Know”. Justifying the Absence of an Answer. A confident wrong answer is a bug. A bare “no answer” with no justification is almost as bad. Each of the four bricks owes the user one piece of evidence: what was parsed, which vocabulary was searched, which pages were swept, why nothing matched. The “I don’t know” becomes auditable instead of opaque.
B04: Tables in PDFs for RAG. Don’t Flatten the Grid. Tables are where most RAG pipelines silently fail. A linear decision tree across table types does not work because the dimensions cross. The right pattern is four levels of representation (row-as-line in line_df, separate table_df, columnar with named and typed columns, columnar but heterogeneous), a per-table diagnostic on five orthogonal axes, and a handful of idempotent operations that move tables between levels. Most tables stay at the simplest level; only the few that need it pay the cost of escalation.
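The parse-time spell-correction described in B01 can be approximated with the standard library. The sketch below uses `difflib.get_close_matches`, whose similarity ratio is not Levenshtein distance or SymSpell, so it is a swapped-in technique chosen for brevity; the `correct_question` function, the cutoff of 0.8, and the sample vocabulary are assumptions.

```python
import difflib
import re


def correct_question(question: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Replace each word by its closest corpus-vocabulary term, if one is close enough."""
    corrected = []
    for word in re.findall(r"\w+", question.lower()):
        if word in vocabulary:
            corrected.append(word)
            continue
        match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)


# Hypothetical vocabulary extracted from the corpus at parsing time.
vocabulary = ["coverage", "deductible", "policy", "termination"]
fixed = correct_question("What is the deductable", vocabulary)
```

Words with no close neighbour in the vocabulary are left untouched, so ordinary question words pass through while domain terms are pulled toward the spelling the corpus actually uses.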

Each article stands on its own. Each builds on the previous ones in a way that should feel natural: the same minimal pipeline from Article 1 grows into the architecture of Articles 18-19 and the security envelope of Article 22, with every addition motivated by a specific failure observed earlier.

5. Who this is for

Engineers building RAG systems on enterprise documents. Legal, insurance, financial services, regulated industries broadly, anywhere the cost of a wrong answer is measurable. If you’ve shipped a RAG system that worked on demos and broke on real users, this series is for you.

Data scientists who feel that ML intuitions don’t quite map to RAG. They don’t. The series makes the difference clear and actionable.

Tech leads making architectural decisions. When to use a vector database. When not to. When agentic patterns are worth their cost. When they aren’t. When to invest in deeper parsing. The series is opinionated on these calls and explains the reasoning.

6. Who this isn’t for

Teams without internal experts on the documents. The series assumes you have, or can get to, the people who already know your corpus:

lawyers who read the contracts,
underwriters who set the deductibles,
compliance officers who track the regulations.

Almost every architectural choice in the series amplifies that expertise. If you’re building open-domain QA on a corpus nobody internal understands, the choices here will not transfer. There are settings where general-purpose retrieval and autonomous agents make more sense; this series is not about those.

Researchers on the frontier. This series is about production engineering, not novel methods. It cites recent research where relevant but doesn’t try to advance it.

Anyone looking for a magic framework. The series is the opposite. It’s about understanding what’s underneath the frameworks well enough to make deliberate choices. Sometimes that means using a framework. Often it means writing a hundred lines of plain code that work better than what the framework gave you.

7. What this series doesn’t cover

The series focuses on RAG over PDF documents: search and generation for question answering. It doesn’t cover other document formats (Word, Excel, PowerPoint, email), side-by-side document comparison, structured data alongside documents (databases), translation pipelines, large-scale summarization, document generation, or autonomous agents on documents.

These are real enterprise needs. They’re left out because they’re operationally different from RAG-on-PDF. Mixing them in produces the confused architectures the series is trying to help readers avoid.

A follow-up volume, planned for after this one closes, picks up each on its own terms: other document formats (Word / Excel / PowerPoint / email), side-by-side comparison, translation, summarization, structured data alongside documents, document generation. Same engineering discipline, applied to different problem shapes.

8. How to follow the series

The articles will publish daily, in order, starting with Article 1: A Minimal RAG, From PDF to Highlighted Answer.

It builds the entire pipeline in about a hundred lines of Python. It sets up every article that follows by surfacing the questions a working minimal version naturally raises.

If you’re building RAG in production and you think the industry’s defaults are wrong, this series is what to do about it.