The fastest way to understand RAG is to build the smallest version that actually works, run it on a real document, and look closely at what just happened.
That’s this article. About a hundred lines of Python (no vector database, no framework, no agents) running on the Attention Is All You Need paper (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page), returning a sourced answer with the exact source lines highlighted on the page.
Then we walk back through each block and ask the question it naturally raises. Each question is what a later article develops.
The minimal pipeline is the smallest amount of code that respects the four bricks and produces a verifiable answer. Every later article adds capability the team needs after a specific failure on real documents, not because the architecture needed more layers.
This article is one piece of the broader Entreprise Document Intelligence Vol. 1 series, which builds enterprise RAG brick by brick from a baseline pipeline to corpus-scale architecture.

The pipeline has four bricks (Part II goes into each one in detail) plus a final, optional rendering step. Each brick says what it takes in and what it gives back; what we pass from one brick to the next is what we save.
- Document parsing takes a PDF and emits line_df (one row per text line, with page_num, line_num, text, and the bounding box) plus page_df. The minimal version holds both in memory; bigger systems persist them (Article 23 covers when to move to a database).
- Question parsing takes the text question and emits a ParsedQuestion carrying the normalized question plus a short list of checked keywords. It stays narrow on purpose: no retrieval logic here, no question embedding.
- Retrieval takes the ParsedQuestion and emits top-k page numbers (and, when needed, the matching line numbers within those pages). Keeping the handoff to page numbers only keeps it small; the next step rebuilds the filtered lines from line_df on the spot. The question embedding lives in this brick because it depends on the corpus index.
- Generation takes the question, line_df, and the retrieved page numbers, and produces an AnswerWithEvidence: a typed JSON carrying the answer, the evidence span (start_page, start_line, end_page, end_line), a confidence, a justification, the exact quotes from the source, and any caveats. The full JSON is worth saving for evaluation, audit, and replay.
- PDF annotation takes the source PDF and the evidence span and writes an annotated PDF.
The first four are the four bricks (Article 5 develops document parsing, Article 6 question parsing, Article 7 retrieval, Article 8 generation). PDF annotation is the rendering step, not a brick in itself.

A PDF and a question go in. Each brick turns its input into something more structured: document parsing turns the PDF into rows, question parsing turns the question into search-ready keywords, retrieval cuts the rows down to a few page numbers, generation produces a typed answer, and PDF annotation draws the cited lines back onto the source. What comes out is not a chatbot bubble. It’s a sourced JSON answer plus an annotated PDF you can open and check.
The dependencies are minimal:
- The OpenAI Python SDK: by changing base_url, the same library serves Azure, OpenRouter, Ollama, or any compatible endpoint.
No vector database, no orchestration framework, no specialized RAG library. Later articles look at when those libraries’ helpers become useful, and when they get in the way of seeing what’s going on.
“For a 15-page paper, the LLM can read the whole thing. Why bother with retrieval?” Fair point on this one document. We use the paper to teach the method, not to save tokens on these 15 pages. The objection often points to the Needle in a Haystack benchmark (Kamradt, 2023), where frontier models score near-perfectly retrieving a single verbatim sentence from a 1M-token context.
That benchmark is research, not practice. A needle is one isolated, verbatim fact, while enterprise questions aggregate (“every contract whose deductible exceeds €5,000”), compare (“clause 12 across these three policies”), or summarize across many passages. None of those is a single sentence to find.
Two more practical reasons keep retrieval in the loop. Enterprise documents are often long.
Sending the whole thing to the LLM costs real money on every question, every rerun, every user, and dilutes its attention across irrelevant pages.
And the same question runs across hundreds or thousands of documents at once.
At that scale, “throw it all in” stops being a strategy. Retrieval is what makes the pipeline survive both moves: from one short paper to one long contract, and from one document to a whole corpus.
Each step declares its inputs and outputs, and the steps are independent. The output of step N is the input of step N+1, saved as a named DataFrame so any step can be re-run on its own against the saved output of the previous one. In the AI-coding era, an assistant told to “fix retrieval” can quietly modify the question parser when it should have stayed untouched. Independent modules are how you work confidently on one piece without breaking the rest.
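One way to picture this contract, as a minimal sketch: each step writes its output under a stable name, and the next step can be re-run from that saved file alone. The helper names (save_step, load_step, toy_retrieval), the JSON-on-disk format, and the two sample rows are assumptions for illustration; the article itself passes DataFrames.

```python
import json
import tempfile
from pathlib import Path

def save_step(workdir: Path, name: str, rows: list[dict]) -> Path:
    """Persist one step's output under a stable name (hypothetical helper)."""
    path = workdir / f"{name}.json"
    path.write_text(json.dumps(rows), encoding="utf-8")
    return path

def load_step(workdir: Path, name: str) -> list[dict]:
    """Reload a saved step output so the next step can run on its own."""
    return json.loads((workdir / f"{name}.json").read_text(encoding="utf-8"))

def toy_retrieval(lines: list[dict], keyword: str) -> list[int]:
    """Stand-in for the retrieval step: pages whose lines contain the keyword."""
    return sorted({r["page_num"] for r in lines if keyword.lower() in r["text"].lower()})

workdir = Path(tempfile.mkdtemp())

# Step N (document parsing) runs once and saves its output.
parsed_lines = [
    {"page_num": 1, "line_num": 1, "text": "Attention Is All You Need"},
    {"page_num": 6, "line_num": 31, "text": "There are many choices of positional encodings"},
]
save_step(workdir, "line_df", parsed_lines)

# Step N+1 (retrieval) can be re-run later from the saved file alone.
pages = toy_retrieval(load_step(workdir, "line_df"), "positional encoding")
print(pages)
```

Because retrieval only reads the saved file, a change to the question parser or the document parser cannot silently alter what retrieval receives during a re-run.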
The setup chunks below load these dependencies alongside the OpenAI client.
Every brick that talks to a model needs a configured client. The series uses OpenAI’s Python SDK; any provider that exposes an OpenAI-compatible endpoint (Azure OpenAI, vLLM, llama.cpp’s llama-server, …) drops in by changing base_url and the model name.
import os

from dotenv import load_dotenv
from openai import OpenAI

# Read API_KEY, BASE_URL, and the model names from a local .env file.
load_dotenv()

client = OpenAI(
    api_key=os.getenv("API_KEY"),
    base_url=os.getenv("BASE_URL"),
)
model_chat = os.getenv("MODEL_CHAT", "gpt-4.1")
model_embed = os.getenv("MODEL_EMBED", "text-embedding-3-small")
We extract every text line of the PDF along with its position on the page. The output is a DataFrame where each row is one line, with page_num, line_num, the text itself, and the four bounding-box coordinates x0, y0, x1, y1.
- In: a PDF path.
- Out: line_df (one row per text line, with page_num, line_num, text, and the bounding box) plus a page_df we’ll build in section 2.3.
The bounding boxes matter: they’re what we use to draw highlights on the source PDF at the end.
import fitz  # PyMuPDF
import pandas as pd

def fitz_pdf_to_line_df(file_path):
    """Return one row per text line, with its page, line number, and bounding box."""
    doc = fitz.open(file_path)
    data = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        blocks = page.get_text("dict").get("blocks", [])
        line_num = 0  # line numbering restarts on every page
        for block in blocks:
            if block.get("type") != 0:  # keep text blocks, skip image blocks
                continue
            for line in block.get("lines", []):
                spans = line.get("spans", [])
                if not spans:
                    continue
                text = "".join(s["text"] for s in spans)
                # The union of the span boxes is the box of the whole line.
                rect = fitz.Rect(spans[0]["bbox"])
                for span in spans[1:]:
                    rect |= fitz.Rect(span["bbox"])
                data.append({
                    "page_num": page_num + 1,  # 1-based, as printed in a viewer
                    "line_num": line_num + 1,
                    "text": text,
                    "x0": float(rect.x0), "y0": float(rect.y0),
                    "x1": float(rect.x1), "y1": float(rect.y1),
                })
                line_num += 1
    return pd.DataFrame(data)
Running line_df = fitz_pdf_to_line_df(pdf_path) on the Attention paper returns 1048 lines across 15 pages.

The paper, turned into rows. Each line is one row, with its text and the four numbers that locate it on the page. The x0, y0, x1, y1 columns don’t mean much yet; in section 2.5 they’re what we use to draw rectangles on the source PDF, exactly over the lines the model cited.
This DataFrame, line_df, is the core data structure of the rest of the series. Article 5 introduces a richer relational model around it (line_df, chunk_df, toc_df, page_df, image_df).
What this parser doesn’t do: detect tables (Table 1 page 6, Table 3 page 9 flatten into plain lines), reconstruct headings, footnotes, cross-references, or handle multi-column layouts. None of this matters for the question we ask here. For other questions on the same paper, it will. Article 5 covers parsing in full.
Before the question goes to retrieval, we run it through a tiny LLM call. The goal is to extract the keywords most useful for searching the document: short phrases the document is likely to use, not necessarily the literal words of the question.
- In: a text question.
- Out: a ParsedQuestion holding the normalized question and a short list of checked keywords.
This step does not know about retrieval. It does not compute the question embedding either. That one is tied to the corpus index and lives in section 2.3. Keep that line clean and you can swap the embedding model or add a hybrid retriever tomorrow without touching question parsing.
Why bother on a minimal pipeline? Section 2.1 parsed the document into line_df. This subsection parses the question into ParsedQuestionMinimal. Both inputs deserve to be parsed before they hit the search step. Article 6 builds the richer brick (parse_question, with answer shape, scope filters, decomposition, …).
On the question “What are the options mentioned for positional encoding?”, the call parsed_question = get_keywords_from_question(question, client=client) returns parsed_question.keywords = ['positional encoding'].
question = "What are the options mentioned for positional encoding?"
parsed_question = get_keywords_from_question(question, client=client)
print(parsed_question.keywords)
['positional encoding']
The LLM produces a single, literal phrase like ['positional encoding']. That’s deliberate. An earlier draft of this prompt asked for “3 to 5 short keywords useful for searching”, and the LLM happily filled the quota with paraphrases (positional encoding options, types of positional encoding, transformer positional encoding). None of those are written in the document. Only positional encoding is. Substring matching is strict: a single missing word kills the match. The minimal version asks the LLM to do less (extract the literal noun phrase, drop the question framing) and trusts the next block to do the rest.
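To see how strict substring matching is, here is a minimal sketch that tests the literal phrase and the three paraphrases from the earlier draft against a short page text. The page text is a toy snippet loosely modelled on the paper, and the matches helper is a hypothetical name, not part of the article's code.

```python
# Toy page text (an assumption for the example, not an exact quote of the PDF).
page_text = (
    "3.5 Positional Encoding. Since our model contains no recurrence and no "
    "convolution, we add positional encodings to the input embeddings."
)

def matches(keyword: str, text: str) -> bool:
    """Case-insensitive substring match: every word must appear, in order."""
    return keyword.lower() in text.lower()

candidates = [
    "positional encoding",              # literal phrase from the document
    "positional encoding options",      # paraphrase: one extra word
    "types of positional encoding",     # paraphrase
    "transformer positional encoding",  # paraphrase
]
results = {kw: matches(kw, page_text) for kw in candidates}
for kw, hit in results.items():
    print(f"{kw!r}: {hit}")
```

Only the literal phrase matches; each paraphrase adds one word the page never writes next to the phrase, and that is enough to miss.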
What this minimal version doesn’t do:
- answer_shape (Q&A vs summarization)
- scope filters
- question decomposition
All covered in Article 6, under the richer parse_question brick. Here we keep two fields, corrected_question and keywords, the smallest version that makes the brick visible.
Note: overriding the system prompt.
get_keywords_from_question exposes the system prompt as a kwarg with KEYWORDS_PROMPT as default. To test a variant (different domain, stricter rules, extra examples), pass system_prompt=... at the call site. No edit to the function. Same pattern for every LLM helper in docintel (llm_answer_with_evidence exposes both system_prompt and user_template). Below: the same call, run twice on a contract-style question. First with the research-paper default, which stays generic. Then with a contract-domain prompt, which picks up insurance vocabulary like exclusions, deductible.
demo_question = "Are earthquakes excluded from coverage?"

# Default: research-paper prompt.
parsed_question_default = get_keywords_from_question(demo_question, client=client)
print("Default (research-paper):", parsed_question_default.keywords)

# Override: insurance / legal contract prompt.
contract_prompt = (
    "Extract 1 to 3 short keywords from the user question for searching an "
    "insurance contract or legal policy. Prefer literal terms the contract is "
    "likely to use: clauses, exclusions, named perils, deductibles, caps. Drop "
    "question framing words. Output 1 to 3 keywords."
)
parsed_question_contract = get_keywords_from_question(
    demo_question, system_prompt=contract_prompt, client=client,
)
print("Contract prompt: ", parsed_question_contract.keywords)
Default (research-paper): ['earthquakes', 'coverage']
Contract prompt: ['earthquakes', 'exclusions', 'coverage']
Sending all 1048 lines to the LLM works on a paper this size but does not scale and dilutes the model’s attention. We cut the document down to the few pages most likely to contain the answer.
- In: the checked keywords (and/or the normalized question, depending on the method) from section 2.2.
- Out: the top-k page numbers, plus optionally the matching line numbers within those pages.
The question embedding is computed here, not in section 2.2, because an embedding only makes sense relative to the index it was built on. Same logic for any hybrid scoring or BM25 statistics.
The standard answer in 2024 RAG tutorials is embeddings: turn each page into a vector, score by cosine similarity. Article 2 is dedicated to them. For the minimal version, we deliberately don’t, for one reason.
Embeddings are opaque. Cosine similarity returns a number like 0.7798 and asks the user to trust that “page 6 is relevant to the question”. Show that score to a domain expert, a product owner, or a manager: nobody understands what 0.78 means, or why it’s higher than 0.65. Developers may argue they understand it (“dot product of normalized vectors”). They understand the math, not the relevance. Asked why this specific page scored 0.7798 against this specific question, they shrug and point at the model.
In an enterprise context, retrieval is the step users question the most. Why did the system look at this page and not that one? You have to explain it. So the minimal version uses something we can read with our own eyes: keyword matching. Section 2.2 pulled the keywords; we score each page by how many of those keywords appear in it, and keep the top three.
Where we search vs what we return: both pages here. Real retrieval has two levels. The anchor is where the keyword or embedding actually hits (a line, a sentence). The context is what we hand to generation (the lines around it, the page). We search small, we return big. Here we use the page for both. That works on an academic paper where each page is roughly one idea. Article 7 separates the two levels for long contracts, multi-column reports, table-heavy documents.
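The two levels can be sketched in a few lines: find the anchor lines where the keyword hits, then hand back a window of lines around each anchor. This is only an illustration of the idea, not the article's retriever; find_anchors, expand_context, the window size, and the four toy rows are assumptions.

```python
def find_anchors(lines, keyword):
    """Indices of the lines where the keyword actually hits (the anchor level)."""
    return [i for i, row in enumerate(lines) if keyword.lower() in row["text"].lower()]

def expand_context(lines, anchor_idx, window=1):
    """Lines around one anchor (the context level handed to generation)."""
    start = max(0, anchor_idx - window)
    end = min(len(lines), anchor_idx + window + 1)
    return lines[start:end]

# Toy rows; page and line numbers are illustrative only.
lines = [
    {"page_num": 1, "line_num": 1, "text": "to the input embeddings at the bottoms of the stacks."},
    {"page_num": 1, "line_num": 2, "text": "The positional encodings have the same dimension"},
    {"page_num": 1, "line_num": 3, "text": "as the embeddings, so that the two can be summed."},
    {"page_num": 1, "line_num": 4, "text": "In this work, we use sine and cosine functions."},
]
anchors = find_anchors(lines, "positional encoding")
context = expand_context(lines, anchors[0], window=1)
print([row["line_num"] for row in context])
```

In the minimal pipeline both levels are the page, which amounts to a window that covers every line of the page holding the anchor.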
page_df = build_page_df(line_df) collapses the 1048 lines into 15 pages, one row per page.
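The article does not show build_page_df itself. As a rough sketch of what such a collapse could look like, assuming it simply joins the lines of each page in reading order, here is a version on plain dictionaries; the function name build_pages and the n_lines and text fields are assumptions.

```python
from collections import defaultdict

def build_pages(line_rows):
    """Collapse line rows into one record per page, keeping line order."""
    by_page = defaultdict(list)
    for row in sorted(line_rows, key=lambda r: (r["page_num"], r["line_num"])):
        by_page[row["page_num"]].append(row["text"])
    return [
        {"page_num": page, "n_lines": len(texts), "text": "\n".join(texts)}
        for page, texts in sorted(by_page.items())
    ]

# Toy rows, deliberately out of order to show the sort.
line_rows = [
    {"page_num": 2, "line_num": 1, "text": "1 Introduction"},
    {"page_num": 1, "line_num": 2, "text": "Ashish Vaswani"},
    {"page_num": 1, "line_num": 1, "text": "Attention Is All You Need"},
]
pages = build_pages(line_rows)
print(pages[0])
```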

Embed every page (one call per page), embed the question, compute cosine similarity, keep the top-k. The output: a number like 0.7798 per page. Look at the scores below: can you tell why a page made the top three? Could you explain the ranking to a domain expert? That’s the opaque-score problem the article opens with.

Three numbers, all very close to each other (0.7843, 0.7798, 0.7728). Can you say why page 9 beats page 6? The text preview makes the problem obvious: page 9 is the Variations on the Transformer architecture table, page 5 is about output values and concatenation, page 6 is the Maximum path lengths table. The page that actually answers the question, section 3.5 Positional Encoding, sits on page 6 and ranks last in the top three. The unrelated page 5 ranks second. The scores look precise, but the ranking has no story behind it: there is no token to point at, no phrase to defend, just a dot product on two black-box vectors. Embeddings work in many cases, and Article 2 unpacks where this score comes from. But the score itself never becomes interpretable, and for the rest of this article we use a retriever you can read with your own eyes.
For each page, count how many of parsed_question.keywords appear in it (case-insensitive substring match). Drop pages with zero matches; keep the top-k by match count. The output table below carries the actual matched_keywords per page, so anyone can read it and see why a page was picked.
retrieve_pages(page_df, line_df, parsed_question.keywords, top_k=3) returns the top three pages by keyword count plus the filtered lines: 314 lines kept from pages 6, 9, 7.

Three pages, ranked by match count, with the actual matches laid out. Pages 6, 8, and 9 each contain the literal phrase positional encoding; page 6 holds Section 3.5 Positional Encoding with the actual answer. Anyone reading the table can verify the result by hand: search the source for positional encoding and you’ll find these three pages.
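The keyword scoring described above fits in a dozen lines. The sketch below works on plain dictionaries instead of page_df, and its name (retrieve_pages_by_keywords), its tie-breaking rule, and the three toy pages are assumptions, so read it as an illustration of the counting logic rather than the article's retrieve_pages.

```python
def retrieve_pages_by_keywords(pages, keywords, top_k=3):
    """Rank pages by how many distinct keywords they contain."""
    scored = []
    for page in pages:
        text = page["text"].lower()
        matched = [kw for kw in keywords if kw.lower() in text]
        if matched:  # drop pages with zero matches
            scored.append({
                "page_num": page["page_num"],
                "match_count": len(matched),
                "matched_keywords": matched,
            })
    # Highest count first; ties broken by page order (an assumption).
    scored.sort(key=lambda r: (-r["match_count"], r["page_num"]))
    return scored[:top_k]

# Toy pages, not the real page texts of the paper.
pages = [
    {"page_num": 1, "text": "We propose the Transformer, based solely on attention."},
    {"page_num": 2, "text": "Positional encoding uses sine and cosine functions."},
    {"page_num": 3, "text": "Label smoothing and positional encoding are both used."},
]
hits = retrieve_pages_by_keywords(pages, ["positional encoding", "label smoothing"])
for hit in hits:
    print(hit)
```

Every row of the result carries its matched_keywords, which is what lets a reader check the ranking by searching the source for the same phrases.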
Two design choices:
- Ties stay in the order nlargest returns. The downstream LLM sees the lines from all tied pages in document order and decides.
From 1048 lines to 314, and we know the right material is in there.
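import numpy as np

def cosine_sim_matrix(query_vec, doc_matrix):
    """Cosine similarity between one query vector and every row of doc_matrix."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    # Same epsilon on the document norms, so an all-zero row cannot divide by zero.
    d = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + 1e-12)
    return d @ q

def retrieve_pages(page_df, line_df, question, top_k=3):
    """Embedding variant: rank pages by cosine similarity to the question."""
    # get_embedding is the embedding helper of the series (not shown here).
    q_vec = np.asarray(get_embedding(question), dtype=np.float32)
    doc_matrix = np.vstack(page_df["embedding"].values)
    sims = cosine_sim_matrix(q_vec, doc_matrix)
    scored = page_df.copy()
    scored["similarity"] = sims
    retrieved_pages_df = scored.nlargest(top_k, "similarity")
    kept_pages = retrieved_pages_df["page_num"].tolist()
    # Rebuild the filtered lines from line_df using the kept page numbers.
    filtered_line_df = line_df[line_df["page_num"].isin(kept_pages)]
    return retrieved_pages_df, filtered_line_df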
Note: the “split into individual words” trap. A natural reflex when the multi-word phrases don’t match: split them and search for the individual tokens. Below we expand every keyword into its words, deduplicate, then re-run retrieval. We get matches, and we also get false positives, because words like encoding, transformer, network appear all over the document in unrelated contexts.
Now every page in the top three matches several tokens, but look at which tokens. Words like encoding and transformer cover most of the paper. Pages about layer encoding or encoder stacks look as relevant as the page that actually answers the question. Splitting trades one failure (zero matches) for another (false positives). Article 7 covers the real fixes (synonym expansion through a dictionary, hybrid scoring); for now, keep the phrase whole.
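The trap is easy to reproduce on two toy pages. In this sketch (the page texts and the count_matches helper are assumptions made up for the example), the whole phrase separates the answer page from an unrelated one, while the split tokens score both pages the same.

```python
def count_matches(text, terms):
    """Number of distinct terms found in the text (case-insensitive substring)."""
    lowered = text.lower()
    return sum(term.lower() in lowered for term in terms)

# Toy pages, invented for the example.
pages = {
    "answer page": "Positional encoding: we use sine and cosine functions.",
    "unrelated page": "The encoder maps symbols; encoding layers use positional indices.",
}
phrase = ["positional encoding"]
# Split every keyword into its words and deduplicate.
tokens = sorted({word for kw in phrase for word in kw.split()})

phrase_scores = {name: count_matches(text, phrase) for name, text in pages.items()}
token_scores = {name: count_matches(text, tokens) for name, text in pages.items()}
print("phrase:", phrase_scores)
print("tokens:", token_scores)
```

With the phrase kept whole, only the answer page scores; after splitting, the unrelated page ties with it because both words occur there in different places.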
Same pipeline, a different question. We ask about the value of epsilon used in label smoothing. The answer is on page 8 of the paper, written as ε_ls = 0.1 (Greek letter ε, never the English word epsilon). Watch what each retriever does.
question_2 = "What is the value of epsilon used in label smoothing?"
parsed_question_2 = get_keywords_from_question(question_2, client=client)
print("Keywords:", parsed_question_2.keywords)
Keywords: ['epsilon', 'label smoothing']
Two failures of different shapes:
- Embeddings: page 8 (where ε_ls = 0.1 lives) may or may not be in the top three. Pages dense in math notation come up even when they’re unrelated.
- Keywords: the retriever searches for epsilon, label smoothing, etc. The document writes the Greek letter ε. Substring match returns zero on anything that mentions epsilon by symbol only. The page that contains the answer is invisible to the keyword retriever.
Section 4.4 picks this up as the bridge to Article 2 (Embeddings handle synonyms and surface variation) and Article 6 (richer Question Parsing pulls in alternatives like the Greek letter).
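The keyword failure, and one possible direction for a fix, can be shown on a single toy line. The alias table below is purely hypothetical (a real one would come from a domain dictionary or from richer question parsing), as are the helper names expand_keywords and matched.

```python
# Hypothetical alias table mapping a spelled-out name to its symbol.
ALIASES = {"epsilon": ["ε"]}

def expand_keywords(keywords, aliases=ALIASES):
    """Add known surface variants next to each keyword, keeping order."""
    expanded = []
    for kw in keywords:
        for variant in [kw, *aliases.get(kw.lower(), [])]:
            if variant not in expanded:
                expanded.append(variant)
    return expanded

def matched(text, keywords):
    """Keywords found in the text by case-insensitive substring match."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

line = "ε_ls = 0.1"  # toy line that names epsilon by symbol only
keywords = ["epsilon"]
print(matched(line, keywords))                   # plain substring match
print(matched(line, expand_keywords(keywords)))  # with the alias table
```

The plain match finds nothing, while the expanded list catches the symbol. A bare ε would of course also hit every other epsilon in a math-heavy document, so this is a sketch of the mechanism, not a recommendation.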
We send the retrieved lines to the LLM with the question, formatted as a tab-separated block where page_num and line_num sit next to each line. That format gives the LLM the exact coordinates it needs to cite.
- In: the original question, line_df, and the retrieved page numbers from section 2.3.
- Out: an AnswerWithEvidence, a structured JSON with the answer, the evidence span (start_page_num, start_line_num, end_page_num, end_line_num), a confidence, a justification, the exact quotes, and any caveats.
from pydantic import BaseModel, Field

class AnswerWithEvidence(BaseModel):
    answer: str = Field(...)
    # Evidence span; None when the model cannot point to a passage.
    start_page_num: int | None
    start_line_num: int | None
    end_page_num: int | None
    end_line_num: int | None
    confidence: float = Field(..., ge=0.0, le=1.0)
    justification: str = Field(...)
    quotes: list[str] = Field(default_factory=list)
    caveats: list[str] = Field(default_factory=list)
The raw JSON is worth saving in production: justification, quotes, caveats, and confidence all feed evaluation, audit, and replay, well beyond the answer field a chat UI shows.
We serialize the filtered lines into a TSV with header page_num\tline_num\ttext, one row per line. The LLM sees the exact coordinates next to each text fragment so it can cite by (page_num, line_num) in its answer.
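A minimal sketch of that serialization, on plain dictionaries rather than the DataFrame. The helper name lines_to_tsv, the whitespace normalization inside the text column, and the two sample rows are assumptions added for the example.

```python
def lines_to_tsv(rows):
    """Serialize rows as page_num<TAB>line_num<TAB>text, one line per row."""
    out = ["page_num\tline_num\ttext"]
    for row in rows:
        # Collapse tabs and newlines inside the text so the columns stay aligned.
        text = " ".join(str(row["text"]).split())
        out.append(f"{row['page_num']}\t{row['line_num']}\t{text}")
    return "\n".join(out)

# Toy rows; the split of the sentence across two lines is illustrative.
rows = [
    {"page_num": 6, "line_num": 31, "text": "There are many choices of positional encodings,"},
    {"page_num": 6, "line_num": 32, "text": "learned and\tfixed [9]."},
]
prompt_block = lines_to_tsv(rows)
print(prompt_block)
```

With a DataFrame, the same block could be produced with to_csv(sep="\t", index=False) on the three columns.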
This is what makes the answer grounded: the schema forces the model to fill in (start_page, start_line, end_page, end_line), a verbatim quote, and caveats if anything is uncertain. No prose, only a typed object with citations.
def llm_answer_with_evidence(question, filtered_text_prompt):
    """Ask the model for an answer constrained to the AnswerWithEvidence schema."""
    resp = client.responses.parse(
        model=model_chat,
        input=[
            {
                "role": "system",
                "content": (
                    "Answer using ONLY the provided lines. "
                    "Return JSON only."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Lines:\n{filtered_text_prompt}\n\n"
                    f"Question:\n{question}\n\n"
                    "Pick a contiguous evidence span."
                ),
            },
        ],
        text_format=AnswerWithEvidence,
        store=False,
    )
    # JSON text that follows the schema; downstream code parses it with json.loads.
    return resp.output_text
We call answer = llm_answer_with_evidence(question, filtered_line_df, client=client) and get back an AnswerWithEvidence instance.
{
"answer": "The options for positional encoding mentioned are learned positional embeddings and fixed positional encodings (specifically, using sine and cosine functions of different frequencies).",
"start_page_num": 6,
"start_line_num": 31,
"end_page_num": 6,
"end_line_num": 32,
"confidence": 0.98,
"justification": "Lines 31–32 explicitly state: 'There are many choices of positional encodings, learned and fixed [9].' Additionally, further lines detail the sinusoidal encoding as the fixed choice, and Table 3 row (E) discusses using learned embeddings instead.",
"quotes": [
"There are many choices of positional encodings, learned and fixed [9]."
],
"caveats": [
"Further details about the specific implementation of learned embeddings are only touched on elsewhere, but both options are mentioned here."
],
"complete_answer_found": true,
"context_structured": true,
"llm_discovered_keywords": [
"learned positional embeddings",
"fixed positional encodings",
"sinusoidal positional encoding"
]
}
Three things happened that matter:
- The answer comes with a (page, line) range we can verify.
If the model can’t fill the schema, null fields are allowed and caveats records why. Article 8 develops the schema into a much richer form with per-brick feedback fields; Article 23 builds the storage architecture around it.
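Verifying the range can itself be automated. The sketch below checks whether each quote appears verbatim inside the cited span; it is one possible check, not part of the article's pipeline, and the helper names and the three toy rows are assumptions.

```python
def span_text(lines, start, end):
    """Join the text of every line between two (page_num, line_num) pairs."""
    picked = [
        row["text"] for row in lines
        if start <= (row["page_num"], row["line_num"]) <= end
    ]
    return " ".join(" ".join(picked).split())

def unverified_quotes(lines, answer):
    """Return the quotes that do not appear verbatim inside the cited span."""
    if answer["start_page_num"] is None:
        return list(answer["quotes"])  # null path: nothing to verify against
    start = (answer["start_page_num"], answer["start_line_num"])
    end = (answer["end_page_num"], answer["end_line_num"])
    haystack = span_text(lines, start, end)
    return [q for q in answer["quotes"] if " ".join(q.split()) not in haystack]

# Toy rows; the line split is illustrative.
lines = [
    {"page_num": 6, "line_num": 31, "text": "There are many choices of positional encodings,"},
    {"page_num": 6, "line_num": 32, "text": "learned and fixed [9]."},
    {"page_num": 6, "line_num": 33, "text": "In this work, we use sine and cosine functions"},
]
answer = {
    "start_page_num": 6, "start_line_num": 31,
    "end_page_num": 6, "end_line_num": 32,
    "quotes": ["There are many choices of positional encodings, learned and fixed [9]."],
}
print(unverified_quotes(lines, answer))
```

An empty list means every quote was found in the cited lines. Words hyphenated across a line break would fail this simple check and need extra handling.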
Sanity check. On a paper this short we can also send the entire line_df to the LLM with no retrieval and check the answer matches. Reassuring here, won’t scale to large documents.
{
"answer": "The options mentioned for positional encoding are sinusoidal positional encodings (using sine and cosine functions of different frequencies) and learned positional embeddings.",
"start_page_num": 6,
"start_line_num": 27,
"end_page_num": 6,
"end_line_num": 41,
"confidence": 0.99,
"justification": "Lines 6:27-6:41 describe adding 'positional encodings' to the input embeddings, specify the sinusoidal method, and mention experimenting with learned positional embeddings, stating both options were tried and produced nearly identical results.",
"quotes": [
"Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add 'positional encodings' to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9]. In this work, we use sine and cosine functions of different frequencies: ... We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training."
],
"caveats": [
"Exact mathematical formulas for sinusoidal encoding are present here, but full details for learned embeddings are not. Table 3 row (E) and further details may expand on results but are not needed for the options question."
],
"complete_answer_found": true,
"context_structured": true,
"llm_discovered_keywords": [
"sinusoidal positional encoding",
"learned positional embeddings",
"sine and cosine functions",
"relative or absolute position"
]
}
Now the satisfying part. We use the evidence span to draw rectangles directly on the source PDF.
- In: the source PDF and the evidence span from the AnswerWithEvidence.
- Out: an annotated PDF with rectangles drawn around the cited lines.
- Optional. A CLI tool, a batch job, or an API may skip it; the answer with citations is already complete after section 2.4.
Three calls do the work:
- passage_lines_df_from_answer(line_df, answer) rebuilds the cited-line DataFrame from the evidence span.
- passage_bbox_by_page(passage_df) groups bounding boxes per page.
- draw_passage_rectangles(pdf_path, bboxes_df, out_pdf_path) writes the annotated PDF.

import json

import fitz  # PyMuPDF

def passage_lines_df_from_answer(line_df, answer_json):
    """Select the lines covered by the evidence span of a JSON answer."""
    a = json.loads(answer_json)
    sp, sl = a["start_page_num"], a["start_line_num"]
    ep, el = a["end_page_num"], a["end_line_num"]
    if sp is None:  # null path: no evidence span, so no lines
        return line_df.iloc[0:0]
    mask = (
        line_df["page_num"].between(sp, ep)
        & ((line_df["page_num"] != sp) | (line_df["line_num"] >= sl))
        & ((line_df["page_num"] != ep) | (line_df["line_num"] <= el))
    )
    return line_df.loc[mask].copy()

def passage_bbox_by_page(passage_df):
    """Collapse the cited lines of each page into one wrapping box."""
    return passage_df.groupby("page_num", as_index=False).agg(
        x0=("x0", "min"), y0=("y0", "min"),
        x1=("x1", "max"), y1=("y1", "max"))

def draw_passage_rectangles(pdf_path, bboxes_df, out_path):
    """Draw one rectangle annotation per page and save a copy of the PDF."""
    doc = fitz.open(pdf_path)
    for _, r in bboxes_df.iterrows():
        page = doc[int(r["page_num"]) - 1]  # page_num is 1-based
        page.add_rect_annot(fitz.Rect(r["x0"], r["y0"], r["x1"], r["y1"]))
    doc.save(out_path)

The passage really is where the answer comes from. The red box wraps the Positional Encoding paragraph: the sentence that introduces the choice (“we use sine and cosine functions of different frequencies”) and the two-line formula directly below it. The reader can move from the chat answer to the citation to the source paragraph without leaving the same screen. That’s the whole point.
Why a box around the whole paragraph and not the exact words? Because we worked at the line granularity: line_df carries one bounding box per text line, the LLM cites a (start_line, end_line) span, and passage_bbox_by_page collapses every line in that span into one wrapping rectangle. If you want to draw the box around the exact words sin(pos / 10000^(2i/d_model)) instead of the whole paragraph, the approach is the same. Just change the granularity. Replace line_df with a word-level word_df (PyMuPDF’s page.get_text("words") gives you a bounding box per word), make the schema cite (start_word, end_word), and passage_bbox_by_page already does the right thing. Same four-brick pipeline, finer scope.
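The collapse into one wrapping rectangle per page does not care whether the rows are lines or words, which is why changing the granularity leaves this step untouched. Here is a dependency-free sketch of the same min/max union; wrap_boxes_by_page and the coordinates are assumptions for illustration.

```python
def wrap_boxes_by_page(cited_rows):
    """Union the boxes of each page into one (x0, y0, x1, y1) rectangle."""
    boxes = {}
    for row in cited_rows:
        x0, y0, x1, y1 = row["x0"], row["y0"], row["x1"], row["y1"]
        if row["page_num"] in boxes:
            bx0, by0, bx1, by1 = boxes[row["page_num"]]
            x0, y0 = min(bx0, x0), min(by0, y0)
            x1, y1 = max(bx1, x1), max(by1, y1)
        boxes[row["page_num"]] = (x0, y0, x1, y1)
    return boxes

# Toy rows: they could be two lines or two words, the union is the same.
cited_rows = [
    {"page_num": 6, "x0": 108.0, "y0": 500.0, "x1": 504.0, "y1": 510.0},
    {"page_num": 6, "x0": 108.0, "y0": 511.0, "x1": 300.0, "y1": 521.0},
    {"page_num": 7, "x0": 120.0, "y0": 72.0, "x1": 480.0, "y1": 82.0},
]
boxes = wrap_boxes_by_page(cited_rows)
print(boxes)
```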
The bricks chain into a single call. Feed in a PDF and a question; get back a typed answer with line citations, and optionally an annotated PDF.
- In: a PDF path and a text question (plus an optional top_k and an optional output PDF path).
- Out: an AnswerWithEvidence, and (if annotate_pdf is given) an annotated PDF on disk.
Inside, pdf_qa_baseline chains document parsing → question parsing → retrieval → generation → PDF annotation. What crosses the retrieval → generation boundary is just the page numbers; the filtered line_df is rebuilt inside generation.
def pdf_qa_baseline(
    pdf_path: str,
    question: str,
    top_k: int = 3,
    annotate_pdf: str | None = None,
):
    # 1. Parsing
    line_df = fitz_pdf_to_line_df(pdf_path)
    # 2. Retrieval
    page_df = embed_page_df(build_page_df(line_df))
    _, filtered = retrieve_pages(page_df, line_df, question, top_k)
    # 3. Generation: serialize the kept lines as TSV so the model sees coordinates
    lines_tsv = filtered[["page_num", "line_num", "text"]].to_csv(sep="\t", index=False)
    answer = llm_answer_with_evidence(question, lines_tsv)
    # 4. Optional highlighting on the source PDF
    if annotate_pdf is not None:
        passage = passage_lines_df_from_answer(line_df, answer)
        bboxes = passage_bbox_by_page(passage)
        draw_passage_rectangles(pdf_path, bboxes, annotate_pdf)
    return answer
{
"answer": "The options mentioned for positional encoding are learned and fixed positional encodings, specifically sinusoidal positional encodings (using sine and cosine functions of different frequencies) and learned positional embeddings.",
"start_page_num": 6,
"start_line_num": 31,
"end_page_num": 6,
"end_line_num": 41,
"confidence": 0.99,
"justification": "Lines 31-41 discuss the choices for positional encodings, stating that there are many choices including learned and fixed encodings. It then explains the use of sine and cosine functions (sinusoidal encoding) and notes that learned positional embeddings were also experimented with.",
"quotes": [
"There are many choices of positional encodings, learned and fixed [9].",
"In this work, we use sine and cosine functions of different frequencies: ...",
"We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E))."
],
"caveats": [],
"complete_answer_found": true,
"context_structured": true,
"llm_discovered_keywords": [
"positional encodings",
"learned",
"fixed",
"sinusoidal",
"sine and cosine functions",
"learned positional embeddings"
]
}
This is the API of the article. Later articles build a sister function ask_corpus(question, corpus, ...) for archive-scale work: same contract (typed answer with citations), different scope (filter the corpus first, then run document-level work on the matching documents).
Drop in any PDF you have around: a paper from your own field, a contract, a report from work. Here we pick the World Bank’s April 2026 Commodity Markets Outlook (World Bank publication, April 2026 issue; CC BY 3.0 IGO, as declared on the World Bank Open Knowledge Repository publication page for this issue): a 69-page report on energy, agriculture, and fertilizer markets, far from a research paper in tone and structure.
Same four bricks, same default prompts, same retrieve_pages, same schema. Nothing about the pipeline changes for a new document.
We start with a question whose answer lives deep in the report, in the metals chapter rather than the Executive Summary: the outlook for aluminum prices in 2026.
We call pdf_qa_baseline end-to-end: pass the CMO PDF, the aluminum question, top_k=3, and an annotate_pdf path so the pipeline also writes the highlighted source. The returned answer_cmo_al is the same AnswerWithEvidence shape we saw on the Attention paper.
{
"answer": "Aluminum prices are projected to rise by about 22 percent in 2026 (y/y) to reach an all-time high—about 21 percent higher than their January 2026 projections—supported by tight supply conditions and solid demand growth. Prices are expected to decline by about 6 percent in 2027 as supply conditions gradually ease.",
"start_page_num": 45,
"start_line_num": 32,
"end_page_num": 45,
"end_line_num": 43,
"confidence": 0.98,
"justification": "The selected span explicitly provides the projected percentage increase for aluminum prices in 2026, the context for these movements, and the outlook for 2027. It also mentions the record-high level forecast and factors driving the price.",
"quotes": [
"Aluminum prices are projected to rise by about 22 percent in 2026 (y/y) to reach an all-time high—about 21 percent higher than their January 2026 projections—supported by tight supply conditions and solid demand growth (table 1).",
"Prices are expected to decline by about 6 percent in 2027 as supply conditions gradually ease."
],
"caveats": [],
"complete_answer_found": true,
"context_structured": true,
"llm_discovered_keywords": [
"all-time high",
"tight supply conditions",
"solid demand growth"
]
}
The composite view places the highlighted source page next to the question and the answer, so the citation can be checked at a glance:

A harder question on the same report. What if we ask about something the report mentions only in passing? We try the AI-related electricity demand question, whose answer the World Bank developed only in an “Upside risk” sidebar on page 31.
Same call shape, harder question: pdf_qa_baseline(pdf_path=pdf_path_cmo, question=question_cmo_ai, top_k=3, ...). The pipeline must decide whether the retrieved pages actually carry the AI-electricity figure or whether to flag the answer as not found.
{
"answer": "The provided lines mention that faster-than-anticipated expansion of AI-related data centers could boost demand for certain metals like aluminum and copper, but do not quantify the contribution of AI-related data centers to global electricity demand growth.",
"start_page_num": 47,
"start_line_num": 39,
"end_page_num": 47,
"end_line_num": 40,
"confidence": 0.8,
"justification": "The only mention of AI-related data centers is in relation to demand for metals, not electricity demand. There is no quantitative estimate or percentage given for their impact on global electricity demand growth.",
"quotes": [
"Also, faster-than-antici-\npated expansion of AI-related data centers could \nboost demand for aluminum and copper, driving \nprices higher."
],
"caveats": [
"No specific figures or direct statements about global electricity demand growth caused by AI-related data centers were found in the provided lines."
],
"complete_answer_found": false,
"context_structured": true,
"llm_discovered_keywords": [
"AI-related data centers",
"electricity demand growth",
"boost demand for aluminum and copper"
]
}

But how can we be sure the answer really doesn’t exist in the document? Strictly, we can’t, at least not from this null path alone. What the schema says is “the LLM didn’t find the answer in the lines it was shown”, which is a different claim from “the answer is not in the document”. The Upside-risk sidebar on page 31 of the same CMO report does quantify the figure (the World Bank cites the IEA’s 8% projection of global electricity demand growth from 2024 to 2030). The default keyword pipeline pulled page 47 and nearby pages instead, where the report’s prose discusses AI’s effect on metal demand. Proving absence would require either running the LLM on every page, or a retrieval method that surfaces sidebar text and short reference mentions. That’s exactly what Article 7 (Retrieval) develops; for the minimal version, “I didn’t find it in the top three pages” is what we report.
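One way to keep those two claims apart is to carry the search scope along with the null answer. The sketch below is illustrative only: `describe_result`, `searched_pages`, and `total_pages` are hypothetical names that are not part of the pipeline above; only `complete_answer_found` comes from the schema. The total page count in the usage line is a placeholder, not the real length of the report.

```python
from __future__ import annotations


def describe_result(
    complete_answer_found: bool,
    searched_pages: list[int],
    total_pages: int,
) -> str:
    """Phrase a result so a null answer never claims more than was searched."""
    if complete_answer_found:
        return "answer found"
    unique_pages = sorted(set(searched_pages))
    if len(unique_pages) >= total_pages:
        # Only an exhaustive pass supports a claim about the whole document.
        return "not found in the document (every page was searched)"
    pages = ", ".join(str(p) for p in unique_pages)
    return f"not found in pages {pages} ({len(unique_pages)} of {total_pages} searched)"


# Placeholder values: three retrieved pages out of an assumed 60-page report.
print(describe_result(False, [47, 46, 48], total_pages=60))
```

The point of the wording is that the report to the user states which pages were read, so "not found" stays a claim about retrieval and not about the document.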
A small batch of four questions on the same two documents, all results in one table. Read the table for patterns, not for every cell.
def run_pipeline_test(
    question: str,
    line_df_in: pd.DataFrame,
    page_df_in: pd.DataFrame,
    page_df_emb_in: pd.DataFrame,
    top_k: int = 3,
    client=client,
) -> dict:
    """Run both retrievers + generation on one question; return a summary dict."""
    parsed_q = get_keywords_from_question(question, client=client)
    retrieved_emb_df, _ = retrieve_pages_by_similarity(
        page_df_emb_in, line_df_in, question, top_k=top_k, client=client,
    )
    retrieved_kw_df, filtered_lines_kw = retrieve_pages(
        page_df_in, line_df_in, parsed_q.keywords, top_k=top_k,
    )
    # If keyword retrieval finds nothing, fall back to the whole doc so generation
    # still runs (small PDFs only: would not scale to a real corpus).
    lines_for_generation = (
        filtered_lines_kw if len(filtered_lines_kw) > 0 else line_df_in
    )
    answer = llm_answer_with_evidence(
        question, lines_for_generation, client=client,
    )
    return {
        "question": question,
        "keywords": parsed_q.keywords,
        "emb_top3": retrieved_emb_df["page_num"].tolist(),
        "kw_top3": (
            retrieved_kw_df["page_num"].tolist()
            if len(retrieved_kw_df) > 0 else "(no kw match)"
        ),
        "answer_excerpt": (answer.answer[:80] + ("..." if len(answer.answer) > 80 else "")),
        "cite_page": answer.start_page_num,
    }

Read the table left-to-right per row. Four patterns to take away:
learning rate. Same lesson as the epsilon row in section 2.3.c: when the question depends on a precise term the document prints verbatim, keywords are the better tool.
(no kw match) outright, with no false ‘top-3 pages’ that look plausible. The schema then returns a null answer with a caveat. A clean ‘I don’t know’ is the system’s most valuable behavior on out-of-scope questions.
d_model, h, d_k, d_v, etc. Our parser flattened the table into plain lines, so a model that asks for two cells side by side has to reassemble the row from text alone. Keywords retrieve page 4 (the literal phrase d_k appears there), but the citation often points to one value while the other is paraphrased. The fix is structural: parse tables as tables, not as lines. That’s Article 5 (parsing) and Article 6 (compound-question decomposition) doing their job.
What this minimal system does well:
caveats field says why. No fabrication.
Now look at the same system again. Each block hides assumptions worth questioning.
We extracted text line by line. That’s reasonable for an academic paper, but look at what we threw away: section structure, headings, table layouts, figures, footnotes, cross-references. Page 4 of this paper contains Table 1 with the per-layer complexities. We parsed each of its rows as plain lines, losing the table structure entirely. Page 9 contains Table 3, the ablation study. Same problem.
For a question like “What are the options for positional encoding?” this doesn’t matter. The answer is in continuous prose. For a question like “What is the per-layer complexity of self-attention?” it suddenly does, because the answer lives in a table cell that our parser flattened into noise.
That’s the topic of Article 5: Parsing. Documents have structure. Ignoring it is the single biggest source of downstream failure.
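To make the loss concrete, here is a minimal sketch that holds two rows of the paper's Table 1 both ways: as the flat lines a line-by-line parser keeps, and as records that preserve which value belongs to which column. The helper names (`as_flat_lines`, `as_records`, `lookup`) are hypothetical and written for this illustration; they are not the parser used in this article.

```python
# Two rows of Table 1 (Vaswani et al. 2017), kept with their column headers.
HEADER = [
    "layer_type",
    "complexity_per_layer",
    "sequential_operations",
    "maximum_path_length",
]
ROWS = [
    ["Self-Attention", "O(n^2 * d)", "O(1)", "O(1)"],
    ["Recurrent", "O(n * d^2)", "O(n)", "O(n)"],
]


def as_flat_lines(header, rows):
    """What a line-by-line parser keeps: text only, no column boundaries."""
    return [" ".join(header)] + [" ".join(row) for row in rows]


def as_records(header, rows):
    """What a table-aware parser keeps: one dict per row, keyed by column."""
    return [dict(zip(header, row)) for row in rows]


def lookup(records, layer_type, column):
    """Answer a cell-level question directly from the structured rows."""
    for record in records:
        if record["layer_type"] == layer_type:
            return record[column]
    return None


flat = as_flat_lines(HEADER, ROWS)
records = as_records(HEADER, ROWS)
print(flat[1])  # the cells run together; nothing says which O(1) is which column
print(lookup(records, "Self-Attention", "complexity_per_layer"))
```

In the flat line, the two `O(1)` cells are indistinguishable without the header alignment, which is exactly the information the structured form keeps.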
Our question-parsing step extracts a flat list of keywords. That works on a clean question against an academic paper. It starts to break down as soon as questions get harder.
Three things this minimal version doesn’t do.
It doesn’t detect intent. “Summarize chapter 3”, “Translate this clause into French”, “Compare X and Y” each call for a different downstream pipeline. A single keywords field can’t carry that signal.
It doesn’t decompose compound questions. “What are the exclusions and the deductible?” parsed as a flat keyword list pollutes the retrieval (the keywords for “exclusions” and “deductible” pull in two different scopes that interfere). Article 6 walks through how to detect compound questions, decide whether to decompose, and route the sub-questions independently.
It doesn’t detect an expected answer shape. “What is the premium amount?” wants a number with a currency. “What are the obligations?” wants a list. “Compare the two policies” wants a table. The minimal version treats every answer as free text. Article 6 introduces the expected_answer_shape field that drives the generation template downstream.
That’s the topic of Article 6: Question Parsing. The same brick, much richer JSON.
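As a rough idea of what "richer JSON" could look like, the sketch below adds the three missing signals to a parsed question. Everything here is an assumption made for illustration: the class name `RichParsedQuestion`, the `intent` and `sub_topics` fields, and the rule-based detection, which stands in for what would more plausibly be an LLM call. Only the field name `expected_answer_shape` comes from the text above.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class RichParsedQuestion:
    question: str
    intent: str                 # "lookup", "summarize", "translate", "compare"
    expected_answer_shape: str  # "number", "list", "table", "free_text"
    sub_topics: list[str] = field(default_factory=list)


def parse_question(question: str) -> RichParsedQuestion:
    """Toy rule-based parser; the rules are placeholders, not a real classifier."""
    lower = question.strip().lower()

    # Intent: taken from the leading verb, defaulting to a plain lookup.
    intent = "lookup"
    for verb in ("summarize", "translate", "compare"):
        if lower.startswith(verb):
            intent = verb

    # Expected answer shape: a few surface cues.
    if intent == "compare":
        shape = "table"
    elif re.search(r"\b(amount|how much|how many)\b", lower):
        shape = "number"
    elif lower.startswith("what are"):
        shape = "list"
    else:
        shape = "free_text"

    # Compound detection: "What are the A and the B?" -> ["a", "b"].
    match = re.match(r"what are the (.+?) and the (.+?)\?$", lower)
    sub_topics = [match.group(1), match.group(2)] if match else []

    return RichParsedQuestion(question, intent, shape, sub_topics)


print(parse_question("What are the exclusions and the deductible?"))
```

The rules are deliberately crude; the point is the shape of the output, where intent, answer shape, and sub-topics travel downstream as separate fields instead of being folded into one keyword list.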
We chose pages as the unit of retrieval. Why pages? Why not paragraphs, or sections, or fixed-size chunks of 512 tokens like every standard RAG tutorial recommends?
The answer is that page-level aggregation happens to work for this paper because pages roughly align with semantic units. On a contract, on a legal text, on a technical manual with numbered clauses, pages are arbitrary cuts and you’d want clause-level or section-level chunks instead. The “right” chunking depends on the document and the question, not on a default value.
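A small sketch shows the difference on a clause-numbered text. The four lines below are invented for the example (they are not from any real contract), and `chunk_by_page` and `chunk_by_clause` are hypothetical helpers. The clause detector assumes headings of the form `1.` or `2.3.` at the start of a line, which will not hold for every document.

```python
import re
from collections import defaultdict

# Assumption: a clause starts with a number (optionally dotted) followed by a dot.
CLAUSE_START = re.compile(r"^\d+(\.\d+)*\.\s")


def chunk_by_page(lines):
    """Group (page_num, text) lines into one chunk per page."""
    pages = defaultdict(list)
    for page_num, text in lines:
        pages[page_num].append(text)
    return [" ".join(texts) for _, texts in sorted(pages.items())]


def chunk_by_clause(lines):
    """Start a new chunk at each clause heading, ignoring page boundaries."""
    chunks = []
    for _, text in lines:
        if CLAUSE_START.match(text) or not chunks:
            chunks.append(text)
        else:
            chunks[-1] += " " + text
    return chunks


# Invented example: clause 2 straddles the page break.
lines = [
    (1, "1. Coverage applies to listed drivers."),
    (1, "2. Exclusions include racing and"),
    (2, "off-road use."),
    (2, "3. The deductible is stated in the schedule."),
]

for chunk in chunk_by_page(lines):
    print("PAGE  :", chunk)
for chunk in chunk_by_clause(lines):
    print("CLAUSE:", chunk)
```

Page chunks cut clause 2 in half and glue its tail onto clause 3; clause chunks keep each clause whole. Neither is right in general, which is the point of the paragraph above.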
The temptation, when a fixed-size approach starts failing, is to grid-search over chunk sizes and overlaps. That’s the machine learning reflex. It’s the wrong frame for what’s actually a structural decision. Article 3: RAG Is Not Machine Learning, and the Six-Month Mistake of Treating It Like One makes that case in full.
Our retrieval just worked. Page 6 came back with the matched keyword, ahead of the rest, and the Positional Encoding section is on page 6. Anyone can look at the match table and see why. That’s the trade we made: the simplest possible retrieval, completely auditable.
The trade has a cost. Keyword matching is blind whenever the question’s vocabulary doesn’t match the document’s. Three failure modes show up immediately on the same paper.
Symbol vs word. Ask “What is the value of epsilon used in label smoothing?” The keywords from question parsing are likely something like ["epsilon", "label smoothing"]. The actual answer (ε_ls = 0.1) sits on page 8, but the document writes it as the Greek letter ε, never the English word “epsilon”. The substring check returns zero on the symbol-only page; only the literal phrase label smoothing lands on page 8.
Synonym mismatch. Ask “How does the model know the order of words in a sentence?” The keywords might be ["word order", "sentence order"]. The document calls this positional encoding. None of the question’s keywords appear on page 6. The retriever picks pages that happen to mention “order” or “sentence” in passing, none of which contain the answer.
Paraphrase. Ask “What attention mechanism does the encoder use?” The document says self-attention and Multi-Head Attention, never the phrase “attention mechanism the encoder uses”. The keywords pulled from the question, even after expansion, may or may not include the document’s exact phrasing. When they do, retrieval works. When they don’t, it silently degrades.
The first two failures are so common that the rest of the series spends two articles on them.
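The first two failure modes are easy to reproduce with a few lines of substring matching. This is a minimal sketch, not the article's `retrieve_pages` function: `score_pages` is a hypothetical stand-in, and the two page texts are shortened paraphrases of pages 6 and 8, not exact quotes.

```python
from __future__ import annotations


def score_pages(pages: dict[int, str], keywords: list[str]) -> dict[int, int]:
    """Count how many keywords occur as case-insensitive substrings of each page."""
    scores = {}
    for page_num, text in pages.items():
        lower = text.lower()
        scores[page_num] = sum(1 for kw in keywords if kw.lower() in lower)
    return scores


# Shortened stand-ins for the real page text.
pages = {
    6: "positional encoding injects information about the position of the tokens",
    8: "label smoothing of value ε_ls = 0.1",
}

# Symbol vs word: "epsilon" never matches "ε"; only "label smoothing" lands.
print(score_pages(pages, ["epsilon", "label smoothing"]))

# Synonym mismatch: neither keyword appears on the page that holds the answer.
print(score_pages(pages, ["word order", "sentence order"]))
```

In the second call every page scores zero, so whatever the retriever returns is a tie-break and not a match, and nothing in the output flags it.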
The right answer is to combine, not pick a winner. The two methods fail on almost opposite cases: embeddings stumble when the question depends on a precise symbol, named term, or exact value; keywords stumble when the asker’s vocabulary doesn’t literally appear in the document. Running both retrievers, taking the union of their candidates, and (optionally) re-ranking with a cross-encoder is the standard hybrid recipe. Article 2 develops it; Articles 7 and 9 wire it into a corpus.
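The union step on its own is small. The sketch below merges two ranked page lists and records which retriever proposed each page, so the result stays auditable. `union_candidates` is a hypothetical helper written for this illustration, the page numbers in the usage line are made up, and the optional cross-encoder re-ranking is deliberately left out.

```python
from __future__ import annotations


def union_candidates(
    keyword_pages: list[int],
    embedding_pages: list[int],
) -> list[dict]:
    """Merge two ranked page lists, keeping first-seen order and provenance."""
    merged: dict[int, set[str]] = {}
    for source, pages in (("keyword", keyword_pages), ("embedding", embedding_pages)):
        for page in pages:
            merged.setdefault(page, set()).add(source)
    # Dicts preserve insertion order, so keyword hits come first, then new embedding hits.
    return [{"page_num": p, "found_by": sorted(s)} for p, s in merged.items()]


# Made-up top-k lists from the two retrievers.
for candidate in union_candidates([8, 3], [6, 8, 2]):
    print(candidate)
```

Keeping `found_by` next to each page is a design choice: a page proposed by both retrievers is a stronger candidate, and a page proposed by only one tells you which method carried the question.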
The minimal version stays single-retriever because it teaches the right reflex first: the retriever must be auditable. Keyword matching makes that reflex concrete (you can see exactly which words landed on which page). Once that reflex is in place, embeddings become a controlled addition rather than an opaque default, and combining the two becomes a deliberate engineering choice rather than a trend.
This is the block that worked best, almost too easily. We defined a Pydantic schema with start_page_num, start_line_num, end_page_num, end_line_num, confidence, justification, quotes, and caveats, and the model filled it in correctly.
How much more can we ask? A structured comparison for comparative questions, a list of conflicts if the document contradicts itself, multiple citations from multiple parts of the document, a confidence breakdown per claim. Yes to all of the above. The generation step is far more controllable than most teams realize. Article 8: Generation as Controlled Execution explores this in depth.
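"Filled it in correctly" can be checked mechanically, at least for the `quotes` field: each quote should appear in the cited span. The sketch below is one possible check, under stated assumptions. `normalize` and `quotes_in_span` are hypothetical helpers, lines are assumed to be `(page_num, line_num, text)` tuples, and the line numbers in the example are illustrative. The quote text is the hyphenated one from the AI data center answer earlier in this article.

```python
import re


def normalize(text: str) -> str:
    """Undo line-break hyphenation and collapse whitespace."""
    text = re.sub(r"-\s*\n\s*", "", text)  # "antici-\npated" -> "anticipated"
    return re.sub(r"\s+", " ", text).strip()


def quotes_in_span(quotes, lines, start, end):
    """Check each quote against the cited span.

    lines: list of (page_num, line_num, text)
    start, end: inclusive (page_num, line_num) bounds of the evidence span
    """
    span_text = normalize(
        "\n".join(text for page, line, text in lines if start <= (page, line) <= end)
    )
    return {quote: normalize(quote) in span_text for quote in quotes}


# Illustrative line numbers; the text mirrors the quote returned by the model.
lines = [
    (47, 1, "Also, faster-than-antici-"),
    (47, 2, "pated expansion of AI-related data centers could "),
    (47, 3, "boost demand for aluminum and copper, driving "),
    (47, 4, "prices higher."),
]
quote = (
    "Also, faster-than-antici-\npated expansion of AI-related data centers could \n"
    "boost demand for aluminum and copper, driving \nprices higher."
)
print(quotes_in_span([quote], lines, start=(47, 1), end=(47, 4)))
```

Because the quote and the span go through the same normalization, line breaks and end-of-line hyphens do not cause false mismatches. A quote that fails this check is a signal worth logging before the answer is shown.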
This minimal pipeline is the spine of everything that follows. Each part of the series goes deep on one of the questions raised above.
The mistakes that kill most projects come from getting the wrong picture of one of these blocks: RAG isn’t ML (Article 3), embeddings aren’t magic (Article 2), not all RAG problems look the same (Article 4). That’s Part I.
Each brick then gets its own deep dive: document parsing, question parsing, retrieval, generation. That’s Part II, the four bricks.
Once the blocks are solid, we recombine them for cases that look like production: long documents, justification and absence handling, table-of-contents-driven retrieval, listing questions, structured extraction, the composite pipeline. That’s Part III.
Then we change scale. From one document to many. From a single paper to an archive of hundreds or thousands of documents. The architecture changes substantially. That’s Part IV.
Finally, what it takes to operate the system in production: evaluation, cost and monitoring, security and compliance, the architecture of the codebase itself. That’s Part V.
The blocks don’t change. Their internals do.
A few framing notes:
Here’s the explicit map from this minimal system to the rest of the series:
You can read this

A hundred lines of Python and a Pydantic schema are enough to ship a working RAG system on a real PDF. What makes the system trustworthy is not the line count: it is the structured answer with line-level citations, the schema’s null path that refuses to fabricate, and the PDF highlight that ties every claim back to its source. The four bricks (parsing, question parsing, retrieval, generation) are the conceptual core; everything that follows in the series is about doing each one better.
The minimal version is a baseline, not a destination. The next article tackles the misconception that wrecks the most RAG projects: that RAG is a machine learning problem. It is not.
The structured-output-with-citations framing this article uses for AnswerWithEvidence is the same direction as Bohnet et al. (Attributed Question Answering, 2022). The full production-grade equivalent of this kind of pipeline shows up in Anthropic’s Contextual Retrieval (Sept 2024), which Article 9 will preview. The term RAG itself comes from Lewis et al. (2020). Volume 3 (Agentic Bricks) returns to the agentic upgrade path on top of the four bricks defined here.
Same direction as the article:
AnswerWithEvidence schema.
Different angle, different context: