Engineering Journal: The Math Behind Context-Aware PDF Extraction
Entry 42: The Blind Extraction Problem
It’s 2 AM and I’m staring at a mangled HTML output.
For the past week, our PDF processor has been using a standard, linear pipeline: rip all the line segments out of the PDF, find intersecting grids, build tables, and then dump all the text into those grids.
It works perfectly on clean, isolated tables. But the moment you feed it a real-world document—a PDF containing paragraphs, headings, and tables mixed together—the pipeline completely loses its mind.
Why? Because the parser is blind. It doesn’t understand context.
If a paragraph has a decorative underline, the LatticeReconstructor spots the horizontal line, mistakes it for a table border, and hallucinates a phantom table. If text is slightly offset inside a table cell due to PDF rendering coordinate jitter, our strict point-in-box containment logic fails, and the text simply vanishes from the final output.
I realized tonight that treating a PDF as a flat bucket of shapes and text will never scale. We have to stop reading the document sequentially and start seeing it spatially.
We need a context-aware architecture.
Entry 43: The Context Classifier and the KD-Tree Solution
I spent the morning ripping out the old sequential flow and designing the contextClassifier.
The goal is simple: before we do any extraction, we walk the document and group every item into spatially bounded, typed regions: TABLE, PARAGRAPH, HEADING, LIST, and IMAGE.
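As a working sketch of what a classified region might carry (the field names here are my assumptions, not finalized APIs):

```javascript
// Hypothetical region record produced by the contextClassifier.
// bbox and itemIndices shapes are assumptions for illustration.
const REGION_TYPES = ["TABLE", "PARAGRAPH", "HEADING", "LIST", "IMAGE"];

function makeRegion(type, bbox, itemIndices) {
  if (!REGION_TYPES.includes(type)) {
    throw new Error(`unknown region type: ${type}`);
  }
  // A spatially bounded, typed region: the classifier's unit of work.
  return { type, bbox, itemIndices };
}
```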
The hardest part was solving the phantom table problem. How do you mathematically differentiate a table border from a decorative underline?
The answer was proximity math.
I implemented a KD-tree-style proximity check (the prototype below is a plain nested scan). Instead of blindly passing every horizontal line to the table engine, we compare each H-line to each text baseline.
// Proximity check: does the text sit directly on top of the line?
for (const h of hSegs) {
  const hY = (h.y1 + h.y2) / 2;
  for (const tm of textMeta) {
    const yDist = hY - tm.vy;
    // Underline: line sits within -1 to 5 px below the text baseline
    if (yDist >= -1 && yDist <= 5 && overlappingXSpan(tm, h)) {
      underlineSegIds.add(h.id);
      break; // one matching baseline is enough to tag this segment
    }
  }
}
If a line sits within -1 to 5 pixels below a text baseline (the small negative tolerance absorbs rendering jitter), it gets tagged as an underline and removed from the table pool. This one geometric heuristic eliminated 99% of our hallucinated tables.
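The snippet above leans on an overlappingXSpan helper that I didn't paste. A minimal sketch, assuming each text item carries vx (left edge) and width, and each segment carries x1/x2 (those field names are assumptions):

```javascript
// True when the text item's horizontal extent overlaps the segment's span.
// Assumed shapes: tm = { vx, width }, h = { x1, x2 }.
function overlappingXSpan(tm, h) {
  const tx1 = tm.vx;
  const tx2 = tm.vx + tm.width;
  const hx1 = Math.min(h.x1, h.x2); // segments may be stored right-to-left
  const hx2 = Math.max(h.x1, h.x2);
  return tx1 <= hx2 && hx1 <= tx2; // standard 1-D interval overlap test
}
```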
Entry 44: Region Scoping and the Nearest-Cell Metric
Once the LatticeReconstructor builds a true table grid, we calculate its exact bounding box.
Instead of dumping the entire page's text into the table builder, the classifier scoops up only the text items that physically live inside that bounding box.
const tableTextIndices = [];
for (const tm of textMeta) {
  if (insideBBox(tm.vx, tm.vy, bbox)) {
    tableTextIndices.push(tm.idx);
    assignedTextIndices.add(tm.idx); // Mark as consumed!
  }
}
By marking text items as "consumed" the moment they are claimed by a region, we guarantee that table text never leaks into paragraphs, and paragraph text never gets sucked into tables.
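For completeness, insideBBox is just an axis-aligned containment test; a minimal sketch, assuming the bbox shape is { x1, y1, x2, y2 } with x1 <= x2 and y1 <= y2:

```javascript
// Axis-aligned point-in-box test. The bbox field names are assumptions.
function insideBBox(x, y, bbox) {
  return x >= bbox.x1 && x <= bbox.x2 && y >= bbox.y1 && y <= bbox.y2;
}
```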
But what about the coordinate jitter dropping text inside the cells?
I threw out the strict point-in-box checks. They are too brittle for PDF data. I replaced them with a nearest-neighbor proximity model. For every piece of scoped text, we calculate the Euclidean distance to every cell center in the grid. If it's within a 15-pixel threshold, it snaps into the nearest cell like a magnet.
The data loss dropped to zero.
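The nearest-cell metric looks roughly like this. A sketch under my assumptions: cells carry precomputed centers (cx, cy), text items carry (vx, vy), and the 15-pixel threshold comes straight from the entry above:

```javascript
const SNAP_THRESHOLD = 15; // px, per the journal entry

// Returns the nearest cell if its center is within the threshold, else null.
// Assumed shapes: tm = { vx, vy }, cell = { cx, cy, row, col }.
function snapToNearestCell(tm, cells) {
  let best = null;
  let bestDist = Infinity;
  for (const cell of cells) {
    const d = Math.hypot(cell.cx - tm.vx, cell.cy - tm.vy); // Euclidean distance
    if (d < bestDist) {
      bestDist = d;
      best = cell;
    }
  }
  return bestDist <= SNAP_THRESHOLD ? best : null;
}
```

Returning null (instead of force-snapping) keeps genuinely stray text from being pulled into the table.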
Entry 45: The Page Assembler
It's finally coming together.
I just finished writing the pageAssembler. It takes the array of perfectly classified, non-overlapping regions and sorts them top-to-bottom based on their Y-coordinates.
Then, it iterates through the sorted regions and delegates rendering to the right extractor:
- TABLE regions go to buildTable().
- HEADING regions get wrapped in <h3> or <h4>.
- LIST regions get stripped of their PDF bullet characters and wrapped in clean <ul><li> tags.
- PARAGRAPH regions go to the textRebuilder.
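The assembler's core loop is simple. A sketch with hypothetical names (assemblePage and the extractors map are my stand-ins, not the real function names), assuming each region carries its type and top Y-coordinate:

```javascript
// Sort classified regions top-to-bottom, then delegate rendering by type.
// Assumed region shape: { type: "TABLE" | "HEADING" | "LIST" | "PARAGRAPH", y, ... }.
function assemblePage(regions, extractors) {
  const sorted = [...regions].sort((a, b) => a.y - b.y); // reading order
  const html = [];
  for (const region of sorted) {
    const render = extractors[region.type];
    if (render) html.push(render(region)); // skip types with no extractor
  }
  return html.join("\n");
}
```

Because regions are non-overlapping and each text item was consumed exactly once, the Y-sort alone recovers reading order.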
I ran sample-tables.pdf through the new pipeline.
The output wasn't a mangled wall of text. It was true document reading order. The HTML was clean, semantically correct, and visually identical to the original PDF structure.
We successfully taught the parser how to see. No backend processing. No AI guessing. Just pure, deterministic spatial math running directly in the browser.
Time to commit.