How to Stop PDF Parsers from Hallucinating Tables out of Thin Air
PDF extraction is usually blind.
If you've ever tried to write a script to scrape a PDF, you know exactly what I mean. You run the PDF through a generic text extractor, and instead of a clean table, you get a jammed wall of text where the columns are violently shoved into a single vertical stack.
Or worse, you try to use a table extractor, and it hallucinates tables everywhere. See a bold heading with an underline? The parser thinks that's a 1x1 table. See a horizontal divider between paragraphs? Boom, phantom table.
Why does this happen? Because most PDF parsers process the document in a strict, sequential pipeline. They look at all the lines. They look at all the text. And they just smash them together.
I got tired of this. So I re-engineered the extraction pipeline in our PDF processor to stop reading the document like a machine, and start seeing it like a human.
Here is the math behind Context-Aware PDF Extraction.
1. The Blind Extraction Problem
Previously, our extraction pipeline worked like this:
- Find all horizontal and vertical line segments (H-segs and V-segs).
- Run them through a LatticeReconstructor to find intersecting grids.
- Treat every grid as a table.
- Dump all the text in the document into those grids using a strict "is this point inside this box" check.
If a paragraph had a decorative underline, the LatticeReconstructor would see the H-line, panic, and try to build a table out of it. If text was slightly offset inside a table cell due to coordinate jitter, the "point-in-box" check would fail, and the text would just vanish from the output.
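To make that concrete, here is roughly what the strict check looked like. This is a simplified sketch: the bbox field names are my own, but insideBBox and the vx/vy coordinates show up again below.

// The old, strict "is this point inside this box" check.
// Assumes a bbox of shape { x0, y0, x1, y1 } in viewport coordinates.
function insideBBox(x, y, bbox) {
  return x >= bbox.x0 && x <= bbox.x1 && y >= bbox.y0 && y <= bbox.y1;
}

// A cell whose left edge sits at x: 10.5 silently drops text rendered at x: 10.4:
const cell = { x0: 10.5, y0: 100, x1: 80, y1: 120 };
insideBBox(10.4, 110, cell); // false -> the text vanishes from the output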
I needed the parser to understand context.
2. Enter the Context Classifier
To fix this, I built the contextClassifier.
Instead of treating the PDF as a bucket of shapes and text, the contextClassifier walks the document and groups every single item into spatially bounded, typed regions: TABLE, PARAGRAPH, HEADING, LIST, and IMAGE.
But how do you tell a machine the difference between a table border and a decorative underline?
You use proximity math.
// Proximity check: does any text item sit directly on top of this H-line?
// (Brute-force nested loop here; a KD-tree would speed it up.)
for (const h of hSegs) {
const hY = (h.y1 + h.y2) / 2;
for (const tm of textMeta) {
const yDist = hY - tm.vy;
// Underline: line sits roughly 0–5px below the text baseline (1px jitter tolerance)
if (yDist >= -1 && yDist <= 5 && overlappingXSpan(tm, h)) {
underlineSegIds.add(h.id);
break;
}
}
}
If a horizontal line sits within roughly 0 to 5 pixels below a text baseline, and its span roughly matches the text width, it's not a table border. It's an underline.
By tagging and removing these underlines before we run the table reconstruction, we eliminate 99% of phantom tables.
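For completeness, here is a rough sketch of the overlappingXSpan helper used above. The vw width field and the 80% / 10px thresholds are assumptions, not gospel:

// Rough x-span match: the line should cover most of the text's width
// without extending far past it (a real table border usually runs much wider).
function overlappingXSpan(tm, h, coverage = 0.8, slackPx = 10) {
  const textX0 = tm.vx;
  const textX1 = tm.vx + tm.vw;
  const lineX0 = Math.min(h.x1, h.x2);
  const lineX1 = Math.max(h.x1, h.x2);
  const overlap = Math.min(textX1, lineX1) - Math.max(textX0, lineX0);
  const coversText = overlap >= (textX1 - textX0) * coverage;
  const notMuchWider = (lineX1 - lineX0) <= (textX1 - textX0) + slackPx;
  return coversText && notMuchWider;
}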
3. Scoping the Text (No More Collisions)
Once the tables are detected, we calculate the exact bounding box of the table grid.
Instead of throwing all the document's text at the table builder, the classifier scoops up only the text items that physically live inside that bounding box.
const tableTextIndices = [];
for (const tm of textMeta) {
if (insideBBox(tm.vx, tm.vy, bbox)) {
tableTextIndices.push(tm.idx);
assignedTextIndices.add(tm.idx); // Mark as consumed!
}
}
This does two things:
- It guarantees that table text doesn't accidentally leak into paragraphs.
- It guarantees that paragraph text doesn't get sucked into table cells.
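
The flip side: anything a table region didn't consume is, by definition, non-table text. A sketch of how the leftovers get routed to the prose path (paragraphTextIndices is a name I'm making up here):

// Whatever the table regions didn't claim flows to the paragraph/heading/list extractors.
const paragraphTextIndices = [];
for (const tm of textMeta) {
  if (!assignedTextIndices.has(tm.idx)) {
    paragraphTextIndices.push(tm.idx);
  }
}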
4. The Nearest-Cell Proximity Assignment
Even with scoped text, getting the text into the correct table cell was still failing due to PDF rendering quirks. A cell might be at x: 10.5, but the text was at x: 10.4. A strict bounding box check would drop the text.
I ripped out the strict containment checks and replaced them with a nearest-neighbor proximity model.
For every piece of text, we find its nearest cell center using Euclidean distance. If it's within a 15px threshold, it snaps into place. No more jitter. No more dropped data.
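Here is a minimal sketch of that snap, assuming each cell carries a precomputed center (cx, cy) and that the 15px threshold is measured in the same viewport units as the text coordinates:

const SNAP_THRESHOLD = 15; // px, in viewport units

// Snap a text item to the closest cell center, if one is close enough.
function assignToNearestCell(tm, cells) {
  let best = null;
  let bestDist = Infinity;
  for (const cell of cells) {
    const dist = Math.hypot(tm.vx - cell.cx, tm.vy - cell.cy); // Euclidean distance
    if (dist < bestDist) {
      bestDist = dist;
      best = cell;
    }
  }
  // Within the threshold: snap into place. Outside it: leave unassigned instead of guessing.
  return bestDist <= SNAP_THRESHOLD ? best : null;
}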
5. The Page Assembler
Finally, the pageAssembler takes over.
It receives an array of perfectly classified, non-overlapping regions. It sorts them top-to-bottom based on their Y-coordinates.
Then, it just iterates through them and calls the right extractor:
- If it's a TABLE, it sends the scoped text to the tableBuilder.
- If it's a HEADING, it wraps it in an <h3> or <h4>.
- If it's a LIST, it strips the bullet points and outputs clean <ul><li> tags.
- If it's a PARAGRAPH, it sends it to the textRebuilder.
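
In sketch form, the assembler is little more than a sort and a switch. The exact shapes of tableBuilder, textRebuilder, and the region objects below are simplified assumptions:

function assemblePage(regions) {
  // Reading order: top-to-bottom by each region's top edge.
  regions.sort((a, b) => a.bbox.y0 - b.bbox.y0);

  let html = '';
  for (const region of regions) {
    switch (region.type) {
      case 'TABLE':
        html += tableBuilder(region); // gets only the text scoped to this table
        break;
      case 'HEADING':
        html += `<h3>${region.text}</h3>`; // or <h4>, depending on the heading level
        break;
      case 'LIST':
        html += `<ul>${region.items.map((li) => `<li>${li}</li>`).join('')}</ul>`;
        break;
      default: // PARAGRAPH and everything else
        html += textRebuilder(region);
    }
  }
  return html;
}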
You upload a messy, complex PDF filled with tables, paragraphs, and lists. The pipeline classifies it, scopes the data, and spits out clean, semantically correct HTML.
No backend processing. No AI hallucination. Just pure, deterministic math running directly in your browser using pdfjs-dist and vanilla JS.
The PDF is finally readable.