PDF Processor

Mapping the 2x2 of Frontend PDF Extraction (And the Empty Quadrant)

2026-07-11

TLDR: Two axes describe almost every PDF extraction project: deterministic vs ML-based, visual fidelity vs semantic structure. That produces four cells. Three are crowded with serious tools. One is empty. This post catalogs what's in each occupied cell and names the architectural trade each tool made to land where it landed. This is part 1 of a series on the design space of frontend PDF extraction.

Repo: tools/pdf-processor

The 2x2 grid

                   DETERMINISTIC                  ML-BASED
                   ─────────────                  ────────
SEMANTIC           pdfplumber, Tabula,            Adobe Extract API,
STRUCTURE          Camelot, PyMuPDF               Textract, Azure DI,
                   (backend)                      transformers.js + layout models
                                                  (frontend)

VISUAL             pdf2htmlEX                     (empty)
FIDELITY           (frontend WASM)                (frontend)

TEXT-ONLY          pdfreader, pdf-extract,        tesseract.js
STREAM             the naive getTextContent()     (OCR over rendered canvas)
                   recipe

Three observations fall out immediately.

Backend dominates the deterministic-structural cell. Everything serious about extracting structure from PDFs without ML lives on a server. pdfplumber, Tabula, Camelot, PyMuPDF: all backend, mostly Python (Tabula's engine is Java), all carrying years of accumulated implementation knowledge.

Frontend is well-represented but compromised. Each frontend project gives up something significant. We'll catalog the trades below.

There's a frontend cell that's empty. Deterministic. Structural. No ML weights. No raster step. No backend. That's the empty quadrant. We'll come back to why it stayed empty in part 3.

The naive 95 percent

Before we look at the four occupied frontend cells, here is the baseline. Roughly 95 percent of frontend PDF extraction code in the wild does this:

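import * as pdfjsLib from 'pdfjs-dist';
// Assumes pdfjsLib.GlobalWorkerOptions.workerSrc has already been set for your build.

// bytes: a Uint8Array or ArrayBuffer holding the PDF file
const pdf = await pdfjsLib.getDocument(bytes).promise;
let text = '';
for (let i = 1; i <= pdf.numPages; i++) { // pdf.js pages are 1-indexed
  const page = await pdf.getPage(i);
  const content = await page.getTextContent();
  // Join every text run with a space; all position data is discarded here
  text += content.items.map(it => it.str).join(' ') + '\n';
}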

This works on a memo. It collapses on a two-column research paper. It liquefies on a table. It can't tell a heading from a paragraph. It has no concept of reading order on a complex page.

Everything beyond this baseline is a project trying harder. There are four serious such projects. None of them sits in the deterministic-structural-frontend cell.

Cell A: pdf2htmlEX, visual fidelity, no semantics

pdf2htmlEX is a long-standing C++ project; the frontend variant considered here is a WASM build of it. It walks the PDF and emits absolutely-positioned <div>s that visually reproduce the source.

<div style="position:absolute; top:124px; left:88px; font-size:11pt">A table cell</div>
<div style="position:absolute; top:124px; left:240px; font-size:11pt">Another</div>
<div style="position:absolute; top:124px; left:392px; font-size:11pt">Cell</div>

If you want to render the PDF in a browser and let the user select text, this is unbeatable.

If you want any semantic structure (a <table>, an <h1>, a paragraph block), you're back to scraping divs by their bounding boxes. The same problem the user started with.
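To make "scraping divs by their bounding boxes" concrete, here is a minimal sketch that recovers rows from output shaped like the snippet above. The helper name `rowsFromPositionedDivs` and the regex-based parsing are illustrative assumptions, not anything pdf2htmlEX ships; real output would call for a DOM parser and a tolerance on `top`.

```javascript
// Hypothetical helper: recover rows from absolutely-positioned divs.
// Assumes each div carries inline `top` and `left` values in px,
// as in the illustrative snippet above.
function rowsFromPositionedDivs(html) {
  const divPattern = /<div style="([^"]*)">([^<]*)<\/div>/g;
  const rows = new Map(); // top (px) -> cells found at that height
  for (const [, style, text] of html.matchAll(divPattern)) {
    const top = Number(/top:\s*(-?[\d.]+)px/.exec(style)?.[1]);
    const left = Number(/left:\s*(-?[\d.]+)px/.exec(style)?.[1]);
    if (Number.isNaN(top) || Number.isNaN(left)) continue; // skip unpositioned divs
    if (!rows.has(top)) rows.set(top, []);
    rows.get(top).push({ left, text });
  }
  // Order rows top-to-bottom, and cells left-to-right within each row.
  return [...rows.entries()]
    .sort(([a], [b]) => a - b)
    .map(([, cells]) => cells.sort((a, b) => a.left - b.left).map(c => c.text));
}
```

Even when this works, the result is only rows of strings. Nothing in the markup says whether three aligned divs are a table row or three words that happen to share a baseline.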

Cell B: tesseract.js, OCR

Render each page to canvas. Run OCR on the canvas. Get text and bounding boxes back.

This is the right answer for scanned PDFs that have no native text layer.

It's the wrong answer for digital PDFs that already have perfect text. You're feeding selectable text through an image-to-text model and getting a degraded copy of what was already there. Plus a 2MB WASM payload, before any language data is downloaded. Plus seconds-per-page latency.
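One way to avoid running OCR on pages that don't need it is to check for a native text layer first. The sketch below is a hedged illustration: `hasNativeTextLayer` and the `minChars` threshold are hypothetical, and the input is assumed to be the `items` array returned by pdf.js `page.getTextContent()`.

```javascript
// Hypothetical helper: decide whether a page already has usable native text.
// `items` is assumed to be the `items` array from pdf.js getTextContent().
// `minChars` is an arbitrary illustrative threshold, not a tuned value.
function hasNativeTextLayer(items, minChars = 20) {
  let chars = 0;
  for (const item of items) {
    // Entries without a string `str` (e.g. marked-content markers) are skipped.
    if (typeof item.str !== 'string') continue;
    chars += item.str.replace(/\s+/g, '').length; // count non-whitespace only
    if (chars >= minChars) return true;
  }
  return false;
}
```

A caller would send only the pages where this returns false to OCR. Note that some scanned PDFs carry an invisible text layer from an earlier OCR pass, which this check would treat as native text.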

Cell C: transformers.js + layout models

Load a layout-aware model (DocLayout-YOLO, LayoutLM, or similar) into the browser via ONNX Runtime Web or transformers.js. Render each page to canvas. Run inference. Get back labeled regions: TABLE, TEXT, FIGURE.

This is where the modern industry is heading. It works on weird, varied document types. It generalizes.

But the trade-offs are real:

The model weights are megabytes (DocLayout-YOLO Nano alone is ~6MB ONNX).
First inference takes seconds.
Failure modes are opaque. When the model misclassifies, you have no levers.
You're shipping an ML inference engine to do something that, for digital PDFs, can be done with pure geometry.
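The "no levers" point is easier to see in the code that consumes the model output. The sketch below assumes a hypothetical output shape (`{ label, box }` regions) and hypothetical text items with `x` and `y` already converted into the same coordinate space; no real model or library returns exactly this.

```javascript
// Hypothetical shapes, for illustration only:
//   region = { label: 'TABLE' | 'TEXT' | 'FIGURE', box: [x0, y0, x1, y1] }
//   item   = { str, x, y }  (same coordinate space as the boxes)
function labelItemsByRegion(regions, items) {
  return items.map(item => {
    // Take the first predicted region whose box contains the item's origin.
    const hit = regions.find(({ box: [x0, y0, x1, y1] }) =>
      item.x >= x0 && item.x <= x1 && item.y >= y0 && item.y <= y1);
    // Items outside every predicted region stay unlabeled.
    return { ...item, label: hit ? hit.label : null };
  });
}
```

If the model draws a TABLE box one row too short, the last row silently comes back with `label: null`. The only fix available at this layer is a different model or a different threshold.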

Cell D: pdfreader, pdf-extract, and the Y-clustering camp

These libraries take getTextContent() items, cluster them by Y position, sort by X, and produce slightly more structured output than the flat-blob recipe.
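A minimal sketch of that clustering step, assuming each item has the pdf.js shape of a `str` plus a `transform` array whose last two entries are the x and y origin. The function name and the `tolerance` default are illustrative choices, not taken from any of these libraries.

```javascript
// Sketch of Y-clustering over pdf.js-style text items.
// Each item is assumed to have `str` and `transform` ([a, b, c, d, x, y]).
// `tolerance` (in PDF units) is an illustrative guess, not a tuned value.
function clusterIntoLines(items, tolerance = 2) {
  // PDF y grows upward, so sort descending to read top-to-bottom.
  const sorted = [...items].sort((p, q) => q.transform[5] - p.transform[5]);
  const lines = [];
  for (const item of sorted) {
    const y = item.transform[5];
    const current = lines[lines.length - 1];
    if (current && Math.abs(current.y - y) <= tolerance) {
      current.items.push(item); // close enough to the current line's y
    } else {
      lines.push({ y, items: [item] }); // start a new line
    }
  }
  // Within each line, order by x and join the strings.
  return lines.map(line =>
    line.items
      .sort((p, q) => p.transform[4] - q.transform[4])
      .map(it => it.str)
      .join(' ')
  );
}
```

On a two-column page this returns lines like "Left 1 Right 1": items that share a baseline are merged across the column gutter, because nothing in the text stream marks where one column ends.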

The fundamental limit: they only consume the text content. They never call getOperatorList(). They cannot see vector lines. They cannot detect a table border, distinguish an underline from a horizontal rule, or recognize a chart axis. Their world is text and only text.

For prose-heavy documents, that's fine. For anything with tables, they degrade to row-smashing.
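To see what the text-only camp leaves on the table, the sketch below counts path-construction operators in a page's operator list. It assumes `opList` is the result of pdf.js `page.getOperatorList()` and `OPS` is `pdfjsLib.OPS`, and that path drawing shows up as `OPS.constructPath` entries in `fnArray`. The layout of the matching `argsArray` entries differs between pdf.js versions, so the sketch does not decode them. `countPathOps` is a hypothetical name.

```javascript
// Sketch: count vector path operators on a page.
// `opList` is assumed to be the result of pdf.js page.getOperatorList();
// `OPS` is assumed to be pdfjsLib.OPS (passed in to keep this function pure).
// Assumption: path drawing appears as OPS.constructPath entries in fnArray.
function countPathOps(opList, OPS) {
  let count = 0;
  for (const fn of opList.fnArray) {
    if (fn === OPS.constructPath) count++;
  }
  return count;
}
```

In a browser this would be called as `countPathOps(await page.getOperatorList(), pdfjsLib.OPS)`. A count of zero rules out table borders drawn as vectors; a non-zero count only says there is geometry worth inspecting, since clipping paths and decorative shapes are counted too.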

What's missing

Each of the four occupied frontend cells is well-architected for its quadrant. Each makes a deliberate trade.

pdf2htmlEX trades semantics for visual fidelity.
tesseract.js trades the native text layer for OCR robustness.
transformers.js trades deterministic geometry for ML generalization.
pdfreader and friends trade vector awareness for simplicity.

None of them gives you all four properties: browser-native, no ML weights, vector-aware, semantically structured. That's the empty quadrant.

Filling it required eight specific architectural moves. That's part 2.
