Engineering Journal
Pdf Processor

Mapping the 2x2 of Frontend PDF Extraction (And the Empty Quadrant)

2026-05-30

TLDR: Two axes describe almost every PDF extraction project: deterministic vs ML-based, visual fidelity vs semantic structure. That produces four cells. Three are crowded with serious tools. One is empty. This post catalogs what's in each occupied cell and names the architectural trade each tool made to land where it landed. This is part 1 of a series on the design space of frontend PDF extraction.

Repo: tools/pdf-processor

The 2x2 grid

                 DETERMINISTIC                  ML-BASED
                 ───────────────                ────────
  SEMANTIC       pdfplumber, Tabula,            Adobe Extract API,
  STRUCTURE      Camelot, PyMuPDF               Textract, Azure DI,
  (backend)                                     transformers.js + layout
                                                models (frontend)

  VISUAL         pdf2htmlEX                     (empty)
  FIDELITY       (frontend WASM)                (frontend)

  TEXT-ONLY      pdfreader, pdf-extract,        tesseract.js
  STREAM         the naive getTextContent()     (OCR over rendered canvas)
                 recipe

Three observations fall out immediately.

Backend dominates the deterministic-structural cell. Everything serious about extracting structure from PDFs without ML lives on a server. pdfplumber, Tabula, Camelot, PyMuPDF: all backend, nearly all Python (Tabula is Java underneath, with tabula-py as a wrapper), all built on years of accumulated implementation knowledge.

Frontend is well-represented but compromised. Each frontend project gives up something significant. We'll catalog the trades below.

There's a frontend cell that's empty. Deterministic. Structural. No ML weights. No raster step. No backend. That's the empty quadrant. We'll come back to why it stayed empty in part 3.

The naive 95 percent

Before we look at the four occupied frontend cells, here is the baseline. Roughly 95 percent of frontend PDF extraction code in the wild does this:

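import * as pdfjsLib from 'pdfjs-dist';

// `bytes` is assumed to be a Uint8Array holding the PDF file.
const pdf = await pdfjsLib.getDocument({ data: bytes }).promise;
let text = '';
// pdf.js pages are 1-indexed.
for (let i = 1; i <= pdf.numPages; i++) {
  const page = await pdf.getPage(i);
  const content = await page.getTextContent();
  // Join every text run in stream order; all position data is discarded.
  text += content.items.map(it => it.str).join(' ') + '\n';
}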

This works on a memo. It collapses on a two-column research paper. It liquefies on a table. It can't tell a heading from a paragraph. It has no concept of reading order on a complex page.

Everything beyond this baseline is a project trying harder. There are four serious such projects. None of them sits in the deterministic-structural-frontend cell.

Cell A: pdf2htmlEX, visual fidelity, no semantics

pdf2htmlEX is an old C++ project, run in the browser here as a WASM port. It walks the PDF and emits absolutely positioned <div>s that visually reproduce the source.

<div style="position:absolute; top:124px; left:88px; font-size:11pt">A table cell</div>
<div style="position:absolute; top:124px; left:240px; font-size:11pt">Another</div>
<div style="position:absolute; top:124px; left:392px; font-size:11pt">Cell</div>

If you want to render the PDF in a browser and let the user select text, this is unbeatable.

If you want any semantic structure (a <table>, an <h1>, a paragraph block), you're back to scraping divs by their bounding boxes. The same problem the user started with.
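To make "scraping divs by their bounding boxes" concrete, here is a minimal sketch that regroups positioned fragments into rows. It assumes the exact inline-style shape of the sample markup above. parsePositionedDivs and groupIntoRows are hypothetical helper names, not part of pdf2htmlEX, and the regular expression is only good enough for this markup, not for arbitrary HTML.

```javascript
// Hypothetical helpers for illustration; not part of pdf2htmlEX.
// Assumes every fragment looks like the sample markup: inline `top` and
// `left` in px, and plain text with no nested tags.
const DIV_PATTERN =
  /<div style="[^"]*?top:(\d+(?:\.\d+)?)px;\s*left:(\d+(?:\.\d+)?)px[^"]*">([^<]*)<\/div>/g;

function parsePositionedDivs(html) {
  const fragments = [];
  for (const match of html.matchAll(DIV_PATTERN)) {
    fragments.push({ top: Number(match[1]), left: Number(match[2]), text: match[3] });
  }
  return fragments;
}

// Fragments whose `top` is within `tolerance` px of a row's first fragment
// join that row; each row is then ordered left to right.
function groupIntoRows(fragments, tolerance = 2) {
  const sorted = [...fragments].sort((a, b) => a.top - b.top || a.left - b.left);
  const rows = [];
  for (const fragment of sorted) {
    const current = rows[rows.length - 1];
    if (current && Math.abs(current.top - fragment.top) <= tolerance) {
      current.cells.push(fragment);
    } else {
      rows.push({ top: fragment.top, cells: [fragment] });
    }
  }
  return rows.map(row =>
    row.cells.sort((a, b) => a.left - b.left).map(cell => cell.text)
  );
}
```

The output is an array of rows of strings. Nothing in it says whether a row is a table row, a heading, or a line of prose, which is exactly the semantic gap described above.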

Cell B: tesseract.js, OCR

Render each page to canvas. Run OCR on the canvas. Get text and bounding boxes back.

This is the right answer for scanned PDFs that have no native text layer.

It's the wrong answer for digital PDFs that already have perfect text. You're feeding selectable text through an image-to-text model and getting a degraded copy of what was already there. Plus a 2MB WASM payload. Plus seconds-per-page latency.
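One way to keep OCR confined to the pages that need it is to check the native text layer first. A minimal sketch, assuming `items` is the items array that pdf.js getTextContent() returns. The hasNativeText name and the minChars threshold are illustrative assumptions, not a tuned heuristic.

```javascript
// Hypothetical helper: decide whether a page already carries usable text.
// `items` is assumed to be the `items` array from pdf.js getTextContent(),
// where text items expose a `str` string.
function hasNativeText(items, minChars = 20) {
  let visibleChars = 0;
  for (const item of items) {
    // Entries without a string `str` (e.g. marked-content markers) are skipped.
    if (typeof item.str !== 'string') continue;
    // Count only non-whitespace characters.
    visibleChars += item.str.replace(/\s+/g, '').length;
    if (visibleChars >= minChars) return true;
  }
  return false;
}
```

Pages where this returns false are candidates for the render-to-canvas OCR path. Pages where it returns true can skip OCR entirely.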

Cell C: transformers.js + layout models

Load a layout-aware model (DocLayout-YOLO, LayoutLM, or similar) into the browser via ONNX Runtime Web or transformers.js. Render each page to canvas. Run inference. Get back labeled regions: TABLE, TEXT, FIGURE.

This is where the modern industry is heading. It works on weird, varied document types. It generalizes.
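Labeled regions are only boxes, so the native text still has to be attached to them afterwards. The sketch below shows one way to do that. The region and item shapes are assumptions made for illustration, not the output format of transformers.js or any specific model, and assignTextToRegions is a hypothetical helper.

```javascript
// Hypothetical shapes: region = { label, x, y, width, height },
// item = { str, x, y }, with (x, y) already converted into the same
// coordinate space as the regions (e.g. canvas pixels).
function assignTextToRegions(regions, items) {
  const result = regions.map(region => ({ label: region.label, text: [] }));
  const unassigned = [];
  for (const item of items) {
    // First region whose box contains the item's origin wins.
    const index = regions.findIndex(r =>
      item.x >= r.x && item.x <= r.x + r.width &&
      item.y >= r.y && item.y <= r.y + r.height
    );
    if (index === -1) unassigned.push(item.str);
    else result[index].text.push(item.str);
  }
  return { regions: result, unassigned };
}
```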

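But the trade-offs are real: model weights have to ship to the browser, and every page goes through a raster step before inference.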

Cell D: pdfreader, pdf-extract, and the Y-clustering camp

These libraries take getTextContent() items, cluster them by Y position, sort by X, and produce slightly more structured output than the flat-blob recipe.
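A minimal sketch of that recipe, for reference. It is not the code of pdfreader or pdf-extract. It assumes items shaped like pdf.js text items, where transform[4] and transform[5] hold the x and y of the text origin, and the clusterIntoLines name and yTolerance default are illustrative.

```javascript
// Sketch of the Y-clustering recipe. `items` is assumed to have the shape of
// pdf.js text items: `str` plus a `transform` matrix whose elements 4 and 5
// are the x and y of the text origin in PDF space (y grows upward).
function clusterIntoLines(items, yTolerance = 2) {
  const sorted = items
    .filter(item => typeof item.str === 'string' && item.str.trim() !== '')
    .map(item => ({ str: item.str, x: item.transform[4], y: item.transform[5] }))
    .sort((a, b) => b.y - a.y); // top of page first

  const lines = [];
  for (const item of sorted) {
    const line = lines[lines.length - 1];
    // Items within `yTolerance` of the line's first item share that line.
    if (line && Math.abs(line.y - item.y) <= yTolerance) line.items.push(item);
    else lines.push({ y: item.y, items: [item] });
  }
  // Within each line, order left to right and join with spaces.
  return lines.map(line =>
    line.items.sort((a, b) => a.x - b.x).map(item => item.str).join(' ')
  );
}
```

Because Y is the only grouping signal, two text runs at the same height always land on the same line, whether they belong to one sentence, two columns, or two table cells.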

The fundamental limit: they only consume the text content. They never call getOperatorList(). They cannot see vector lines. They cannot detect a table border, distinguish an underline from a horizontal rule, or recognize a chart axis. Their world is text and only text.
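The vector drawing these libraries never look at is available from the same page object. The sketch below assumes `opList` is what pdf.js page.getOperatorList() resolves to (an object with an fnArray of numeric operator codes) and `OPS` is the operator table pdf.js exports. countPathOps is a hypothetical helper, and because the argument layout of path operators differs between pdf.js versions, it only counts them and does not decode coordinates.

```javascript
// Hypothetical helper: count path-construction operators on a page.
// `opList.fnArray` holds numeric operator codes; `OPS` maps names to codes.
function countPathOps(opList, OPS) {
  let count = 0;
  for (const fn of opList.fnArray) {
    if (fn === OPS.constructPath) count += 1;
  }
  return count;
}

// Usage in a pdf.js context (not run here):
//   const opList = await page.getOperatorList();
//   const paths = countPathOps(opList, pdfjsLib.OPS);
```

A non-zero count means the page draws vector paths of some kind (rules, borders, chart axes). That signal never reaches a library that reads only getTextContent().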

For prose-heavy documents, that's fine. For anything with tables, they degrade to row-smashing.

What's missing

Each of the four occupied frontend cells is well-architected for its quadrant. Each makes a deliberate trade.

None of them gives you all four properties: browser-native, no ML weights, vector-aware, semantically structured. That's the empty quadrant.

Filling it required eight specific architectural moves. That's part 2.
