Engineering Journal

The Empty Quadrant: Mapping the Design Space of Frontend PDF Extraction

2026-05-11

A user asked me a sharp question yesterday:

"Looking at your extraction pipeline (pdfjs + geometryWorker + lattice + visualGridMapper), what makes this any different from any other extraction approach? Frontend only, no backend or compiled engine."

It's the right question to ask any author of a tool. So I sat down and surveyed the space honestly. What I found was more interesting than my gut answer.

The pipeline isn't different because of clever algorithms. The lattice reconstruction is the same lattice reconstruction every server-side tool uses. The KD-tree proximity is a textbook nearest-neighbor query. Y-band paragraph clustering is in a 1996 paper. The math is borrowed.

What's different is the quadrant of the design space the pipeline occupies, and the architectural commitments it took to land there.

This post maps that design space. It catalogs what's already in each cell, identifies the empty one, and explains why it stayed empty long enough for a niche to form.


1. The grid

Two axes describe almost every PDF extraction project I've encountered: how structure is recovered (deterministic geometry or ML inference), and what comes out (semantic structure, visual fidelity, or a raw text stream).

Plot them and you get six cells.
                 DETERMINISTIC                 ML-BASED
                ───────────────                ────────
  SEMANTIC      pdfplumber, Tabula,            Adobe Extract API,
  STRUCTURE     Camelot, PyMuPDF               Textract, Azure DI,
                (backend)                      transformers.js + layout
                                               models (frontend)

  VISUAL        pdf2htmlEX                     —
  FIDELITY      (frontend WASM)

  TEXT-ONLY     pdfreader, pdf-extract,        tesseract.js
  STREAM        the naive getTextContent()     (OCR over rendered canvas)
                recipe

Three observations fall out of this map immediately.

Backend dominates the deterministic-structural cell. Everything serious about extracting structure from PDFs without ML lives on a server. pdfplumber, Tabula, Camelot, PyMuPDF — all Python, all backend, all carrying years of accumulated implementation knowledge.

Frontend is well-represented but compromised. Each frontend project gives up something significant. pdf2htmlEX reproduces visual appearance perfectly but ships zero semantic structure. tesseract.js works on scanned PDFs but throws away the native text layer that digital PDFs hand you for free. The transformers.js + layout-model approach handles weird documents but ships multi-megabyte model weights and opaque failure modes. The naive getTextContent() recipe and its Y-clustering descendants give you a flat blob and don't read the operator list at all.

There's a frontend cell that's empty. Deterministic. Structural. No ML weights. No raster step. No backend.

That empty cell is where this pipeline sits.


2. The naive 95 percent

Before we look at what fills the four occupied cells, it's worth establishing the baseline. Roughly 95 percent of frontend PDF extraction code in the wild does this:

const pdf = await pdfjsLib.getDocument(bytes).promise;
let text = '';
for (let i = 1; i <= pdf.numPages; i++) {
  const page = await pdf.getPage(i);
  const content = await page.getTextContent();
  text += content.items.map(it => it.str).join(' ') + '\n';
}

This works on a memo. It collapses on a two-column research paper. It liquefies on a table. It can't tell a heading from a paragraph. It has no concept of reading order on a complex page.

Everything beyond this baseline is a project trying harder. There are four serious camps, and none of them sits in the deterministic-structural-frontend cell.


3. The four occupied frontend cells

Cell A: pdf2htmlEX — visual fidelity, no semantics

pdf2htmlEX is a WASM port of an old C++ project. It walks the PDF and emits absolutely-positioned <div>s that visually reproduce the source.

<div style="position:absolute; top:124px; left:88px; font-size:11pt">A table cell</div>
<div style="position:absolute; top:124px; left:240px; font-size:11pt">Another</div>
<div style="position:absolute; top:124px; left:392px; font-size:11pt">Cell</div>

If you want to render the PDF in a browser and let the user select text, this is unbeatable. If you want any semantic structure (a <table>, an <h1>, a paragraph block), you're back to scraping divs by their bounding boxes — the same problem the user started with.

Cell B: tesseract.js — OCR

Render each page to canvas. Run OCR on the canvas. Get text + bounding boxes back.
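That recipe fits in a few lines. A minimal sketch, assuming a pdfjs page object is already in scope:

import Tesseract from 'tesseract.js';

// Rasterize the page, then OCR the raster.
const viewport = page.getViewport({ scale: 2 });
const canvas = document.createElement('canvas');
canvas.width = viewport.width;
canvas.height = viewport.height;
await page.render({ canvasContext: canvas.getContext('2d'), viewport }).promise;

const { data } = await Tesseract.recognize(canvas, 'eng');
console.log(data.text); // a lossy copy of text the PDF may already contain natively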

This is the right answer for scanned PDFs that have no native text layer. It's the wrong answer for digital PDFs that already have perfect text. You're feeding selectable text through an image-to-text model and getting a degraded copy of what was already there. Plus a 2MB WASM payload, plus seconds-per-page latency.

Cell C: transformers.js + layout models — ML-based structural

Load a layout-aware model (DocLayout-YOLO, LayoutLM, or similar) into the browser via ONNX or transformers.js. Render each page to canvas. Run inference. Get back labeled regions: TABLE, TEXT, FIGURE.
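The shape of that recipe, sketched with onnxruntime-web. The model file name and the 'images' input name here are placeholders, and real projects add resize/normalize preprocessing and NMS decoding:

import * as ort from 'onnxruntime-web';

// 'doclayout-yolo.onnx' is a placeholder; the weights are a multi-megabyte download.
const session = await ort.InferenceSession.create('doclayout-yolo.onnx');

// Render page → canvas → normalized NCHW float buffer (preprocessing elided).
const pixels = new Float32Array(1 * 3 * 1024 * 1024);
const input = new ort.Tensor('float32', pixels, [1, 3, 1024, 1024]);

// Input/output names depend on how the model was exported.
const output = await session.run({ images: input });
// Decode output into labeled boxes: TABLE, TEXT, FIGURE.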

This is where the modern industry is heading. It works on weird, varied document types. It generalizes. But it ships multi-megabyte model weights, it rasterizes a text layer that was already perfect, and when it misfires there is no geometry to inspect: the failure modes are opaque.

Cell D: pdfreader, pdf-extract, and friends — text-only Y-clustering

These libraries take getTextContent() items, cluster them by Y position, sort by X, and produce slightly more structured output than the flat-blob recipe.
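The core of that strategy is a dozen lines. A sketch, with the Y-band tolerance reduced to integer rounding for brevity:

// content is the result of page.getTextContent(); item.transform = [a, b, c, d, x, y].
const byY = new Map();
for (const it of content.items) {
  const y = Math.round(it.transform[5]);        // baseline Y
  if (!byY.has(y)) byY.set(y, []);
  byY.get(y).push(it);
}
const lines = [...byY.entries()]
  .sort(([ya], [yb]) => yb - ya)                // PDF user-space Y grows upward
  .map(([, items]) => items
    .sort((a, b) => a.transform[4] - b.transform[4]) // left to right within a line
    .map((it) => it.str)
    .join(' '));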

The fundamental limit: they only consume the text content. They never call getOperatorList(). They cannot see vector lines. They cannot detect a table border, distinguish an underline from a horizontal rule, or recognize a chart axis. Their world is text and only text.

For prose-heavy documents, that's fine. For anything with tables, they degrade to row-smashing.


4. The empty cell, and what fills it

The deterministic-structural-frontend cell asks for a tool that:

  1. Runs entirely in the browser. No server.
  2. Ships no ML model weights. Determinism via geometry.
  3. Reads the operator list, not just the text content. Vector-aware.
  4. Outputs semantic structure: tables with topology, headings, paragraphs, lists, reading order.

To fill it, this pipeline does the following:

4.1 CTM-baked vector segments

// ctmAdapter.js (simplified)
const { fnArray, argsArray } = await page.getOperatorList();
let ctm = [1, 0, 0, 1, 0, 0];       // identity; viewport.transform is composed at emit time
const ctmStack = [];

for (let i = 0; i < fnArray.length; i++) {
  if (fnArray[i] === OPS.save)      ctmStack.push(ctm.slice());
  if (fnArray[i] === OPS.restore)   ctm = ctmStack.pop() ?? ctm;        // guard unbalanced restores
  if (fnArray[i] === OPS.transform) ctm = mulMatrix(ctm, argsArray[i]); // 2×3 affine compose
  if (fnArray[i] === OPS.constructPath) {
    // Walk subpaths, transform each point through CTM × viewport.transform,
    // emit normalized H/V segment records.
  }
}

This is the move that puts the pipeline in a different category from cells B and D. We don't just consume text. We consume the operator list and reconstruct the page's vector skeleton in viewport coordinates. We can see the table borders before any text math runs.
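What "normalized H/V segment records" means concretely, as a sketch; the helper names here are mine, not the repo's:

// Map both endpoints through the baked matrix, keep only near-axis-aligned pieces.
const apply = (m, x, y) => [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];

function toSegment(ctm, x0, y0, x1, y1, eps = 0.5) {
  const [ax, ay] = apply(ctm, x0, y0);
  const [bx, by] = apply(ctm, x1, y1);
  if (Math.abs(ay - by) <= eps)    // horizontal: candidate row border or underline
    return { kind: 'H', y: (ay + by) / 2, x1: Math.min(ax, bx), x2: Math.max(ax, bx) };
  if (Math.abs(ax - bx) <= eps)    // vertical: candidate column border
    return { kind: 'V', x: (ax + bx) / 2, y1: Math.min(ay, by), y2: Math.max(ay, by) };
  return null;                     // diagonal: not lattice material
}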

4.2 Region-typed classification before extraction

Most pipelines run sequential passes: find tables, find paragraphs, find lists. Each pass works against the full text pool. Then you deduplicate at the end and hope the passes didn't disagree.

This pipeline does the opposite. Classify regions first, then route scoped text into each region's specialist extractor. The mechanism is a single assignedTextIndices set:

for (const lattice of lattices) {
  const tableTextIndices = [];
  for (const tm of textMeta) {
    if (assignedTextIndices.has(tm.idx)) continue; // skip consumed
    if (insideBBox(tm.vx, tm.vy, lattice.bbox, tablePad)) {
      tableTextIndices.push(tm.idx);
      assignedTextIndices.add(tm.idx); // mark as consumed
    }
  }
  regions.push({ type: TABLE, lattice, textItemIndices: tableTextIndices });
}
// later: paragraph/heading/list passes only see un-consumed text

The invariant is: a text item belongs to exactly one region. No leakage by construction. The bug class of "table text accidentally in a paragraph" is preempted, not patched.

4.3 Underline-vs-border discrimination

A naive lattice reconstructor sees every horizontal line and tries to use it as a table border. This produces phantom 1×1 tables under every underlined heading.

We classify each H-segment against the text baselines using KD-tree-style proximity:

for (const h of hSegs) {
  const hY    = (h.y1 + h.y2) / 2;
  const hXMin = Math.min(h.x1, h.x2);
  const hXMax = Math.max(h.x1, h.x2);
  const hLen  = hXMax - hXMin;
  for (const tm of textMeta) {
    const yDist = hY - tm.vy;               // positive: the line sits below the baseline
    if (yDist >= -1 && yDist <= 5 &&
        tm.vx <= hXMax + 2 && (tm.vx + tm.vWidth) >= hXMin - 2 &&  // X-spans overlap
        hLen < tm.vWidth * 2.5) {           // not wildly wider than the text run
      underlineSegIds.add(h.id);
      break;
    }
  }
}

If a horizontal line sits 0–5px below a text baseline with overlapping X-span, it's an underline. Tag it. Remove from the table-detection pool. ~99% of phantom tables disappear.

I have not seen another browser-side PDF extractor that does this. Tabula has equivalents on the backend. On the frontend, every other tool I've audited just hands all H-lines to the lattice and lives with phantom tables.

4.4 Topological cell-merge inference

Naive table extractors detect cell merges by visual whitespace heuristics ("if these two cells have no visible boundary between their text, they're merged"). This is unreliable. Tables with thin internal borders look unmerged but are; tables with wide cell padding look merged but aren't.

This pipeline asks the geometry directly:

function vLinePresent(vLines, x, yA, yB, eps) {
  return vLines.some(l =>
    Math.abs(l.x - x) <= eps &&
    l.yMin <= yA + eps &&
    l.yMax >= yB - eps
  );
}

Is there an actual merged vertical-line record at this X position spanning [yA, yB]? If yes, the cell boundary exists; the cells are separate. If no, extend the colspan. Topological, not visual.
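A sketch of the resulting colspan walk, reusing vLinePresent and the cols/rows grid-line arrays that appear in 4.5 below:

// Absorb columns to the right of cell (ri, ci) until a real vertical border appears.
function colspanAt(vLines, cols, rows, ri, ci, eps = 1.5) {
  let span = 1;
  while (ci + span < cols.length - 1 &&                        // interior lines only
         !vLinePresent(vLines, cols[ci + span], rows[ri], rows[ri + 1], eps)) {
    span++;
  }
  return span;
}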

4.5 Nearest-cell Euclidean snap

Strict point-in-box assignment drops text whose origin is 0.1px outside a cell, which is common because PDF rendering coordinates have jitter. We use Euclidean distance to the nearest cell center with a 15px snap threshold:

let bestR = -1, bestC = -1, minDist = Infinity;
for (let ri = 0; ri < numRows; ri++) {
  for (let ci = 0; ci < numCols; ci++) {
    // Clamped distance from the text origin (sx, sy) to the cell rectangle:
    // zero inside the cell, Euclidean gap outside it.
    const dx = Math.max(cols[ci] - sx, 0, sx - cols[ci + 1]);
    const dy = Math.max(rows[ri] - sy, 0, sy - rows[ri + 1]);
    const dist = Math.sqrt(dx * dx + dy * dy);
    if (dist < minDist) { minDist = dist; bestR = ri; bestC = ci; }
  }
}
if (minDist < 15) cells[bestR][bestC].push(tm); // snap the text item into the nearest cell

Magnetic, not literal. Coordinate jitter doesn't drop data.

4.6 Worker-isolated full pipeline

Most browser PDF extractors run on the main thread. The geometry pipeline here loads PDF.js as a nested worker inside the geometry worker. CTM baking, lattice reconstruction, classification, assembly — all off the main thread. The UI stays responsive on a 200-page document.
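The wiring, sketched with hypothetical file names; the real repo's module layout may differ:

// main.js: the UI thread talks to exactly one worker and stays free to paint.
const geo = new Worker(new URL('./geometryWorker.js', import.meta.url), { type: 'module' });
geo.postMessage({ type: 'extract', bytes }, [bytes.buffer]); // transfer, don't clone

// geometryWorker.js: PDF.js spawns its own nested worker in here.
import * as pdfjsLib from 'pdfjs-dist';
pdfjsLib.GlobalWorkerOptions.workerSrc =
  new URL('pdfjs-dist/build/pdf.worker.min.mjs', import.meta.url).href;

self.onmessage = async ({ data }) => {
  const pdf = await pdfjsLib.getDocument({ data: data.bytes }).promise;
  // CTM baking, lattice reconstruction, classification, assembly all run here.
};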

4.7 Per-page streaming

Naive extractors accumulate the whole document into one structured-clone payload at the end. That dies on large PDFs with stack-overflow errors in postMessage. We emit per-page 'page' messages from the worker, the main thread accumulates incrementally, and the UI can show progressive results.

self.postMessage({
  type: 'page',
  page: p,
  html: result.html,
  text: result.text.trim(),
  tables: result.tableCount,
});
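On the receiving end the main thread just appends. A sketch, where geo and outputEl are my stand-in names, not the repo's:

const pages = [];
geo.onmessage = ({ data }) => {
  if (data.type !== 'page') return;
  pages[data.page - 1] = data;                           // accumulate incrementally
  outputEl.insertAdjacentHTML('beforeend', data.html);   // progressive display
};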

Not algorithmic novelty. Engineering discipline that lets the architecture survive 76-page technical manuals.

4.8 VisualGridMapper as a downstream operator

The output isn't a dead <table> string. It's a live HTML table that we can immediately remap into a Cartesian array using VisualGridMapper:

const mapper = new VisualGridMapper(table);
// mapper.grid[row][col] now holds origin/spanned cell metadata.
// Transposes, merges, splits all become matrix operations.

This is the bridge into the table-formatter half of the platform. Other extractors stop at "here's a <table>." We hand the user something they can keep manipulating mathematically.
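For instance, a transpose under this model is a pure index swap. A sketch, assuming mapper.grid[row][col] holds the cell-metadata object described in the comment above:

// Cell (r, c) moves to (c, r); row and column spans swap roles on re-render.
const transposed = [];
for (let r = 0; r < mapper.grid.length; r++) {
  for (let c = 0; c < mapper.grid[r].length; c++) {
    (transposed[c] ??= [])[r] = mapper.grid[r][c];
  }
}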


5. What's borrowed and what's new

Worth being honest about which pieces of this are original engineering versus academic standard:

Borrowed: lattice reconstruction from H/V segments, nearest-neighbor proximity queries, Y-band paragraph clustering. As the opening said, the math is the same math the backend tools use.

Original to this pipeline (or unusual in the niche): in-browser CTM baking against the operator list, underline-vs-border discrimination, region-typed single-consumption text routing, topological merge inference, the nearest-cell snap, and worker isolation with per-page streaming into the VisualGridMapper bridge. None of these pieces is a new algorithm. The pipeline is a composition. The composition is the contribution.


6. Why this cell stayed empty

If the deterministic-structural-frontend cell is valuable, why hadn't anyone filled it?

Three reasons, in order of how convincing each one is.

Economics push toward backend. If you have a use case that needs structural PDF extraction, you almost certainly have a server. The serious tools live in Python and have for a decade. There's no incentive to port them unless you specifically need data to stay on the client device — which is a real but niche requirement.

Existing frontend tools are anchored to other quadrants. pdf2htmlEX is committed to visual fidelity. tesseract.js is committed to OCR. The transformers.js camp is committed to ML generalization. Each is well-architected for its quadrant and would require an architectural rewrite to drift into the deterministic-structural cell. Nobody had a reason to do that work.

The pieces are scattered. PDF.js gives you the operator list but assumes you'll use it for rendering. Lattice algorithms are described in papers, not packaged as npm modules. KD-tree libraries assume preformatted data. Web Worker isolation has its own ergonomic learning curve. Climbing the staircase to assemble all of these is real engineering work, and unless you have a strong reason to be in this exact cell, the cost-benefit doesn't pencil.

We had a reason. The platform we're building is browser-native by commitment, not accident. Every other tool in our pipeline runs in the browser. Sending PDFs to a server for structural extraction would have broken the architectural model. So we climbed.


7. The lesson above the niche

There's a generalization worth saying out loud, because it applies far beyond PDF tooling.

When you wonder whether you're reinventing a wheel, do the survey. But ask the right question. The question is not "has anyone solved this problem?" — the answer to that is almost always yes, somewhere. The question is:

What set of constraints does my version satisfy that nobody else's version satisfies?

Constraints are commitments. No backend. No model weights. Worker isolation. Deterministic output. Per-page streaming. Open source. Each one is a deliberate refusal of a path other people took.

The intersection of constraints is where new niches live. The math you use inside that intersection is often the same math everyone else uses. That's fine. The originality isn't in the math. It's in the negative space — the things you said no to.

The pipeline isn't different because the algorithms are different. It's different because of where it runs and what it refuses to be.


The full pipeline is open source as part of the GINEXYS project. If you find a fifth camp I missed, or if you've built something that fills the empty quadrant differently, the issue tracker is open. I'm specifically curious whether anyone else has implemented in-browser CTM baking against pdfjs-dist's operator list — that piece felt the loneliest in my survey.
