Pdf Processor

Eight Architectural Moves That Fill the Empty Quadrant of Frontend PDF Extraction

2026-07-11

TLDR: The deterministic-structural-frontend cell asks for a tool that runs entirely in the browser, ships no ML weights, reads the operator list (not just text), and outputs semantic structure. Filling it required eight specific architectural moves. Each one is a deliberate departure from how other frontend PDF extractors work. None of them are clever in isolation. The composition is what fills the niche. This is part 2 of a series on the design space of frontend PDF extraction.

Repo: tools/pdf-processor

The four constraints

The empty cell from part 1 asks for a tool that:

Runs entirely in the browser. No server.
Ships no ML model weights. Determinism via geometry.
Reads the operator list, not just the text content. Vector-aware.
Outputs semantic structure: tables with topology, headings, paragraphs, lists, reading order.

Filling it required these eight moves.

1. CTM-baked vector segments

// ctmAdapter.js (simplified)
for (let i = 0; i < fnArray.length; i++) {
  if (fnArray[i] === OPS.save)    ctmStack.push(ctm.slice());
  if (fnArray[i] === OPS.restore) ctm = ctmStack.pop();
  if (fnArray[i] === OPS.transform) ctm = mulMatrix(ctm, argsArray[i]);
  if (fnArray[i] === OPS.constructPath) {
    // Walk subpaths, transform each point through CTM × viewport.transform,
    // emit normalized H/V segment records.
  }
}

This is the move that puts the pipeline in a different category from text-only and OCR cells. We don't just consume text. We consume the operator list and reconstruct the page's vector skeleton in viewport coordinates. We can see the table borders before any text math runs.
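To make the "transform each point through CTM × viewport.transform" step concrete, here is a minimal sketch of the matrix math and the H/V normalization. The helper names (applyMatrix, toSegment), the record shape, and the 0.5px axis tolerance are assumptions for illustration, not the actual ctmAdapter.js code. Matrices use the PDF six-number form [a, b, c, d, e, f].

```javascript
// Hypothetical helpers: names, record shape, and tolerance are assumptions.
// Compose two affine matrices in PDF [a, b, c, d, e, f] form.
// The result applies m2 first, then m1.
function mulMatrix(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Map a point through an affine matrix.
function applyMatrix(m, [x, y]) {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

// Transform a path segment and keep it only if it is axis-aligned.
function toSegment(p1, p2, m, tol = 0.5) {
  const [x1, y1] = applyMatrix(m, p1);
  const [x2, y2] = applyMatrix(m, p2);
  if (Math.abs(y1 - y2) <= tol) {
    return { kind: 'H', y: (y1 + y2) / 2, xMin: Math.min(x1, x2), xMax: Math.max(x1, x2) };
  }
  if (Math.abs(x1 - x2) <= tol) {
    return { kind: 'V', x: (x1 + x2) / 2, yMin: Math.min(y1, y2), yMax: Math.max(y1, y2) };
  }
  return null; // diagonal: not a lattice candidate
}

// Example: a Y-flipping viewport on a 792pt-tall page, and a CTM that translates by (100, 200).
const viewportTransform = [1, 0, 0, -1, 0, 792];
const ctm = [1, 0, 0, 1, 100, 200];
const full = mulMatrix(viewportTransform, ctm);
const seg = toSegment([0, 0], [50, 0], full);
// seg is { kind: 'H', y: 592, xMin: 100, xMax: 150 }
```

Baking the matrix into each segment once means every later stage can compare lines and text in the same coordinate space without carrying the graphics state along.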

2. Region-typed classification before extraction

Most pipelines run sequential passes: find tables, find paragraphs, find lists. Each pass works against the full text pool. Then you deduplicate at the end and hope the passes didn't disagree.

This pipeline does the opposite. Classify regions first, then route scoped text into each region's specialist extractor. The mechanism is a single assignedTextIndices set:

for (const lattice of lattices) {
  const tableTextIndices = [];
  for (const tm of textMeta) {
    if (assignedTextIndices.has(tm.idx)) continue; // skip consumed
    if (insideBBox(tm.vx, tm.vy, lattice.bbox, tablePad)) {
      tableTextIndices.push(tm.idx);
      assignedTextIndices.add(tm.idx); // mark as consumed
    }
  }
  regions.push({ type: TABLE, lattice, textItemIndices: tableTextIndices });
}
// later: paragraph/heading/list passes only see un-consumed text

The invariant is: a text item belongs to exactly one region. No leakage by construction. The bug class of "table text accidentally in a paragraph" is preempted, not patched.
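The two pieces the snippet above leaves out are the padded hit test and the filter the later passes use. A rough sketch of both follows, assuming a bbox shaped like { xMin, yMin, xMax, yMax } and a hypothetical unassignedText helper; the real shapes in the repo may differ.

```javascript
// Assumed bbox shape: { xMin, yMin, xMax, yMax } in viewport coordinates.
function insideBBox(x, y, bbox, pad = 0) {
  return x >= bbox.xMin - pad && x <= bbox.xMax + pad &&
         y >= bbox.yMin - pad && y <= bbox.yMax + pad;
}

// Hypothetical helper: what the paragraph/heading/list passes get to see.
function unassignedText(textMeta, assignedTextIndices) {
  return textMeta.filter(tm => !assignedTextIndices.has(tm.idx));
}

// Example: one table region claims two of three text items.
const textMeta = [
  { idx: 0, vx: 10, vy: 10 },   // inside the table
  { idx: 1, vx: 52, vy: 30 },   // 2px outside, caught by the pad
  { idx: 2, vx: 10, vy: 200 },  // body text below the table
];
const assigned = new Set();
const tableBBox = { xMin: 0, yMin: 0, xMax: 50, yMax: 50 };
for (const tm of textMeta) {
  if (insideBBox(tm.vx, tm.vy, tableBBox, 3)) assigned.add(tm.idx);
}
const leftover = unassignedText(textMeta, assigned);
// leftover contains only the item with idx 2
```

Because every later pass starts from the leftover pool, a second extractor never sees an item that an earlier region already claimed.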

3. Underline-vs-border discrimination

A naive lattice reconstructor sees every horizontal line and tries to use it as a table border. This produces phantom 1x1 tables under every underlined heading.

We classify each H-segment against the text baselines using KD-tree-style proximity. The simplified version below writes the test itself as a plain nested scan:

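for (const h of hSegs) {
  // Assumes each segment record carries x1/x2 alongside y1/y2.
  const hXMin = Math.min(h.x1, h.x2);
  const hXMax = Math.max(h.x1, h.x2);
  const hLen = hXMax - hXMin;
  const hY = (h.y1 + h.y2) / 2;
  for (const tm of textMeta) {
    const yDist = hY - tm.vy; // positive: the line sits below the baseline
    if (yDist >= -1 && yDist <= 5 &&                                   // near the baseline
        tm.vx <= hXMax + 2 && (tm.vx + tm.vWidth) >= hXMin - 2 &&      // X-spans overlap
        hLen < tm.vWidth * 2.5) {                                      // not much longer than the text
      underlineSegIds.add(h.id);
      break;
    }
  }
}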

If a horizontal line sits within 5px below a text baseline (with 1px of tolerance above it), overlaps the text's X-span, and is shorter than 2.5 times the text width, it's an underline. Tag it and remove it from the table-detection pool. About 99 percent of phantom tables disappear.

I have not seen another browser-side PDF extractor that does this. Tabula has equivalents on the backend. On the frontend, every other tool I've audited just hands all H-lines to the lattice and lives with phantom tables.

4. Topological cell-merge inference

Naive table extractors detect cell merges by visual whitespace heuristics ("if these two cells have no visible boundary between their text, they're merged"). This is unreliable. Cells separated by thin internal borders can look merged when they aren't. Merged cells with wide padding can look separate when they are in fact one cell.

This pipeline asks the geometry directly:

function vLinePresent(vLines, x, yA, yB, eps) {
  return vLines.some(l =>
    Math.abs(l.x - x) <= eps &&
    l.yMin <= yA + eps &&
    l.yMax >= yB - eps
  );
}

Is there an actual merged vertical-line record at this X position spanning [yA, yB]? If yes, the cell boundary exists, the cells are separate. If no, extend the colspan. Topological, not visual.
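As a sketch of how "extend the colspan" might be driven by that predicate: walk right from a cell's starting column and keep widening while the interior boundary has no covering vertical line. The inferColspan name, the colXs boundary array, and the 1px default eps are assumptions for illustration. vLinePresent is repeated so the block stands alone.

```javascript
function vLinePresent(vLines, x, yA, yB, eps) {
  return vLines.some(l =>
    Math.abs(l.x - x) <= eps &&
    l.yMin <= yA + eps &&
    l.yMax >= yB - eps
  );
}

// Hypothetical driver. colXs holds sorted column boundary X positions,
// so a grid with N columns has N + 1 entries. The row spans [yTop, yBottom].
function inferColspan(vLines, colXs, ci, yTop, yBottom, eps = 1) {
  let span = 1;
  // colXs[ci + span] is the boundary to the right of the current span.
  // The outermost boundary is never tested: the table edge always closes the cell.
  while (ci + span < colXs.length - 1 &&
         !vLinePresent(vLines, colXs[ci + span], yTop, yBottom, eps)) {
    span++;
  }
  return span;
}

// Example: 3 columns, 2 rows. The line at x=50 only covers the second row,
// so the first row's first cell spans two columns.
const colXs = [0, 50, 100, 150];
const vLines = [
  { x: 0, yMin: 0, yMax: 40 },
  { x: 50, yMin: 20, yMax: 40 },
  { x: 100, yMin: 0, yMax: 40 },
  { x: 150, yMin: 0, yMax: 40 },
];
const topSpan = inferColspan(vLines, colXs, 0, 0, 20);     // 2
const bottomSpan = inferColspan(vLines, colXs, 0, 20, 40); // 1
```

The same idea applies to rowspans with horizontal lines and row boundaries swapped in.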

5. Nearest-cell Euclidean snap

Strict point-in-box assignment drops text whose origin is 0.1px outside a cell, which is common because PDF rendering coordinates have jitter. Instead, we compute the Euclidean distance from the text origin to each cell rectangle (zero when the point is inside) and snap to the nearest cell if it lies within 15px:

let bestR = -1, bestC = -1, minDist = Infinity;
for (let ri = 0; ri < numRows; ri++) {
  for (let ci = 0; ci < numCols; ci++) {
    const dx = Math.max(cols[ci] - sx, 0, sx - cols[ci+1]);
    const dy = Math.max(rows[ri] - sy, 0, sy - rows[ri+1]);
    const dist = Math.sqrt(dx * dx + dy * dy);
    if (dist < minDist) { minDist = dist; bestR = ri; bestC = ci; }
  }
}
if (minDist < 15) cells[bestR][bestC].push(...);

Magnetic, not literal. Coordinate jitter doesn't drop data.

6. Worker-isolated full pipeline

Most browser PDF extractors run on the main thread. The geometry pipeline here loads PDF.js as a nested worker inside the geometry worker. CTM baking, lattice reconstruction, classification, assembly: all off the main thread. The UI stays responsive on a 200-page document.

7. Per-page streaming

Naive extractors accumulate the whole document into one structured-clone payload at the end. That dies on large PDFs with stack-overflow errors in postMessage.

We emit per-page 'page' messages from the worker, the main thread accumulates incrementally, and the UI can show progressive results.

self.postMessage({
  type: 'page',
  page: p,
  html: result.html,
  text: result.text.trim(),
  tables: result.tableCount,
});

Not algorithmic novelty. Engineering discipline that lets the architecture survive 76-page technical manuals.
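On the receiving side, incremental accumulation can be as small as the sketch below. createPageAccumulator, its callback signature, and the join separators are assumptions for illustration; only the 'page' message fields come from the worker snippet above.

```javascript
// Hypothetical main-thread accumulator for the worker's 'page' messages.
function createPageAccumulator(onProgress) {
  const pages = new Map(); // page number -> { html, text, tables }
  return {
    handle(msg) {
      if (msg.type !== 'page') return; // ignore other message types
      pages.set(msg.page, { html: msg.html, text: msg.text, tables: msg.tables });
      if (onProgress) onProgress(pages.size);
    },
    result() {
      // Sort by page number so out-of-order arrival does not scramble the output.
      const ordered = [...pages.keys()].sort((a, b) => a - b).map(p => pages.get(p));
      return {
        html: ordered.map(p => p.html).join('\n'),
        text: ordered.map(p => p.text).join('\n\n'),
        tables: ordered.reduce((sum, p) => sum + p.tables, 0),
      };
    },
  };
}

// Example with plain objects standing in for worker messages.
let seen = 0;
const acc = createPageAccumulator(count => { seen = count; });
acc.handle({ type: 'page', page: 2, html: '<p>two</p>', text: 'two', tables: 1 });
acc.handle({ type: 'page', page: 1, html: '<p>one</p>', text: 'one', tables: 0 });
const doc = acc.result();
// In the browser the wiring would be: worker.onmessage = (e) => acc.handle(e.data);
```

Each message stays small, so no single structured-clone payload ever has to carry the whole document.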

8. VisualGridMapper as a downstream operator

The output isn't a dead <table> string. It's a live HTML table that we can immediately remap into a Cartesian array using VisualGridMapper:

const mapper = new VisualGridMapper(table);
// mapper.grid[row][col] now holds origin/spanned cell metadata.
// Transposes, merges, splits all become matrix operations.

This is the bridge into the table-formatter half of the platform. Other extractors stop at "here's a <table>." We hand the user something they can keep manipulating mathematically.
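To illustrate what "remap into a Cartesian array" means, here is a sketch of the general technique on plain cell descriptors instead of DOM nodes. This is not the actual VisualGridMapper implementation; buildGrid, the isOrigin flag, and the input shape are assumptions.

```javascript
// Hypothetical stand-in for the span-to-grid mapping.
// rows: array of rows, each an array of { text, rowSpan?, colSpan? }.
function buildGrid(rows) {
  const grid = [];
  rows.forEach((cells, r) => {
    grid[r] = grid[r] || [];
    let c = 0;
    for (const cell of cells) {
      // Skip slots already claimed by a rowspan from an earlier row.
      while (grid[r][c] !== undefined) c++;
      const rowSpan = cell.rowSpan || 1;
      const colSpan = cell.colSpan || 1;
      for (let dr = 0; dr < rowSpan; dr++) {
        grid[r + dr] = grid[r + dr] || [];
        for (let dc = 0; dc < colSpan; dc++) {
          // Every covered slot points back at the cell; only one slot is the origin.
          grid[r + dr][c + dc] = { cell, isOrigin: dr === 0 && dc === 0 };
        }
      }
      c += colSpan;
    }
  });
  return grid;
}

// Example: "A" spans two columns, "B" spans two rows.
const grid = buildGrid([
  [{ text: 'A', colSpan: 2 }, { text: 'B', rowSpan: 2 }],
  [{ text: 'C' }, { text: 'D' }],
]);
// grid[1][2].cell.text is 'B' and grid[1][2].isOrigin is false
```

Once every (row, col) slot resolves to a cell, a transpose is an index swap and a merge is a rewrite of which slots share an origin.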

What the eight moves add up to

Each move is small. None of them are clever in isolation. CTM walks are textbook. Region classification is from the 1990s. KD-tree proximity is undergraduate computer science.

The composition is what fills the niche. Take any move away and you're back in one of the four occupied cells. CTM walking without region classification gives you visual fidelity but no semantics. Region classification without vector awareness gives you text-only Y-clustering. Worker isolation without per-page streaming chokes on large documents.

The contribution isn't algorithmic. It's architectural.

Part 3 is honest about which pieces are borrowed and which are unusual, and steps back to the broader lesson about constraints and niches.
