Borrowed Math, Original Niches: What's New When the Algorithms Aren't
TLDR: When you wonder whether you're reinventing a wheel, the right question isn't "has anyone solved this problem?" The answer to that is almost always yes, somewhere. The right question is: "what set of constraints does my version satisfy that nobody else's version satisfies?" Constraints are commitments. The intersection of constraints is where new niches live. The math inside the niche is often the same math everyone else uses. The originality is in the negative space, the things you said no to. This is part 3 of a series on the design space of frontend PDF extraction.
Repo: tools/pdf-processor
What's borrowed
It's worth being honest about which pieces of this pipeline are original engineering and which are academic standard.
- The lattice algorithm itself. Intersection clustering, row/column projection. Same as Tabula, Camelot, pdfplumber.
- Y-band paragraph clustering. pdfminer-style, in academic literature since the 1990s.
- XY-cut column detection. Known since the 1980s.
- KD-tree spatial indexing. Textbook.
- DOMPurify, jQuery, Monaco. Off-the-shelf.
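To make "borrowed" concrete, here is a minimal sketch of Y-band clustering, the simplest item on the list. It is not the pipeline's actual code: the TextItem shape, the clusterIntoBands name, and the 0.5 tolerance factor are all assumptions made for illustration, and y is assumed to grow downward as in viewport space.

```typescript
// Hypothetical text item; field names are assumptions, not the pipeline's types.
interface TextItem {
  text: string;
  x: number;      // left edge
  y: number;      // vertical position, assumed to grow downward
  height: number; // glyph height, used to scale the tolerance
}

// Group items into horizontal bands: an item joins the current band when its
// y is within tolerance * height of the band's first item, else it starts a new band.
function clusterIntoBands(items: TextItem[], tolerance = 0.5): TextItem[][] {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const bands: TextItem[][] = [];
  let bandY = 0;
  for (const item of sorted) {
    const current = bands.length > 0 ? bands[bands.length - 1] : undefined;
    if (current && Math.abs(item.y - bandY) <= tolerance * item.height) {
      current.push(item);
    } else {
      bands.push([item]);
      bandY = item.y;
    }
  }
  // Restore reading order inside each band.
  for (const band of bands) band.sort((a, b) => a.x - b.x);
  return bands;
}

const demoBands = clusterIntoBands([
  { text: "world", x: 50, y: 100, height: 10 },
  { text: "Hello", x: 10, y: 101, height: 10 },
  { text: "Next", x: 10, y: 120, height: 10 },
]);
console.log(demoBands.map((band) => band.map((i) => i.text).join(" ")));
```

Nothing in this is novel, which is the point: the same few lines appear, in one form or another, in most text-layout extractors.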
What's new
- The full assembly running in a Web Worker on top of PDF.js as a nested worker.
- The non-overlapping-region invariant via assignedTextIndices.
- The underline-discrimination heuristic with the specific 0-5px and 2.5x-width thresholds.
- The coordinate-space discipline: storing both vWidth/vFont (viewport) and width/fontSize (PDF points) on every text-meta record, with explicit comments about which to use where.
- The per-page streaming pattern that survives 100+ page documents.
- The integration with VisualGridMapper for downstream mathematical manipulation.
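Two of these items are easier to see in code. The sketch below is a guess at the shape, not the repo's implementation: only assignedTextIndices, vWidth/vFont, and width/fontSize come from the list above. The x and y fields, the Box and Region types, the claimRegion function, and the centre-point containment test are assumptions introduced for illustration.

```typescript
// Text-meta record carrying both coordinate spaces.
// x and y are hypothetical: assumed to be the viewport-space top-left corner.
interface TextMeta {
  text: string;
  x: number;
  y: number;
  vWidth: number;   // viewport units: use for hit-testing against drawn geometry
  vFont: number;    // viewport units
  width: number;    // PDF points: use for output that must be zoom-independent
  fontSize: number; // PDF points
}

interface Box { x0: number; y0: number; x1: number; y1: number; }

interface Region {
  kind: "table" | "paragraph";
  box: Box;
  textIndices: number[];
}

// Claim every unassigned text item whose viewport-space centre lies in the box.
// Because claimed indices go into the shared set, no item can land in two regions.
function claimRegion(
  kind: Region["kind"],
  box: Box,
  items: TextMeta[],
  assignedTextIndices: Set<number>,
): Region {
  const textIndices: number[] = [];
  items.forEach((item, index) => {
    if (assignedTextIndices.has(index)) return; // already owned by an earlier region
    const cx = item.x + item.vWidth / 2;
    const cy = item.y + item.vFont / 2;
    if (cx >= box.x0 && cx <= box.x1 && cy >= box.y0 && cy <= box.y1) {
      assignedTextIndices.add(index);
      textIndices.push(index);
    }
  });
  return { kind, box, textIndices };
}
```

Under these assumptions, the order of calls decides ownership: claiming table regions first and paragraph regions second means a paragraph pass over the whole page picks up only the text that no table took.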
Why the empty cell stayed empty
If the deterministic-structural-frontend cell is valuable, why hadn't anyone filled it?
Three reasons, in order of how convincing each one is.
Economics push toward backend. If you have a use case that needs structural PDF extraction, you almost certainly have a server. The serious tools live in Python and have for a decade. There's no incentive to port them unless you specifically need data to stay on the client device, which is a real but niche requirement.
Existing frontend tools are anchored to other quadrants. pdf2htmlEX is committed to visual fidelity. tesseract.js is committed to OCR. The transformers.js camp is committed to ML generalization. Each is well-architected for its quadrant and would require an architectural rewrite to drift into the deterministic-structural cell. Nobody had a reason to do that work.
The pieces are scattered. PDF.js gives you the operator list but assumes you'll use it for rendering. Lattice algorithms are described in papers, not packaged as npm modules. KD-tree libraries assume preformatted data. Web Worker isolation has its own ergonomic learning curve. Climbing the staircase to assemble all of these is real engineering work, and unless you have a strong reason to be in this exact cell, the cost-benefit doesn't pencil out.
We had a reason. The platform we're building is browser-native by commitment, not accident. Every other tool in our pipeline runs in the browser. Sending PDFs to a server for structural extraction would have broken the architectural model. So we climbed.
The lesson above the niche
There's a generalization worth saying out loud, because it applies far beyond PDF tooling.
When you wonder whether you're reinventing a wheel, do the survey. But ask the right question. The question is not "has anyone solved this problem?" The answer to that is almost always yes, somewhere.
The question is:
What set of constraints does my version satisfy that nobody else's version satisfies?
Constraints are commitments. No backend. No model weights. Worker isolation. Deterministic output. Per-page streaming. Open source. Each one is a deliberate refusal of a path other people took.
The intersection of constraints is where new niches live. The math you use inside that intersection is often the same math everyone else uses. That's fine. The originality isn't in the math. It's in the negative space, the things you said no to.
The pipeline isn't different because the algorithms are different. It's different because of where it runs and what it refuses to be.
The full pipeline is open source. If you find a fifth camp we missed, or if you've built something that fills the empty quadrant differently, the issue tracker is open. I'm specifically curious whether anyone else has implemented in-browser CTM baking against pdfjs-dist's operator list. That piece felt the loneliest in the survey.