2026-07-11

Hot Take: Most PDF Extractors Are Using the Third-Best API Available

TLDR: PDF.js exposes three data sources at three fidelity levels. The industry default is the lowest-fidelity one, a convenience view over the paint stream. This is not laziness; there are real reasons it happened. But it is the root cause of why most frontend PDF extraction breaks on academic papers, publications, and anything that isn't a corporate report.

The Hierarchy Nobody Talks About

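When people say "PDF extraction," they usually mean getTextContent() or its equivalent in another library: text items, positions, advance widths. pdf-parse and almost every browser-side PDF tool read this PDF.js API directly. pdfplumber and PyMuPDF do not use PDF.js, but they default to the same kind of positioned-text layer.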

Here is what getTextContent() actually is: a derived, post-processed view of the same paint stream that getOperatorList() exposes. PDF.js collects the text paint operators from the page's content stream, applies the current transformation matrix (CTM), and packages the results. It is not reading a different part of the PDF. It is giving you a processed version of data that is already available in a more complete form.

Above that: getStructTree(). Not derived from the paint stream at all. It reads the logical structure tree, which hangs off the StructTreeRoot entry in the document catalog. Tables, paragraphs, headings, figures, formulas. In a well-tagged document, every glyph run is tagged with its semantic role and linked to the paint stream via Marked Content IDs (MCIDs).

The hierarchy is:

getStructTree()     # what the document means
getOperatorList()   # what the document draws
getTextContent()    # a filtered view of what the document draws

Most tools use the third one.
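To make the hierarchy concrete, here is a minimal sketch that asks one page for all three sources. It assumes `page` is a PDF.js page proxy (the object you get from `pdf.getPage(n)`); the function name `readTiers` and the summary fields are illustrative, not part of PDF.js.

```javascript
// Sketch: fetch all three data sources for a single page.
// Assumption: `page` behaves like a PDF.js PDFPageProxy.
async function readTiers(page) {
  const [structTree, opList, textContent] = await Promise.all([
    page.getStructTree(),   // logical structure; null when the page is untagged
    page.getOperatorList(), // { fnArray, argsArray }: everything the page draws
    page.getTextContent(),  // { items, styles }: positioned text only
  ]);
  return {
    tagged: structTree !== null && structTree !== undefined,
    opCount: opList.fnArray.length,
    textItemCount: textContent.items.length,
  };
}
```

On an untagged page, `tagged` comes back false while the other two counts are still populated, which is exactly the situation where a tool has no choice but to work from the paint stream.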

Why This Happened

There are real reasons getTextContent() became the default:

It is good enough for 80% of documents. Corporate reports, legal briefs, and simple technical manuals have straightforward text flows. getTextContent() gives you positioned text items and that is enough to reconstruct paragraphs and headers.

The struct tree is frequently wrong. Word exports tag table cells as <P>. InDesign creates arbitrary nesting that reflects layer creation order, not reading order. A tool that trusts the struct tree on arbitrary input will fail on a significant fraction of documents.

The MCID join is not automatic. PDF.js does not give you "text item → struct tree node" in one call. You have to walk the operator list, push onto an MCID stack at each BMC/BDC, pop at each EMC, record the current MCID for each text paint op, and join that to the struct tree. That is non-trivial to implement correctly.

Toolchain inertia. PDFBox, pdfminer, and the other foundational tools are 10–15 years old. They prioritized the text content API. Everything built on top of them inherited the same priority.

These are valid reasons. They are also not the same as "getTextContent() is correct."
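The MCID walk described above can be sketched roughly as follows. This is a sketch under stated assumptions, not PDF.js's own code: the `OPS` object is a stand-in for the operator table exported by pdfjs-dist (the names match, the numeric values are invented for the demo), and the argument shape `[tag, mcid]` for `beginMarkedContentProps` is an assumption you should check against the PDF.js version you run. Only `showText` is treated as a text paint op.

```javascript
// Hypothetical stand-in for the OPS table from pdfjs-dist.
// The numeric values are made up for this demo.
const OPS = {
  beginMarkedContent: 1,      // BMC
  beginMarkedContentProps: 2, // BDC
  endMarkedContent: 3,        // EMC
  showText: 4,
};

// For each text paint op, record the innermost enclosing MCID,
// or null when the op sits outside any MCID-bearing sequence.
function mapTextOpsToMcid(opList, ops = OPS) {
  const stack = [];  // one entry per open marked-content sequence
  const result = []; // [{ opIndex, mcid }]
  for (let i = 0; i < opList.fnArray.length; i++) {
    const fn = opList.fnArray[i];
    const args = opList.argsArray[i];
    if (fn === ops.beginMarkedContent) {
      stack.push(null); // BMC has no property list, so no MCID
    } else if (fn === ops.beginMarkedContentProps) {
      // Assumption: args is [tag, mcid], with mcid null when absent.
      stack.push(args && Number.isInteger(args[1]) ? args[1] : null);
    } else if (fn === ops.endMarkedContent) {
      stack.pop(); // EMC closes the innermost open sequence
    } else if (fn === ops.showText) {
      let mcid = null;
      for (let j = stack.length - 1; j >= 0; j--) {
        if (stack[j] !== null) { mcid = stack[j]; break; }
      }
      result.push({ opIndex: i, mcid });
    }
  }
  return result;
}

// BDC(P, 7) text BMC(Span) text EMC EMC text
const demoOps = {
  fnArray: [2, 4, 1, 4, 3, 3, 4],
  argsArray: [['P', 7], [[]], ['Span'], [[]], null, null, [[]]],
};
console.log(mapTextOpsToMcid(demoOps));
// → [{ opIndex: 1, mcid: 7 }, { opIndex: 3, mcid: 7 }, { opIndex: 6, mcid: null }]
```

The remaining step, matching each MCID to a struct tree leaf, depends on how the leaves are keyed in your PDF.js version, so it is left out here.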

What You Miss

When you use only getTextContent(), you miss:

Table structure. The struct tree gives you Table → TR → TD directly. getTextContent() gives you positioned text items that happen to be inside table cells. You have to infer the table grid from item positions, which requires heuristics and thresholds, and which fails on borderless tables.

Display-math blocks. LaTeX equation environments produce glyph runs that PDF.js collapses into single items in getTextContent(). The full equation arrives as one item whose width spans the display block. Individual characters are not surfaced. If you try to detect column boundaries on a LaTeX paper using item X-extents, you will find that display equations bridge every candidate column gap.

Column geometry. Multi-column layouts in publishing tools often include explicit vertical rules, which are path operators drawing a line at the column boundary. These are in getOperatorList(). They are not in getTextContent(). Column detection from text positions is an inference. Column detection from an explicit vertical rule at the same X is a fact.

Reading order. getTextContent() returns items in paint order, not reading order. For a 2-column document, that might be reading order, or it might not, depending on how the PDF was authored. The struct tree, for well-tagged documents, returns leaves in reading order by design.
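As an illustration of the table point above, here is a hedged sketch of pulling Table → TR → TD out of a struct tree. It assumes the node shape PDF.js documents for getStructTree(): elements are `{ role, children }` and leaves are `{ type, id }`. The function names are illustrative, and nested tables inside a cell are deliberately not handled.

```javascript
// Collect the ids of all content leaves under a node.
function collectLeafIds(node, out = []) {
  if (!node) return out;
  if (node.id !== undefined && !node.children) {
    out.push(node.id);
    return out;
  }
  for (const child of node.children || []) collectLeafIds(child, out);
  return out;
}

// Return every table as rows -> cells -> leaf ids.
function extractTables(node, tables = []) {
  if (!node || !node.children) return tables;
  if (node.role === 'Table') {
    const rows = [];
    const visit = (n) => {
      for (const child of n.children || []) {
        if (child.role === 'TR') {
          rows.push(
            (child.children || [])
              .filter((c) => c.role === 'TD' || c.role === 'TH')
              .map((c) => collectLeafIds(c))
          );
        } else if (child.children) {
          visit(child); // step through wrappers such as THead / TBody
        }
      }
    };
    visit(node);
    tables.push(rows);
    return tables;
  }
  for (const child of node.children) extractTables(child, tables);
  return tables;
}
```

No positions or thresholds appear anywhere in this walk. Given the tagging problems described earlier, the output still deserves a sanity check (for example, consistent cell counts per row) before it is trusted.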

The Cascade Is Not Optional

The correct architecture is a cascade:

1. Try getStructTree(). If table regions are present, extract them directly. No column detection needed.
2. Try getOperatorList() geometry: full-height vertical rules, clip stack. If column rules are present, use them directly. No text-based inference needed.
3. Fall through to getTextContent() with geometric inference (bipartite partition, stream detection). This is correct for untagged documents with minimal path geometry.

This is not three times the work. Tiers 1 and 2 are fast exits. If the struct tree has tables, you skip all the geometry inference for those zones. If a vertical rule is present, you skip the bipartite algorithm. The fallback (Tier 3) only runs when no higher-fidelity signal is available, which is most documents today, but not most well-authored documents.
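The fast-exit shape of the cascade can be sketched as below. Everything here is illustrative: `pageData` is a hypothetical plain object holding a struct tree, a list of line segments already extracted from path operators as `{ x0, y0, x1, y1 }`, and the page height. The 80% coverage threshold for "full-height" and the 0.5-unit tolerance are assumptions, not values from the diagnostic.

```javascript
// True if any node in the struct tree carries the given role.
function hasRole(node, role) {
  if (!node) return false;
  if (node.role === role) return true;
  return (node.children || []).some((c) => hasRole(c, role));
}

// X-positions of vertical segments that span most of the page height.
// minCoverage and tol are assumed thresholds.
function findColumnRules(segments, pageHeight, minCoverage = 0.8, tol = 0.5) {
  return segments
    .filter((s) =>
      Math.abs(s.x0 - s.x1) <= tol &&
      Math.abs(s.y1 - s.y0) >= minCoverage * pageHeight)
    .map((s) => (s.x0 + s.x1) / 2);
}

// Pick the highest-fidelity tier that has a usable signal.
function chooseTier(pageData) {
  // Tier 1: tagged tables in the struct tree
  if (hasRole(pageData.structTree, 'Table')) {
    return { tier: 1 };
  }
  // Tier 2: explicit column rules in the path geometry
  const columnRules = findColumnRules(pageData.segments || [], pageData.height);
  if (columnRules.length > 0) {
    return { tier: 2, columnRules };
  }
  // Tier 3: nothing better available, infer from text positions
  return { tier: 3 };
}
```

This version decides per page for brevity. The cascade as described works per zone, so a real implementation would return Tier 1 for the table regions and keep going for the rest of the page.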

The Uncomfortable Part

Running the cascade as a diagnostic on three test PDFs shows that all three fall through to Tier 3. No struct tree and no vertical column rules in any of them.

This could be read as: the cascade doesn't help for documents people actually use.

The correct reading is: the test suite is three PDFs, and all three happen to be untagged. Amazon earnings releases, Siemens engineering manuals, and LaTeX preprints produce no struct tree output by default. But PDFs exported from Microsoft Word with the "Document structure tags for accessibility" option, from Adobe Acrobat with the accessibility features enabled, or from InDesign with tagged export all carry struct trees.

The cascade will be exercised when the document population expands. The diagnostic confirms the fallback is correct. The architecture is in place. The next step is the display-math filter: two lines in the fallback scan that make the LaTeX failure case work without touching anything else.
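The post does not show the display-math filter itself. One plausible shape, offered as a sketch rather than the actual fix, is to drop items too wide to belong to a single column before scanning for column gaps. It assumes getTextContent() items with `width` and a `transform` whose fifth element is the X origin; `maxFraction` and both function names are hypothetical.

```javascript
// Drop items too wide to sit inside one column.
// maxFraction is an assumed threshold, not a value from the post.
function dropWideItems(items, pageWidth, maxFraction = 0.6) {
  return items.filter((it) => it.width <= maxFraction * pageWidth);
}

// Widest X-interval not covered by any item, or null if there is none.
function widestGap(items) {
  const spans = items
    .map((it) => [it.transform[4], it.transform[4] + it.width])
    .sort((a, b) => a[0] - b[0]);
  let best = null;
  let right = -Infinity; // rightmost X covered so far
  for (const [x0, x1] of spans) {
    if (right !== -Infinity && x0 > right &&
        (best === null || x0 - right > best.width)) {
      best = { from: right, to: x0, width: x0 - right };
    }
    right = Math.max(right, x1);
  }
  return best;
}

const items = [
  { str: 'left column', transform: [1, 0, 0, 1, 50, 700], width: 240 },
  { str: 'right column', transform: [1, 0, 0, 1, 310, 700], width: 240 },
  { str: 'display equation', transform: [1, 0, 0, 1, 60, 400], width: 480 },
];
console.log(widestGap(items));                     // → null
console.log(widestGap(dropWideItems(items, 612))); // → { from: 290, to: 310, width: 20 }
```

With the wide equation item in place, no gap survives; once it is filtered out, the 20-unit gutter between the two columns becomes visible again.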
