Tier Restructure Deepdive
Under the Hood: The Three-Tier PDF Extraction Model
TLDR: A manufactured (not scanned) PDF offers a browser-side extractor up to three fidelity levels: a semantic structure tree (present only when the file is tagged), a geometric paint stream, and a text convenience API derived from the paint stream. Most tools use only the third. The correct architecture reads them top-down and exits as soon as a tier produces a complete answer.
The API Hierarchy
PDF.js exposes three document-reading APIs. They are not three ways of reading the same data. They are three different data sources at different levels of the PDF format.
getTextContent()
This is a convenience API. It returns positioned text items: { str, transform, width, height } for every glyph run on the page. You get text, position, and a typographic advance width.
The important fact: this API is derived from the same content stream that getOperatorList() exposes. PDF.js walks the page's operators, collects text paint operators (Tj, TJ, ', "), applies the current text matrix and CTM, and packages the results as text items. getTextContent() does not read a different part of the PDF; it is a processed view of the paint stream.
This derivation has a cost: advance widths are typographic, not ink widths. For most text, these are the same. For math, they are not. A subscript character with a large italic correction has an advance width that includes white space intended for the next character. Equation blocks composed from individual glyphs may have items whose total advance width is wider than their actual ink.
More critically for extraction: PDF.js may not produce individual items for every character in a math equation. Display-math blocks from LaTeX can arrive as single items spanning the full equation width. The individual character positions are not surfaced.
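To make the item shape concrete, here is a minimal sketch that turns one text item into an axis-aligned box. It assumes unrotated text, takes the origin from the last two entries of `transform`, and treats `width` as the advance width discussed above. `itemToBox` is a hypothetical helper for illustration, not part of PDF.js.

```javascript
// Hypothetical helper: axis-aligned box for one text item.
// Assumes item.transform is [a, b, c, d, e, f] with the origin at (e, f).
function itemToBox(item) {
  const [, b, c, , e, f] = item.transform;
  // b and c are zero only when the text is neither rotated nor skewed.
  const rotated = Math.abs(b) > 1e-6 || Math.abs(c) > 1e-6;
  return {
    x0: e,
    y0: f,
    x1: e + item.width, // advance width, which may exceed the visible ink
    y1: f + item.height,
    rotated,
  };
}

// Synthetic item, shaped like the entries described above.
const sampleBox = itemToBox({
  str: "x_i",
  transform: [10, 0, 0, 10, 72, 700],
  width: 14.5,
  height: 10,
});
```

For rotated items the box is not meaningful as written, which is why the sketch reports `rotated` rather than guessing.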
getOperatorList()
This is the raw paint stream. Every drawing command the PDF renderer would execute: move-to, line-to, curve-to, rectangle, fill, stroke, set-color, set-font, save, restore. The CTM stack is implicit — q pushes, Q pops, cm multiplies into the current matrix.
This is the ground truth for geometry. Table lines, box borders, background fills, column rules — they are all explicit path operators here. Nothing is inferred.
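Text paint operators (Tj, TJ) appear in the operator list in paint order, alongside the operators that set the text matrix. These are the same operators getTextContent() reads. The difference: in a tagged PDF, the operator list also shows the BDC...EMC marked-content blocks that wrap each paint operation, and the BDC property list carries a Marked Content ID (MCID). A bare BMC opens a block with a tag but no property list, so it carries no MCID.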
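The implicit CTM stack described above can be made explicit in a few lines. The sketch below is an illustration, not the pipeline's ctmAdapter: it assumes the operator list arrives as parallel `fnArray` / `argsArray` arrays and that a transform operator's arguments are the six matrix numbers. The local `OPS` object is a stand-in so the example runs on its own; in real code the constants would come from pdfjs-dist.

```javascript
// Stand-in constants for illustration; real code would import OPS from pdfjs-dist.
const OPS = { save: 10, restore: 11, transform: 12 };

// Concatenate m2 onto m1 (both [a, b, c, d, e, f]): the result applies m2
// first, then m1, which is how `cm` updates the current matrix.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Returns the CTM in effect after each operator.
function ctmPerOp(fnArray, argsArray) {
  let ctm = [1, 0, 0, 1, 0, 0];
  const stack = [];
  const out = [];
  for (let i = 0; i < fnArray.length; i++) {
    const fn = fnArray[i];
    if (fn === OPS.save) {
      stack.push(ctm); // q
    } else if (fn === OPS.restore) {
      if (stack.length > 0) ctm = stack.pop(); // Q, ignoring unbalanced pops
    } else if (fn === OPS.transform) {
      ctm = multiply(ctm, argsArray[i]); // cm
    }
    out.push(ctm);
  }
  return out;
}

// q, translate(10, 20), scale(2), Q
const ctms = ctmPerOp(
  [OPS.save, OPS.transform, OPS.transform, OPS.restore],
  [null, [1, 0, 0, 1, 10, 20], [2, 0, 0, 2, 0, 0], null]
);
// ctms[2] is [2, 0, 0, 2, 10, 20]; ctms[3] is the identity again.
```

Matrices are never mutated in place, so pushing a reference on `save` is enough to restore the earlier state.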
getStructTree()
This is not derived from the paint stream. It is a separate data structure, the logical structure tree, stored as its own objects and reached from the document catalog's StructTreeRoot entry rather than from any page content stream. It is present only in tagged PDFs. It encodes the semantic role of painted elements: Table, TR, TD, TH, P, H1–H6, Figure, Formula, L (list), LI.
Each leaf of the structure tree references marked content by MCID. In a tagged PDF, each BDC operator in the operator list carries the matching MCID in its property list. Joining the two gives you: every glyph run → its semantic role.
This is the data source that most extractors never read.
The MCID Join
The join between structure tree and operator list is the technical centerpiece of Tier 1.
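Walk the operator list once. Maintain an MCID stack: push on every BMC or BDC (with no MCID for BMC), pop on EMC, so the stack stays balanced. When you encounter a text paint operator (Tj, TJ), record the nearest enclosing MCID.

```javascript
// OPS comes from pdfjs-dist; opList is the result of page.getOperatorList().
const mcidStack = [];            // one entry per open marked-content block
const opIndexToMcid = new Map(); // text-op index -> enclosing MCID

for (let i = 0; i < opList.fnArray.length; i++) {
  const fn = opList.fnArray[i];
  const args = opList.argsArray[i];
  if (fn === OPS.beginMarkedContentProps) {
    // args[1] may be the MCID itself or a properties object, depending on
    // the PDF.js version; accept both shapes.
    const props = args[1];
    mcidStack.push(typeof props === "number" ? props : props?.MCID);
  } else if (fn === OPS.beginMarkedContent) {
    mcidStack.push(undefined); // BMC has no MCID, but its EMC still pops
  } else if (fn === OPS.endMarkedContent) {
    mcidStack.pop();
  } else if (fn === OPS.showText || fn === OPS.showSpacedText) {
    const currentMcid = mcidStack.findLast((m) => m !== undefined);
    if (currentMcid !== undefined) opIndexToMcid.set(i, currentMcid);
  }
}
```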
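Then walk the structure tree. Each Table node contains TR nodes (sometimes inside THead or TBody), which contain TH and TD nodes. Each cell references one or more MCIDs, either directly or through a child such as P. Map those MCIDs to the text items collected above.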
Result: every text item has a semantic role. Tables fall out as TD nodes grouped by TR grouped by Table. No column detection. No stream detection. No threshold.
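The structure-tree half of the join can be sketched as a plain recursive walk. The node shapes here are an assumption modelled on what getStructTree() returns: element nodes look like `{ role, children }` and leaves look like `{ type: "content", id }`. The helper names are hypothetical. How a leaf `id` maps back to the numeric MCID recorded in the operator-list pass depends on the PDF.js version, so that conversion is deliberately left out.

```javascript
// Assumed node shapes:
//   element: { role: "Table", children: [...] }
//   leaf:    { type: "content", id: "<marked-content id>" }

// All marked-content ids underneath a node, in document order.
function contentIds(node, out = []) {
  if (node.type === "content") out.push(node.id);
  for (const child of node.children || []) contentIds(child, out);
  return out;
}

// Nearest descendants with one of the given roles, looking through wrappers
// such as Document, Sect, THead, or TBody.
function descendantsWithRole(node, roles) {
  const found = [];
  for (const child of node.children || []) {
    if (roles.includes(child.role)) found.push(child);
    else found.push(...descendantsWithRole(child, roles));
  }
  return found;
}

// tables -> rows -> cells -> content ids
function collectTables(root) {
  return descendantsWithRole(root, ["Table"]).map((table) =>
    descendantsWithRole(table, ["TR"]).map((row) =>
      descendantsWithRole(row, ["TD", "TH"]).map((cell) => contentIds(cell))
    )
  );
}

// Synthetic tree for illustration.
const sampleTree = {
  role: "Root",
  children: [{
    role: "Document",
    children: [
      { role: "P", children: [{ type: "content", id: "mc0" }] },
      { role: "Table", children: [{ role: "TBody", children: [{
        role: "TR",
        children: [
          { role: "TH", children: [{ type: "content", id: "mc1" }] },
          { role: "TD", children: [{ role: "P", children: [
            { type: "content", id: "mc2" },
            { type: "content", id: "mc3" },
          ] }] },
        ],
      }] }] },
    ],
  }],
};
const sampleTables = collectTables(sampleTree);
// One table, one row, two cells: [[[["mc1"], ["mc2", "mc3"]]]]
```

Row and column membership comes straight from the nesting, which is why no geometric threshold appears anywhere in the walk.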
Tier 2: What ctmAdapter Is Already Close To Providing
The current pipeline reads the operator list via ctmAdapter.js, which emits subpath records and filled rectangles. It already collects vSegs (vertical segments).
The missing piece: nobody checks whether any vSeg spans the full content height. A vertical rule in an engineering manual or newsletter that runs from top to bottom of the content area is explicit column geometry. It is more reliable than any inference from text positions.
```javascript
const contentHeight = contentBottom - contentTop;
const columnRules = vSegs.filter((s) => {
  const len = Math.abs(s.y2 - s.y1);
  const midX = (s.x1 + s.x2) / 2;
  // Long enough to be a column rule, and away from the page margins.
  return len >= contentHeight * 0.60
    && midX >= vpWidth * 0.10
    && midX <= vpWidth * 0.90;
});
```
If columnRules.length > 0, their X positions are used directly as column splits. The bipartite algorithm is skipped entirely. A geometric fact is used as a geometric fact.
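One way to turn detected rules into column assignments is sketched below. It assumes each rule has `x1` / `x2` as in the filter above and each text item exposes a left edge `x` and a `width`; that item shape, the 2-unit merge tolerance, and both function names are illustrative assumptions rather than the pipeline's actual interface.

```javascript
// Collapse near-duplicate rules (for example a double line) into one split.
function splitsFromRules(columnRules, tolerance = 2) {
  const xs = columnRules
    .map((s) => (s.x1 + s.x2) / 2)
    .sort((p, q) => p - q);
  const splits = [];
  for (const x of xs) {
    if (splits.length === 0 || x - splits[splits.length - 1] > tolerance) {
      splits.push(x);
    }
  }
  return splits;
}

// Column index = number of splits lying left of the item's centre.
function columnIndex(item, splits) {
  const centre = item.x + item.width / 2;
  return splits.filter((x) => x < centre).length;
}

// Synthetic rules: a double line near x = 300 and a single rule at x = 150.
const sampleSplits = splitsFromRules([
  { x1: 300, x2: 300, y1: 50, y2: 700 },
  { x1: 301, x2: 301, y1: 50, y2: 700 },
  { x1: 150, x2: 150, y1: 50, y2: 700 },
]);
// sampleSplits is [150, 300]
```

Using the item's centre rather than its left edge keeps an item that slightly overhangs a rule in the column where most of it sits.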
Two other operator list signals are currently discarded:
- Clip stack: W / W* operators define clip regions. Some PDFs clip each column to its column rectangle. Adjacent clip regions with a gap encode a column layout without any inference.
- Paint order: a text item painted after a filled rectangle is visually on top of that rectangle. When two filled regions overlap the same text item, paint order disambiguates which region the text belongs to.
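Both signals reduce to small geometric checks once the rectangles are available. The sketch below assumes clip and fill rectangles have already been extracted into page space as `{ x0, y0, x1, y1, opIndex }` records, where `opIndex` is the position in the operator list. The record shape, the `minGap` default, and the rule "the last fill painted before the text wins" are assumptions made for illustration.

```javascript
// Gutter centres implied by horizontally separated clip rectangles.
function guttersFromClips(clipRects, minGap = 4) {
  const sorted = [...clipRects].sort((p, q) => p.x0 - q.x0);
  const gutters = [];
  let right = -Infinity; // right-most edge seen so far
  for (const r of sorted) {
    if (right !== -Infinity && r.x0 - right >= minGap) {
      gutters.push((right + r.x0) / 2);
    }
    right = Math.max(right, r.x1);
  }
  return gutters;
}

// Among fills containing the text centre, pick the one painted most
// recently before the text itself.
function owningRegion(text, filledRects) {
  let owner = null;
  for (const r of filledRects) {
    const inside =
      text.cx >= r.x0 && text.cx <= r.x1 && text.cy >= r.y0 && text.cy <= r.y1;
    const paintedBefore = r.opIndex < text.opIndex;
    if (inside && paintedBefore && (owner === null || r.opIndex > owner.opIndex)) {
      owner = r;
    }
  }
  return owner;
}

// Two column clips with a 20-unit gap between them.
const sampleGutters = guttersFromClips([
  { x0: 310, y0: 50, x1: 550, y1: 740, opIndex: 40 },
  { x0: 50, y0: 50, x1: 290, y1: 740, opIndex: 3 },
]);
// sampleGutters is [300]

// A page background, a callout box, and a later fill over the same area.
const sampleFills = [
  { x0: 0, y0: 0, x1: 600, y1: 800, opIndex: 5 },
  { x0: 100, y0: 100, x1: 300, y1: 200, opIndex: 9 },
  { x0: 100, y0: 100, x1: 300, y1: 200, opIndex: 30 },
];
const sampleOwner = owningRegion({ cx: 150, cy: 150, opIndex: 20 }, sampleFills);
// sampleOwner is the callout box painted at opIndex 9
```

The fill at `opIndex` 30 is ignored because it was painted after the text, so it cannot be the background the text sits on.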
Mode-S vs Median-S
PageScale S is the body font size — the calibration constant from which all thresholds are derived. Currently: S = median(vFont) across all text items on the page.
For LaTeX papers with subscripts, superscripts, and equation characters, the distribution of font sizes is not unimodal. There is a cluster of body-text items at 10pt and a long tail of math characters at 6–7pt. The median of this distribution is pulled toward the tail.
The fix is the mode, computed on 0.5pt bins:
```javascript
const bins = new Map();
for (const tm of textMeta) {
  const bin = Math.round(tm.vFont * 2) / 2; // snap to 0.5pt bins
  bins.set(bin, (bins.get(bin) || 0) + 1);
}
let modeFont = 12, modeCount = 0; // 12pt fallback for an empty page
for (const [bin, count] of bins) {
  if (count > modeCount) { modeCount = count; modeFont = bin; }
}
const S = modeFont;
```
The mode is the body text size on any well-structured document. Subscripts are a tail in the frequency distribution, not the center of mass.
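A small synthetic comparison shows how the two statistics can diverge. The font sizes below are invented and deliberately exaggerated: math-sized items outnumber body items, which is more extreme than the test documents discussed here. `medianFontSize` and `modeFontSize` are illustrative helper names.

```javascript
function medianFontSize(sizes) {
  const sorted = [...sizes].sort((p, q) => p - q);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function modeFontSize(sizes, fallback = 12) {
  const bins = new Map();
  for (const size of sizes) {
    const bin = Math.round(size * 2) / 2; // 0.5pt bins
    bins.set(bin, (bins.get(bin) || 0) + 1);
  }
  let best = fallback;
  let bestCount = 0;
  for (const [bin, count] of bins) {
    if (count > bestCount) { bestCount = count; best = bin; }
  }
  return best;
}

// Five body items at 10pt, six math items split between 7pt and 6pt.
const sampleSizes = [10, 10, 10, 10, 10, 7, 7, 7, 6, 6, 6];
const medianS = medianFontSize(sampleSizes); // 7
const modeS = modeFontSize(sampleSizes);     // 10
```

The math items are spread over two bins, so no single small size outvotes the 10pt body cluster even though small items are the majority.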
What the Diagnostic Found
Running all three test PDFs through a read-only diagnostic harness produced:
| PDF | Struct tree | Column rules | Mode-S vs Median-S | Tier 3 result |
|-----|-------------|--------------|---------------------|---------------|
| AMZN (financial) | Absent | None | Identical | 0 splits ✓ |
| 59MN7C (engineering) | Absent | None found (diagnostic param bug) | Identical | Splits detected ✓ |
| raiko-aistats (LaTeX) | Absent | None | 0.1pt divergence | 0 splits ✗ |
The tiered architecture is correct in design. For the current test documents, all three PDFs fall through to Tier 3. The struct tree is not used because it is not present. Column rules are not used because none were detected; for 59MN7C that result is not yet conclusive, since the diagnostic had a parameter bug.
The failure on raiko-aistats is entirely within Tier 3: the fallback crossing scan is not pre-filtering anomalously wide items. That is the next fix.
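One plausible shape for that pre-filter is sketched below. It is a guess at the fix, not the implemented one: it sets aside any item whose advance width exceeds a fraction of the page width before the crossing scan runs. The function name, the 0.6 default, and the `{ str, width }` item shape are all assumptions.

```javascript
// Hypothetical pre-filter: separate items that are too wide to be trusted
// as evidence for or against a column gutter.
function dropWideItems(items, pageWidth, maxFraction = 0.6) {
  const kept = [];
  const dropped = [];
  const limit = pageWidth * maxFraction;
  for (const item of items) {
    (item.width > limit ? dropped : kept).push(item);
  }
  return { kept, dropped };
}

// Synthetic page: two column-width runs and one equation-wide item.
const sampleFiltered = dropWideItems(
  [
    { str: "left column text", width: 240 },
    { str: "right column text", width: 235 },
    { str: "display equation", width: 500 },
  ],
  612
);
// sampleFiltered.kept has 2 items, sampleFiltered.dropped has 1
```

Returning the dropped items as well keeps them available for later stages, since they are still real content and only unreliable as column evidence.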
The Hierarchy Summary
getStructTree() — semantic ground truth, MCID-joined to paint stream
getOperatorList() — geometric ground truth, path operators + text operators
getTextContent() — derived convenience, lossy (no MCID, typographic widths)
Every PDF exported as tagged PDF from Word, InDesign, or a publishing tool has a populated struct tree; untagged exports, including all three test documents above, do not. Every manufactured PDF (not scanned) has a full operator list. getTextContent() is the right tool for the roughly 80% of cases where you just need text and approximate positions.
For extraction that needs to correctly classify tables, columns, and reading order across arbitrary document types, you need all three APIs and a cascade strategy for which one to trust.
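The cascade itself can be very small. The sketch below assumes each tier is a function that returns a result or `null` when it cannot give a complete answer; the tier names, the input shape, and the stub tier bodies are placeholders rather than the real pipeline.

```javascript
// Try tiers in order and stop at the first one that produces an answer.
function runCascade(tiers, input) {
  for (const tier of tiers) {
    const result = tier.run(input);
    if (result !== null) return { tier: tier.name, result };
  }
  return { tier: "none", result: null };
}

// Stub tiers for illustration only.
const sampleTiers = [
  {
    name: "structTree",
    run: (page) => (page.structTree ? { source: page.structTree } : null),
  },
  {
    name: "columnRules",
    run: (page) =>
      page.columnRules.length > 0 ? { splits: page.columnRules } : null,
  },
  {
    name: "textInference",
    run: () => ({ splits: [] }), // last resort, always answers
  },
];

// No struct tree, one column rule at x = 306: Tier 2 answers.
const sampleOutcome = runCascade(sampleTiers, {
  structTree: null,
  columnRules: [306],
});
// sampleOutcome.tier is "columnRules"
```

Keeping the tiers as data makes the trust order explicit and easy to log, which helps when a document unexpectedly falls through to the last tier.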