Pdf Processor

Tier Restructure Postmortem

2026-07-11

Post-Mortem: Diagnosing a PDF Pipeline That Works for Two PDFs and Fails on the Third

TLDR: We stress-tested a deterministic column detection pipeline against a LaTeX academic paper. The hypothesis going in was wrong. The real failure mode was found in the data, not in the algorithm. Here is what we assumed, what we found, and where the work actually needs to happen.

The Setup

The PDF extraction pipeline uses a bipartite band partition algorithm to detect 2-column layouts. It was built against two test documents: an Amazon earnings release (single-column financial) and a Siemens engineering manual (2-column, rich path geometry). Both work correctly.

The third document, a LaTeX academic paper, was always expected to be the harder case. It did fail, but not in the way we predicted.

The Wrong Hypothesis

LaTeX PDFs have a lot of math. Math characters are typeset using font metrics where the advance width (how far the cursor moves after placing a glyph) does not match the ink width (how wide the character actually is). The assumption was:

1. getTextContent() text items have advance-width-based widths.
2. Math glyphs have inflated advances.
3. Many math items pull the median font size toward math sizes (~7pt).
4. All thresholds calibrated from the body font size (PageScale S) are therefore miscalibrated.
5. Column detection breaks because the gutter threshold is wrong.

This is a reasonable hypothesis. It is also completely incorrect for this document.

The diagnostic computed S two ways on every page, once from the median font size and once from the mode. The divergence was 0.1pt: on some pages the mode was 9.5pt and the median was 9.4pt. These are not meaningfully different, so the calibration path is not the failure.
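The median-versus-mode comparison can be sketched as below. This is a minimal illustration, not the diagnostic harness itself: the function name `fontSizeStats`, the input shape (one font size in pt per text item), and the 0.1pt bucketing for the mode are all assumptions.

```javascript
// Hypothetical helper: summarize one page's font sizes (in pt).
// `sizes` is assumed to hold one number per text item.
function fontSizeStats(sizes, bucketsPerPt = 10) {
  if (sizes.length === 0) return { median: NaN, mode: NaN, divergence: NaN };

  // Median: middle of the sorted list (mean of the two middle values if even).
  const sorted = [...sizes].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;

  // Mode: most frequent size after bucketing to 1/bucketsPerPt pt.
  const counts = new Map();
  for (const s of sizes) {
    const bucket = Math.round(s * bucketsPerPt);
    counts.set(bucket, (counts.get(bucket) || 0) + 1);
  }
  let bestBucket = null;
  let bestCount = 0;
  for (const [bucket, count] of counts) {
    if (count > bestCount) {
      bestBucket = bucket;
      bestCount = count;
    }
  }
  const mode = bestBucket / bucketsPerPt;

  return { median, mode, divergence: Math.abs(mode - median) };
}

// A page like the ones we measured: body text dominates, a little math.
const measured = fontSizeStats([9.5, 9.5, 9.5, 9.4, 9.4, 7, 7]);

// A made-up page where scattered small sizes drag the median down.
const stressed = fontSizeStats([6, 7, 7.5, 8, 9.5, 9.5, 9.5]);
```

On the first sample the two statistics land 0.1pt apart, as in our measurements. On the second, invented sample the median drops to 8pt while the mode stays at 9.5pt, which is the kind of page the original hypothesis was worried about.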

The Real Failure

After ruling out calibration, we looked at the actual items being processed.

For a standard 2-column LaTeX paper, the column gutter sits at X≈310 in a viewport roughly 620px wide. The bipartite algorithm's fallback path, which runs when interval merge finds no clean gap, walks candidate X values and counts how many items cross each one. The split point should be the X that the fewest items cross.
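A minimal sketch of that crossing scan, to make the mechanics concrete. The item shape `{ x, width }`, the function names, the candidate list, and the tie-breaking rule are assumptions for illustration; the real fallback in the pipeline may differ.

```javascript
// Assumed item shape: { x, width } in viewport units (hypothetical field names).

// An item crosses a candidate X when X falls strictly inside its horizontal extent.
function countCrossings(items, x) {
  return items.filter((it) => it.x < x && x < it.x + it.width).length;
}

// Walk the candidates, record each crossing count, and keep the lowest.
// Ties go to the first candidate seen, which is an arbitrary choice here.
function scanCandidates(items, candidates) {
  let best = null;
  const counts = candidates.map((x) => {
    const crossings = countCrossings(items, x);
    if (best === null || crossings < best.crossings) best = { x, crossings };
    return { x, crossings };
  });
  return { counts, best };
}

// Two columns of word-sized items with an empty gutter around X≈310.
const items = [
  { x: 90, width: 30 },  // left column, covers X=100
  { x: 140, width: 30 }, // left column, covers X=150
  { x: 440, width: 30 }, // right column, covers X=450
  { x: 540, width: 30 }, // right column, covers none of the candidates
];
const result = scanCandidates(items, [100, 150, 310, 450]);
console.log(result.best); // { x: 310, crossings: 0 }
```

With well-behaved items, the gutter candidate is the only one that nothing crosses, so it wins the scan.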

On raiko-aistats-12.pdf, the fallback finds that every candidate X has the same high crossing count. Zero splits detected.

Why? PDF.js getTextContent() for this document does not expose individual math characters as separate items. Entire display-math equations arrive as single text items. A display-math block is full-width: it spans from the left column's left edge to the right column's right edge, crossing X≈310 along with X≈100, X≈150, and every other candidate.

When the fallback scan runs, it counts one or two wide equation items as crossing all candidates. No candidate has a lower crossing count than any other. No split is selected.

The gutter is real. The columns are real. The equations just happen to sit across the gutter and overwhelm the crossing count.

What Survived

The interval merge stage is correct. It finds no clean gap because the equation items span the full page width in the getTextContent() output, so there is no gap to find in the item X-extents. The problem is not in interval merge; it is in how the fallback scan treats anomalously wide items.

The calibration work (mode-S vs median-S) is still worth doing. The 0.1pt divergence on these three documents doesn't mean the fix is unnecessary. It means these three documents happen not to stress the calibration path. A document with 60% subscripts would.

The three-tier architecture (getStructTree → getOperatorList → getTextContent) is correct as a long-term model. But for these three specific documents, tier selection resolves to the same thing every time: Tier 3 (bipartite fallback) is the only active path. The struct tree is absent from all three. Full-height vertical column rules are absent from all three.
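Tier selection under this model can be sketched roughly as follows. This is an illustration under assumptions, not the planned classifyPage: the function name `selectTier`, the input shape, and the idea that Tier 2 is triggered by full-height column rules recovered from the operator list are all hypothetical readings of the description above.

```javascript
// Hypothetical tier selector. Inputs are assumed to be precomputed per page:
//   structTree:  the struct-tree result, or null when the PDF has none
//   columnRules: full-height vertical rules recovered from the operator list
function selectTier({ structTree, columnRules }) {
  // Tier 1: trust the document's own structure when it exists and is non-empty.
  if (structTree && Array.isArray(structTree.children) && structTree.children.length > 0) {
    return { tier: 1, source: "getStructTree" };
  }
  // Tier 2: use drawn column rules as direct evidence of the column boundary.
  if (Array.isArray(columnRules) && columnRules.length > 0) {
    return { tier: 2, source: "getOperatorList" };
  }
  // Tier 3: fall back to geometry inferred from text items.
  return { tier: 3, source: "getTextContent" };
}

// All three test documents look like this: no struct tree, no column rules.
const selected = selectTier({ structTree: null, columnRules: [] });
console.log(selected.tier); // 3
```

Ordering the checks from most to least authoritative means the bipartite fallback only runs when the PDF offers nothing better, which is the situation for every document tested so far.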

The Actual Fix

The fix is a pre-filter on the fallback crossing scan. Items where vWidth > S * 4 are anomalously wide relative to the body font size. They are display-math blocks, full-width images, or full-width headers, not normal paragraph text. They should not contribute to the crossing count in the fallback scan because they are not evidence about where the column boundary is.

Filter them before the crossing scan runs. The surviving narrow items will have the correct gap pattern. Splits will be found.
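A sketch of the pre-filter, assuming items shaped like `{ x, vWidth }`. Only `vWidth`, `S`, and the factor of 4 come from the description above; the `x` field and both function names are hypothetical, and this is not the actual change to _detectPageColumns.

```javascript
// Assumed item shape: { x, vWidth }. `vWidth` is the name used in this post;
// `x` and both function names are hypothetical.

// Keep only items no wider than `factor` times the body font size S.
function dropAnomalouslyWide(items, S, factor = 4) {
  return items.filter((it) => it.vWidth <= S * factor);
}

// Count items whose horizontal extent strictly contains the candidate X.
function crossingsAt(items, x) {
  return items.filter((it) => it.x < x && x < it.x + it.vWidth).length;
}

const S = 9.5; // body font size in pt, so the cutoff is 38
const pageItems = [
  { x: 90, vWidth: 30 },   // left-column word
  { x: 440, vWidth: 30 },  // right-column word
  { x: 50, vWidth: 520 },  // display-math block spanning both columns
];
const narrow = dropAnomalouslyWide(pageItems, S);

console.log(crossingsAt(pageItems, 310)); // 1
console.log(crossingsAt(narrow, 310));    // 0
```

The equation item is the only thing crossing the gutter candidate. Once it is dropped, the gutter is empty again and the scan has a clear minimum to pick.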

This is a two-line change in _detectPageColumns. It does not touch the interval merge path, the three gates, or the bipartite structure.

What the Architecture Document Changed

Before this session: the pipeline had one active code path for all documents, calibrated by a median that is theoretically wrong for math-heavy documents.

After this session: the architecture document defines a three-tier model, a diagnostic harness confirms which tier is active per document, and the specific failure mode for LaTeX papers is root-caused to item width anomalies, not calibration.

The heavy restructure (structTreeReader, ctmAdapter extensions, tiered classifyPage) is still ahead. But the next actionable fix, the one that makes raiko-aistats-12.pdf work, is the display-math pre-filter, not the restructure.
