Pdf Processor

Tier Restructure Errorfix

2026-07-11

Error Fix: Why the LaTeX PDF Produces Zero Column Splits

TLDR: The column detector found 0 splits on a clearly 2-column academic paper. The assumed cause (math glyph advance widths miscalibrating the font size baseline) was wrong. The real cause was that entire equation blocks arrive from getTextContent() as single items wide enough to bridge the column gutter, making every candidate split point look equally blocked.

The Symptom

Running the bipartite column detection algorithm on raiko-aistats-12.pdf (a 2-column LaTeX academic paper) produced 0 splits on every page. The expected result is one split per page at approximately X≈310.

The Wrong Diagnosis

LaTeX documents typeset mathematics using font glyphs with typographic advance widths that include italic corrections and spacing. The assumption was:

Many small math characters per page → median font size pulled below body text size
PageScale S = median(vFont) returns ~8pt instead of ~10pt
All thresholds derived from S are miscalibrated (too tight)
The column gap threshold fails to register the gutter

This is plausible because the mechanism is real: median IS pulled by math characters on some documents. The diagnostic was built partly to test this hypothesis.

The diagnostic ran mode-S and median-S independently for every page of every test document. On raiko-aistats:

Page 1: mode=9.5  median=9.4  divergence=0.1pt
Page 2: mode=9.5  median=9.4  divergence=0.1pt
Page 3: mode=9.5  median=9.4  divergence=0.1pt
...

A 0.1pt divergence does not explain a threshold failure. The calibration hypothesis is wrong for this document.
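
The mode-versus-median comparison can be sketched as follows. This is a minimal illustration, not the diagnostic itself: the function names, the 0.1pt bucketing, and the sample font sizes are all assumptions made for the example.

```javascript
// Hypothetical helpers; names and the 0.1pt bucketing are assumptions.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

function mode(values, bucketsPerUnit = 10) {
  // Bucket sizes to the nearest 1/bucketsPerUnit so float noise does not split a peak.
  const counts = new Map();
  for (const v of values) {
    const key = Math.round(v * bucketsPerUnit);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let bestKey = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      bestKey = key;
      bestCount = count;
    }
  }
  return bestKey / bucketsPerUnit;
}

function scaleDivergence(fontSizes) {
  const m = mode(fontSizes);
  const med = median(fontSizes);
  return { mode: m, median: med, divergence: Math.abs(m - med) };
}

// Made-up samples. Body text dominates the first; small math glyphs
// outnumber body text in the second, which pulls the median but not the mode.
const bodyHeavy = [...Array(60).fill(10), ...Array(25).fill(7)];
const mathHeavy = [
  ...Array(40).fill(10),
  ...Array(25).fill(7),
  ...Array(25).fill(5),
];

console.log(scaleDivergence(bodyHeavy)); // { mode: 10, median: 10, divergence: 0 }
console.log(scaleDivergence(mathHeavy)); // { mode: 10, median: 7, divergence: 3 }
```

A page that behaves like the second sample would support the calibration hypothesis. The raiko-aistats pages, at 0.1pt divergence, behave like the first.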

Finding the Real Cause

After ruling out calibration, the diagnostic printed item widths for each page. Display-math blocks, which are equations typeset in the center of the page, appeared as single text items with vWidth values of 400–500px on a ~620px viewport. Normal paragraph text items had widths of 20–120px.

The bipartite algorithm has a fallback path. When interval merge produces no candidate gap (because some items are wide enough to cover every gap), it falls back to a minimum-crossing scan: for each candidate X value, it counts how many text items cross that X, and selects the X with the lowest crossing count as the split.

But with one or two display-math items spanning most of the page width (400–500px of ~620px):

Each of those items crosses X≈100, X≈200, X≈310, X≈400, and every other candidate
The crossing count at every candidate X includes these wide items
No candidate has a lower crossing count than any other
The fallback scan selects nothing → 0 splits

The gutter at X≈310 is structurally real. Normal paragraph text on the left ends before X≈310. Normal text on the right starts after X≈310. But the equation items drown out this signal.
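
A small sketch of the crossing count makes the effect concrete. The item shape ({ x, vWidth } in viewport pixels) and all coordinates are invented for illustration; the real textMeta objects and the scan's selection rule are not shown in this post. The sketch shows only that an item bridging the gutter adds one crossing to every candidate inside its span, so the gutter is no longer crossing-free.

```javascript
// Hypothetical item shape: { x, vWidth } in viewport pixels.
function crossingCount(items, candidateX) {
  // An item crosses X when it starts before X and ends after it.
  return items.filter(
    (it) => it.x < candidateX && it.x + it.vWidth > candidateX
  ).length;
}

// Made-up layout with a gutter around X≈310.
const leftColumn = [
  { x: 60, vWidth: 110 },  // spans 60..170
  { x: 175, vWidth: 120 }, // spans 175..295
];
const rightColumn = [
  { x: 325, vWidth: 100 }, // spans 325..425
  { x: 430, vWidth: 115 }, // spans 430..545
];
const displayMath = [{ x: 80, vWidth: 460 }]; // spans 80..540

const candidates = [100, 200, 310, 400];
const withoutMath = [...leftColumn, ...rightColumn];
const withMath = [...withoutMath, ...displayMath];

for (const x of candidates) {
  console.log(x, crossingCount(withoutMath, x), crossingCount(withMath, x));
}
// 100 1 2
// 200 1 2
// 310 0 1
// 400 1 2
```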

The Fix

Pre-filter the item list before the fallback crossing scan. Items where vWidth > S * 4 are anomalously wide relative to the body font size. On a document with S=9.5pt, that threshold is roughly 9.5 × 4 × viewport_scale ≈ 190px. Display-math blocks at 400–500px are well above this. Normal text items at 20–120px are well below.

// In _detectPageColumns, before the fallback crossing scan
const narrowItems = textMeta.filter(tm => tm.vWidth <= S * 4);
// use narrowItems for crossing count, not textMeta

After this filter, the crossing scan sees only normal paragraph-width items. The gap at X≈310 is clear. Splits are detected.

This filter applies only to the fallback crossing scan. The interval merge stage runs on all items. The three gates (population ≥ 3, local commitment ≥ 40%, vertical persistence) run on all items. Only the fallback scan, which runs when interval merge finds no gap, gets the narrow-only view.
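
The scoping described above can be sketched like this. It is an illustration under stated assumptions, not the code in _detectPageColumns: the function name fallbackSplit, the { x, vWidth } item shape, the explicit maxWidth parameter (standing in for the S * 4 threshold, already expressed in the same units as vWidth), and the rule that a split is accepted only when no narrow item crosses it are all hypothetical.

```javascript
// Sketch only: names, item shape, and the acceptance rule are assumptions.
function fallbackSplit(textMeta, maxWidth, candidates) {
  // Narrow-only view. The caller's textMeta array is left untouched,
  // so any other stage can still run on the full item list.
  const narrowItems = textMeta.filter((tm) => tm.vWidth <= maxWidth);

  let best = null;
  for (const x of candidates) {
    const count = narrowItems.filter(
      (tm) => tm.x < x && tm.x + tm.vWidth > x
    ).length;
    if (best === null || count < best.count) {
      best = { x, count };
    }
  }
  // Assumed rule: accept the split only if no narrow item crosses it.
  return best !== null && best.count === 0 ? best.x : null;
}

// Made-up page: two columns of paragraph text plus one display-math item.
const textMeta = [
  { x: 60, vWidth: 110 },
  { x: 175, vWidth: 120 },
  { x: 325, vWidth: 100 },
  { x: 430, vWidth: 115 },
  { x: 80, vWidth: 460 }, // display math bridging the gutter
];
const candidates = [100, 200, 310, 400];

console.log(fallbackSplit(textMeta, Infinity, candidates)); // null (no filtering)
console.log(fallbackSplit(textMeta, 190, candidates));      // 310
```

Taking the threshold as a parameter keeps the unit question out of the helper: the caller decides how S * 4 maps to viewport pixels.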

The Secondary Bug: reconcile() Argument Count

During diagnostic test construction, the reconciler was called with two arguments:

// Wrong
const { segments, filledRects } = reconcile(subpaths, viewport);

The actual signature is:

// Correct
reconcile(subpaths, rawFilledRects, viewport)

The error message was TypeError: Cannot read properties of undefined (reading 'transform') at pathReconciler.js:230. reconcile() reads viewport from its third parameter, but with only two arguments the viewport object landed in the second parameter (rawFilledRects) and the third was undefined.

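Fix: pass the raw filled rectangles as the second argument. The input needs its own name (rawFilledRects, matching the signature), because passing filledRects while also destructuring a const named filledRects from the result would throw a ReferenceError.

const { segments, filledRects } = reconcile(subpaths, rawFilledRects, viewport);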

This is a test scaffolding bug, not a pipeline bug. The main pipeline in geometryWorker.js already passes all three arguments correctly.
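
The failure mode is easy to reproduce with a stand-in. The stub below is hypothetical and only mimics the one detail the error message reveals, that the function reads viewport.transform; the real pathReconciler.js is not reproduced here.

```javascript
// Hypothetical stand-in for reconcile(); only the parameter order and the
// viewport.transform access are taken from the post.
function reconcileStub(subpaths, rawFilledRects, viewport) {
  const transform = viewport.transform; // throws if viewport is undefined
  return { segments: subpaths, filledRects: rawFilledRects, transform };
}

const subpaths = [];
const rawFilledRects = [];
const viewport = { transform: [1, 0, 0, 1, 0, 0] };

// Two-argument call: viewport lands in rawFilledRects, third parameter is undefined.
let twoArgError = null;
try {
  reconcileStub(subpaths, viewport);
} catch (err) {
  twoArgError = err;
}
console.log(twoArgError instanceof TypeError); // true

// Three-argument call: every parameter receives the intended value.
const { segments, filledRects } = reconcileStub(subpaths, rawFilledRects, viewport);
console.log(segments === subpaths, filledRects === rawFilledRects); // true true
```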

What This Tells Us About the Test Suite

The two working test documents (AMZN, 59MN7C) do not have display-math blocks. The cases that break the fallback crossing scan are document-specific: any PDF that has full-width text items spanning the gutter. This includes:

Display-math equations in LaTeX papers
Full-width pull quotes in magazine layouts
Wide captions under double-column figures

The fix (vWidth > S * 4 filter) should treat all of these as non-column-discriminating items, since each arrives as a single item far wider than body text. It does not affect normal two-column text detection as long as body text items stay below the threshold, as they do here at 20–120px.
