Error Fix: Why the LaTeX PDF Produces Zero Column Splits
TLDR: The column detector found 0 splits on a clearly 2-column academic paper. The assumed cause (math glyph advance widths miscalibrating the font size baseline) was wrong. The real cause was that entire equation blocks arrive from getTextContent() as single items wide enough to bridge the column gutter — making every candidate split point look equally blocked.
The Symptom
Running the bipartite column detection algorithm on raiko-aistats-12.pdf (a 2-column LaTeX academic paper) detected 0 splits on every page. The expected result is one split per page at approximately X≈310.
The Wrong Diagnosis
LaTeX documents typeset mathematics using font glyphs with typographic advance widths that include italic corrections and spacing. The assumption was:
- Many small math characters per page → median font size pulled below body text size
- PageScale S = median(vFont) returns ~8pt instead of ~10pt
- All thresholds derived from S are miscalibrated (too tight)
- The column gap threshold fails to register the gutter
To test this, the diagnostic computed S two ways (as the mode and as the median of the item font sizes) independently for every page of every test document. On raiko-aistats:
Page 1: mode=9.5 median=9.4 divergence=0.1pt
Page 2: mode=9.5 median=9.4 divergence=0.1pt
Page 3: mode=9.5 median=9.4 divergence=0.1pt
...
A 0.1pt divergence does not explain a threshold failure. The calibration hypothesis is wrong for this document.
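The mode-versus-median comparison can be sketched roughly as follows. This is a minimal illustration, not the diagnostic's actual code: the helper names (`median`, `mode`, `scaleDivergence`), the 0.1pt binning used for the mode, and the sample font sizes are all assumptions made for the example.

```javascript
// Median of a list of numbers (does not mutate the input).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Mode after binning to 0.1pt, so near-identical sizes are counted together.
// The bin width is an assumption for this sketch.
function mode(values, binsPerPoint = 10) {
  const counts = new Map();
  for (const v of values) {
    const bin = Math.round(v * binsPerPoint);
    counts.set(bin, (counts.get(bin) || 0) + 1);
  }
  let bestBin = NaN;
  let bestCount = 0;
  for (const [bin, count] of counts) {
    if (count > bestCount) {
      bestBin = bin;
      bestCount = count;
    }
  }
  return bestBin / binsPerPoint;
}

// Report both estimates of S and how far apart they are.
function scaleDivergence(fontSizes) {
  const m = mode(fontSizes);
  const med = median(fontSizes);
  return { mode: m, median: med, divergence: Math.abs(m - med) };
}

// Made-up font sizes: mostly body text, a few smaller math glyphs.
const sample = [9.5, 9.5, 9.5, 9.5, 9.4, 9.4, 9.4, 7, 7, 7];
console.log(scaleDivergence(sample));
```

If small math glyphs really dominated a page, the median would drop well below the mode; a divergence near zero means both estimators agree on the body size.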
Finding the Real Cause
After ruling out calibration, the diagnostic printed item widths for each page. Display-math blocks — equations typeset in the center of the page — appeared as single text items with vWidth values of 400–500px on a ~620px viewport. Normal paragraph text items had widths of 20–120px.
The bipartite algorithm has a fallback path: when interval merge produces no candidate gap (because some items are wide enough to bridge the gutter), it falls back to a minimum-crossing scan. For each candidate X value, it counts how many text items cross that X. The X with the lowest crossing count is selected as the split.
But with one or two display-math items spanning most of the page width (400–500px of ~620px):
- Each of those items crosses X≈100, X≈200, X≈310, X≈400, and every other candidate
- The crossing count at every candidate X includes these wide items
- No candidate has a lower crossing count than any other
- The fallback scan selects nothing → 0 splits
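The effect of one wide item on the crossing counts can be illustrated with a toy page. This is a sketch under assumptions: the field name `vx` (left edge in viewport pixels), the helper names, and all coordinates are invented for the example; only `vWidth` comes from the text above.

```javascript
// Number of items whose horizontal extent strictly contains x.
function crossingCount(items, x) {
  return items.filter(it => it.vx < x && x < it.vx + it.vWidth).length;
}

// Crossing count at each candidate X.
function crossingProfile(items, candidates) {
  return candidates.map(x => ({ x, crossings: crossingCount(items, x) }));
}

// Two columns of ordinary paragraph items with a gutter around X≈310.
const paragraphItems = [
  { vx: 40, vWidth: 110 },  // left column
  { vx: 160, vWidth: 120 }, // left column
  { vx: 330, vWidth: 100 }, // right column
  { vx: 440, vWidth: 120 }, // right column
];

// One display-math block arriving as a single wide item.
const displayMath = { vx: 80, vWidth: 460 };

const candidates = [100, 200, 310, 400];
const withoutMath = crossingProfile(paragraphItems, candidates);
const withMath = crossingProfile([...paragraphItems, displayMath], candidates);

console.log(withoutMath);
console.log(withMath);
```

In this toy data the wide item adds one crossing at every candidate, so the gutter at X≈310 no longer reads as clear.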
The Fix
Pre-filter the item list before the fallback crossing scan. Items where vWidth > S * 4 are anomalously wide relative to the body font size. On a document with S=9.5pt, that threshold is roughly 9.5 × 4 × viewport_scale ≈ 190px, which corresponds to a viewport scale of about 5. The comparison is only meaningful when S is expressed in the same viewport units as vWidth. Display-math blocks at 400–500px are well above this threshold. Normal text items at 20–120px are well below it.
// In _detectPageColumns, before the fallback crossing scan
const narrowItems = textMeta.filter(tm => tm.vWidth <= S * 4);
// use narrowItems for crossing count, not textMeta
After this filter, the crossing scan sees only normal paragraph-width items. The gap at X≈310 is clear. Splits are detected.
This filter applies only to the fallback crossing scan. The interval merge stage runs on all items. The three gates (population ≥ 3, local commitment ≥ 40%, vertical persistence) run on all items. Only the fallback scan — which runs when interval merge finds no gap — gets the narrow-only view.
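The scoping described above (all items for interval merge, narrow items only for the fallback) might look roughly like this. It is a simplified sketch, not the real `_detectPageColumns`: the three gates are omitted, the helper names and the `vx` field are hypothetical, the widest gap is picked as a stand-in for the real selection logic, and `S` is assumed to be already in viewport pixels (47.5, so that `S * 4` is 190).

```javascript
// Merge horizontal extents and return the gaps between merged runs.
function findGaps(items) {
  const spans = items
    .map(it => [it.vx, it.vx + it.vWidth])
    .sort((a, b) => a[0] - b[0]);
  const gaps = [];
  let runEnd = -Infinity;
  for (const [start, end] of spans) {
    if (runEnd !== -Infinity && start > runEnd) gaps.push([runEnd, start]);
    runEnd = Math.max(runEnd, end);
  }
  return gaps;
}

// Number of items whose horizontal extent strictly contains x.
function crossingCount(items, x) {
  return items.filter(it => it.vx < x && x < it.vx + it.vWidth).length;
}

function detectSplit(textMeta, S, candidates) {
  // Stage 1: interval merge sees every item, wide ones included.
  const gaps = findGaps(textMeta);
  if (gaps.length > 0) {
    const widest = gaps.reduce((a, b) => (b[1] - b[0] > a[1] - a[0] ? b : a));
    return (widest[0] + widest[1]) / 2;
  }
  // Stage 2 (fallback only): drop anomalously wide items before counting.
  const narrowItems = textMeta.filter(tm => tm.vWidth <= S * 4);
  let best = null;
  let bestCount = Infinity;
  for (const x of candidates) {
    const count = crossingCount(narrowItems, x);
    if (count < bestCount) {
      best = x;
      bestCount = count;
    }
  }
  return best;
}

const page = [
  { vx: 40, vWidth: 110 },
  { vx: 160, vWidth: 120 },
  { vx: 330, vWidth: 100 },
  { vx: 440, vWidth: 120 },
  { vx: 80, vWidth: 460 }, // display-math block bridging the gutter
];

console.log(detectSplit(page, 47.5, [100, 200, 310, 400]));
```

Keeping the filter inside the fallback branch means a page whose gutter is already visible to interval merge is handled exactly as before.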
The Secondary Bug: reconcile() Argument Count
During diagnostic test construction, the reconciler was called with two arguments:
// Wrong
const { segments, filledRects } = reconcile(subpaths, viewport);
The actual signature is:
// Correct
reconcile(subpaths, rawFilledRects, viewport)
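The error message was TypeError: Cannot read properties of undefined (reading 'transform') at pathReconciler.js:230. The function reads viewport from its third parameter, but the two-argument call put the viewport object in the second parameter (rawFilledRects) and left the third undefined, so the access to viewport.transform threw.
Fix: pass the raw filled rectangles as the second argument. The input must use a different name from the destructured filledRects output; reusing the same name on both sides of the declaration would throw a ReferenceError.
const { segments, filledRects } = reconcile(subpaths, rawFilledRects, viewport);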
This is a test scaffolding bug, not a pipeline bug. The main pipeline in geometryWorker.js already passes all three arguments correctly.
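For test scaffolding, one way to surface this kind of mistake earlier is an arity check in front of the call. The sketch below uses a hypothetical stand-in, `reconcileStub`, because the real body of `reconcile` is not shown here; the stub only mimics the one behaviour the error message reveals (reading `transform` from the third argument). `checkedReconcile` is likewise an invented helper.

```javascript
// Hypothetical stand-in for reconcile(); the real one lives in pathReconciler.js.
function reconcileStub(subpaths, rawFilledRects, viewport) {
  const transform = viewport.transform; // throws TypeError if viewport is undefined
  return { segments: subpaths, filledRects: rawFilledRects, transform };
}

// Test-side wrapper that fails with a descriptive message on a wrong argument count.
function checkedReconcile(...args) {
  if (args.length !== 3) {
    throw new TypeError(
      `reconcile expects 3 arguments (subpaths, rawFilledRects, viewport), got ${args.length}`
    );
  }
  return reconcileStub(...args);
}

const viewport = { transform: [1, 0, 0, 1, 0, 0] };
const result = checkedReconcile([], [], viewport);
console.log(result.transform);
```

With the wrapper, the two-argument call fails at the call site with a message naming the expected parameters, instead of deep inside the reconciler.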
What This Tells Us About the Test Suite
The two working test documents (AMZN, 59MN7C) do not have display-math blocks. The cases that break the fallback crossing scan are document-specific: any PDF that has full-width text items spanning the gutter. This includes:
- Display-math equations in LaTeX papers
- Full-width pull quotes in magazine layouts
- Wide captions under double-column figures