Pdf Processor

Three Ways Multi-Column PDF Parsing Breaks (and the Fixes for Each)

2026-05-15

TLDR: Gutter detection via X-coverage histograms fails in three specific scenarios. (1) Page-frame outer border at x≈0 detected as BOX, eats all page text: guard with bx < 4%W && bw > 65%W. (2) Wide bands span both columns, hiding the gutter: post-correction removes items from fullWidthIndices after a split is found. (3) BOX regions claim right-column text before column detection runs: fallback re-runs _detectPageColumns on all textMeta when unclaimed items give no split.

Repo: tools/pdf-processor

The Pipeline

Column detection in contextClassifier runs as the fourth of five steps, in this order:

1. Classify lattice tables: claim their text items
2. Classify image regions: claim their space
3. Classify BOX regions: claim their enclosed text items
4. Run _detectPageColumns on remaining unclaimed text items
5. Patch each region's columnIndex from the found splits

Steps 1-3 happen before step 4. Anything claimed earlier is invisible to the column detector.
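To make the ordering problem concrete, here is a minimal sketch of how claiming shrinks the sample. The helper name remainingAfterClaims, the index-set representation of claims, and the sample items are all assumptions for illustration; the real contextClassifier may track claims differently.

```javascript
// Hypothetical sketch: each earlier step claims text items by index,
// and column detection only ever sees the items nobody claimed.
function remainingAfterClaims(textMeta, claimedIndexSets) {
    const claimed = new Set();
    for (const indexSet of claimedIndexSets) {
        for (const i of indexSet) claimed.add(i);
    }
    return textMeta.filter((_, i) => !claimed.has(i));
}

// Assumed sample page: two left-column steps, two right-column box lines.
const textMeta = [
    { str: "1. Remove cover", vx: 60 },
    { str: "2. Disconnect", vx: 60 },
    { str: "NOTICE", vx: 520 },
    { str: "Wear gloves", vx: 520 },
];

const tableClaims = new Set();     // step 1 claims nothing on this page
const imageClaims = new Set();     // step 2 claims nothing on this page
const boxClaims = new Set([2, 3]); // step 3: a BOX encloses both right-column items

const remainingMeta = remainingAfterClaims(textMeta, [tableClaims, imageClaims, boxClaims]);
console.log(remainingMeta.map(tm => tm.vx)); // only left-column X positions remain
```

With the right-column items gone, step 4 has no evidence that a second column exists.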

Bug 1: The Page-Frame False Positive

Every PDF page has a page border: a rectangle starting at (0,0) spanning nearly the full page width. In PDFs with colored page borders, this rectangle appears in filledRects.

The BOX classifier checks if a rectangle is large enough to contain text. The page border passes this check. It gets classified as a giant BOX region and claims every text item on the page as its content. Step 4 then runs on zero remaining items and finds no split.

The guard:

const _isPageFrame = (bx, bw) =>
    (bx < vpW * 0.04 && bw > vpW * 0.65) || bw > vpW * 0.88;

A rectangle that starts within the leftmost 4% of the page and covers more than 65% of the page width is treated as the page border, not a semantic content box. So is any rectangle wider than 88% of the page, wherever it starts.
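As a rough usage sketch, the guard can be applied as a filter before BOX classification. This version takes vpW as an explicit parameter instead of closing over it, and the viewport width of 900 and the rectangle list are made-up values, not data from the repo.

```javascript
// Same thresholds as the guard above; vpW is a parameter here (an assumption)
// so the snippet runs on its own.
const isPageFrame = (bx, bw, vpW) =>
    (bx < vpW * 0.04 && bw > vpW * 0.65) || bw > vpW * 0.88;

const vpW = 900; // assumed viewport width

// Hypothetical filled rectangles: x is the left edge, w is the width.
const rects = [
    { name: "page border", x: 0, w: 890 },
    { name: "NOTICE box", x: 520, w: 320 },
    { name: "left-anchored banner", x: 20, w: 600 },
    { name: "left-column callout", x: 60, w: 380 },
];

// Only rectangles that are not page frames go on to BOX classification.
const boxCandidates = rects.filter(r => !isPageFrame(r.x, r.w, vpW));
console.log(boxCandidates.map(r => r.name));
```

Note that the left-anchored banner is rejected too: it starts inside the 4% margin and is wider than 65% of the page. That is the trade-off of a purely geometric guard.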

Bug 2: Wide-Band False Positive

In a two-column layout, some content spans both columns, such as section headings, full-width dividers, and image captions. These appear at Y positions where both columns are active.

The coverage histogram marks an item's full X-span as covered. A full-width heading at Y=300 spans X from 0 to 900, so the histogram sees the entire page width occupied at that Y and finds no zero-coverage gap. The heading is added to fullWidthIndices. If enough bands trigger this, no gutter is found at all.
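The effect is easy to reproduce with a simplified histogram. The sketch below is not the repo's _detectPageColumns: the function findGutter, the binWidth parameter, and the sample spans are assumptions, and it collapses all items into one histogram instead of working per Y band.

```javascript
// Hypothetical helper: returns the midpoint X of the widest interior
// zero-coverage run, or null when the covered range has no gap.
function findGutter(items, pageWidth, binWidth = 10) {
    const bins = new Array(Math.ceil(pageWidth / binWidth)).fill(0);
    for (const { vx, vWidth } of items) {
        const first = Math.max(0, Math.floor(vx / binWidth));
        const last = Math.min(bins.length - 1, Math.floor((vx + vWidth) / binWidth));
        for (let b = first; b <= last; b++) bins[b]++; // mark the item's full X-span
    }
    const lo = bins.findIndex(c => c > 0);
    if (lo === -1) return null; // no text at all
    let hi = bins.length - 1;
    while (bins[hi] === 0) hi--; // last covered bin

    let best = null;
    let runStart = -1;
    for (let b = lo; b <= hi; b++) {
        if (bins[b] === 0) {
            if (runStart === -1) runStart = b;
        } else if (runStart !== -1) {
            const width = b - runStart;
            if (!best || width > best.width) best = { start: runStart, width };
            runStart = -1;
        }
    }
    return best ? (best.start + best.width / 2) * binWidth : null;
}

// Two columns with an empty band between X=400 and X=500.
const columnsOnly = [
    { vx: 60, vWidth: 340 },
    { vx: 500, vWidth: 340 },
];
// The same page plus one heading that spans X from 0 to 900.
const withHeading = [...columnsOnly, { vx: 0, vWidth: 900 }];

console.log(findGutter(columnsOnly, 900)); // a value inside the 400-500 gap
console.log(findGutter(withHeading, 900)); // null
```

A single wide item is enough to erase the gap, which is why wide items have to be set aside before the gutter search.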

The post-correction loop:

After a split IS found from the non-full-width items, scan fullWidthIndices. If an item sits entirely to one side of the split, it was never truly full-width: remove it from the full-width set and assign it the correct column:

// Iterate backward so splice() does not shift entries still to be visited.
for (let i = fullWidthIndices.length - 1; i >= 0; i--) {
    const tm = textMeta[fullWidthIndices[i]];
    const rightEdge = tm.vx + (tm.vWidth || 0);
    if (rightEdge < splitX)      { fullWidthIndices.splice(i, 1); tm.columnIndex = 0; }
    else if (tm.vx > splitX)     { fullWidthIndices.splice(i, 1); tm.columnIndex = 1; }
    // Items that straddle splitX stay in fullWidthIndices.
}
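For a quick check of the loop's behavior, it can be wrapped in a function and run on a hand-built page. The wrapper name reassignFalseFullWidth, the sample items, and the split at X=450 are assumptions for illustration only.

```javascript
// The post-correction loop, wrapped so it can run standalone.
function reassignFalseFullWidth(textMeta, fullWidthIndices, splitX) {
    for (let i = fullWidthIndices.length - 1; i >= 0; i--) {
        const tm = textMeta[fullWidthIndices[i]];
        const rightEdge = tm.vx + (tm.vWidth || 0);
        if (rightEdge < splitX)  { fullWidthIndices.splice(i, 1); tm.columnIndex = 0; }
        else if (tm.vx > splitX) { fullWidthIndices.splice(i, 1); tm.columnIndex = 1; }
    }
}

// Assumed sample: all three items were flagged full-width on the first pass.
const textMeta = [
    { str: "Installation", vx: 60, vWidth: 780 },   // straddles the split
    { str: "Left caption", vx: 60, vWidth: 300 },   // entirely left of it
    { str: "Right caption", vx: 520, vWidth: 300 }, // entirely right of it
];
const fullWidthIndices = [0, 1, 2];

reassignFalseFullWidth(textMeta, fullWidthIndices, 450);

console.log(fullWidthIndices);                  // [ 0 ]
console.log(textMeta.map(tm => tm.columnIndex)); // [ undefined, 0, 1 ]
```

Only the heading that actually crosses the split remains full-width; the two captions are reassigned to their columns.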

Bug 3: BOX Claiming Right-Column Text

The most subtle failure. A page has numbered installation steps on the left and NOTICE/IMPORTANT annotation boxes on the right.

The BOX classifier runs before column detection. It finds the NOTICE and IMPORTANT rectangles, classifies them as BOX regions, and assigns all enclosed text items to them. These items leave remainingMeta.

Step 4 now runs _detectPageColumns on only unclaimed items, which are all left-column steps. Left-column items cluster in the X range 60-200. No text appears near the right half. No gutter is found. columnSplits = []. All regions stay at columnIndex = -1. The page renders as a single column, interleaving the NOTICE box into the numbered steps mid-sentence.

The fallback:

if (columnSplits.length === 0) {
    const allNonEmpty = textMeta.filter(tm => tm.str.trim());
    if (allNonEmpty.length > remainingMeta.length + 4) {
        const { splits } = _detectPageColumns(allNonEmpty, viewport, scale);
        columnSplits.push(...splits);
    }
}

If the unclaimed items give no split but the full set of non-empty text items exceeds the unclaimed set by more than four, rerun detection on everything, including BOX-claimed items. Their X positions reveal the two-column structure.
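The fallback can be tried end to end with a stand-in detector. Everything named here is hypothetical: detectColumnsStub is a simple largest-gap finder that takes only the items (the real _detectPageColumns also takes viewport and scale), and resolveSplits is a wrapper added so the snippet runs by itself.

```javascript
// Stand-in detector (an assumption, not the repo's implementation): reports a
// split at the midpoint of the largest horizontal gap of at least minGap.
function detectColumnsStub(items, minGap = 40) {
    const spans = items
        .map(tm => [tm.vx, tm.vx + (tm.vWidth || 0)])
        .sort((a, b) => a[0] - b[0]);
    let reach = spans.length ? spans[0][1] : 0; // rightmost X covered so far
    let best = null;
    for (const [start, end] of spans) {
        const gap = start - reach;
        if (gap >= minGap && (!best || gap > best.gap)) best = { gap, x: reach + gap / 2 };
        reach = Math.max(reach, end);
    }
    return { splits: best ? [best.x] : [] };
}

// First pass on unclaimed items, then the fallback from the snippet above.
function resolveSplits(textMeta, remainingMeta, detect) {
    const columnSplits = [...detect(remainingMeta).splits];
    if (columnSplits.length === 0) {
        const allNonEmpty = textMeta.filter(tm => tm.str.trim());
        if (allNonEmpty.length > remainingMeta.length + 4) {
            columnSplits.push(...detect(allNonEmpty).splits);
        }
    }
    return columnSplits;
}

// Assumed page: three unclaimed left-column steps, five BOX-claimed lines.
const leftSteps = [1, 2, 3].map(n => ({ str: `Step ${n}`, vx: 60, vWidth: 140 }));
const boxLines = [1, 2, 3, 4, 5].map(n => ({ str: `Notice line ${n}`, vx: 520, vWidth: 300 }));
const textMeta = [...leftSteps, ...boxLines];
const remainingMeta = leftSteps;

console.log(detectColumnsStub(remainingMeta).splits);                    // []
console.log(resolveSplits(textMeta, remainingMeta, detectColumnsStub)); // [ 360 ]
```

With only four BOX-claimed lines on this sample page the size check would fail and the result would stay empty, so the + 4 margin decides how eagerly the fallback fires.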

All Three on One Real PDF

On a 76-page engineering specification PDF, all three bugs fired on different pages:

Page 1: page-frame outer border claimed all text (Bug 1)
Pages 3-4: wide section headings blocked gutter detection (Bug 2)
Page 5: NOTICE boxes claimed all right-column text (Bug 3)

After all three fixes, each page resolved its column structure correctly. The HTML output preserved the left-column reading order without annotation boxes interrupting numbered installation steps.

The Lesson

Coverage-histogram column detection is fragile because steps 1-3 modify remainingMeta before step 4 runs. Each step that claims items silently distorts the sample that column detection sees.

The fixes restore the correct sample: guard against false BOX classifications (Bug 1), correct false positives after the split is found (Bug 2), and fall back to the full sample when the reduced one gives no result (Bug 3).
