Pdf Processor

Why Zero-Coverage Gap Detection Fails for Multi-Column PDFs (And What to Do Instead)

2026-05-11

The Problem

You have a two-column PDF page. Every text item has a precise X position in viewport space. You want to find the empty strip between the two columns, the gutter, so you can split the page into a left side and a right side before building the HTML.

The straightforward approach: build a coverage array across the page width. For every text item, mark every pixel it spans. Find a range with zero coverage. That is the gutter.

This works for toy examples. It fails immediately on real engineering documents.

The Naive Attempt

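const coverage = new Float32Array(pageWidth);
for (const item of textItems) {
    // Clamp to the array bounds so items overhanging the page edge are handled explicitly.
    const x1 = Math.max(0, Math.floor(item.vx));
    const x2 = Math.min(pageWidth - 1, Math.ceil(item.vx + item.vWidth));
    // Every item adds one count to each pixel it spans.
    for (let x = x1; x <= x2; x++) coverage[x]++;
}
// find first range where coverage[x] === 0 for minGap pixels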

For a page with clean two-column text and nothing else, this works. The left column covers x=54 to x=440. The right column covers x=473 to x=865. The gap at x=440 to x=473 has zero coverage. Split at x=456.
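The snippet leaves the gap search itself as a comment. A minimal sketch of that search follows, assuming a hypothetical helper named findZeroGap, a minGap of 10 pixels, and a 918-pixel page width (the helper and the minGap value are not from the original code). It skips the page margins so that only an interior run of zeros counts as a gutter.

```javascript
// Hypothetical helper (not from the original code): returns the first interior
// run of at least minGap zero-coverage pixels, or null if there is none.
function findZeroGap(coverage, minGap) {
    // Skip the left margin so it is not mistaken for a gutter.
    let x = 0;
    while (x < coverage.length && coverage[x] === 0) x++;

    let runStart = -1;
    for (; x < coverage.length; x++) {
        if (coverage[x] === 0) {
            if (runStart === -1) runStart = x;
        } else {
            if (runStart !== -1 && x - runStart >= minGap) {
                const end = x - 1;
                return { start: runStart, end, split: Math.floor((runStart + end) / 2) };
            }
            runStart = -1;
        }
    }
    return null; // a trailing run of zeros is the right margin, not a gutter
}

// The same marking loop as the naive attempt, wrapped in a function for the demo.
function buildCoverage(textItems, pageWidth) {
    const coverage = new Float32Array(pageWidth);
    for (const item of textItems) {
        const x1 = Math.max(0, Math.floor(item.vx));
        const x2 = Math.min(pageWidth - 1, Math.ceil(item.vx + item.vWidth));
        for (let x = x1; x <= x2; x++) coverage[x]++;
    }
    return coverage;
}

// Illustrative items matching the clean two-column case described above.
const cleanPage = [
    { vx: 54, vWidth: 386 },  // left column, x=54 to x=440
    { vx: 473, vWidth: 392 }, // right column, x=473 to x=865
];
const cleanGap = findZeroGap(buildCoverage(cleanPage, 918), 10);

// One extra item straddling the gutter removes the gap entirely.
const straddledPage = [...cleanPage, { vx: 430, vWidth: 60 }];
const straddledGap = findZeroGap(buildCoverage(straddledPage, 918), 10);
```

On the clean page the helper reports a split at x=456. Adding a single straddling item makes it return null, which is the failure mode the rest of this post deals with.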

On a real HVAC installation manual with warning boxes, spanning headings, and instructional paragraphs, the coverage array has no gaps at all. Every pixel from x=54 to x=900 is covered by something.

Why It Broke

Problem 1: One paragraph spans the full width.

A WARNING paragraph contains 37 text items. The first item is at x=57. The last item is at x=764. Taken together they cover nearly the full page width, so the gutter at x=440 to x=473 gets filled by the warning text that straddles it.

First fix attempt: filter out items wider than 55% of the page before building coverage. Items with vWidth >= pageWidth * 0.55 are treated as full-width content and excluded.

This does not fix it, because the warning items are each narrow. A single word is maybe 60 pixels wide. The items are not wide. The line they form is wide.
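A quick check makes the failure concrete. The sketch below uses made-up item positions (12 words, each 60 pixels wide, laid out from x=57) rather than the real 37 items, and assumes a 918-pixel page width. The per-item filter removes nothing, even though the line as a whole spans most of the page.

```javascript
const PAGE_WIDTH = 918;      // assumed page width in viewport pixels
const WIDE_ITEM_FRAC = 0.55; // the per-item threshold from the first fix attempt

// Made-up stand-in for one line of the warning paragraph: 12 narrow words.
const warningLine = Array.from({ length: 12 }, (_, i) => ({
    vx: 57 + i * 64,
    vWidth: 60,
}));

// Per-item filter: selects only items that are individually wide.
const wideItems = warningLine.filter(
    (item) => item.vWidth >= PAGE_WIDTH * WIDE_ITEM_FRAC
);

// Span of the whole line, from the leftmost edge to the rightmost edge.
const lineMinX = Math.min(...warningLine.map((item) => item.vx));
const lineMaxX = Math.max(...warningLine.map((item) => item.vx + item.vWidth));
const lineSpanFrac = (lineMaxX - lineMinX) / PAGE_WIDTH;

console.log(wideItems.length);              // how many items the filter would exclude
console.log(lineSpanFrac > WIDE_ITEM_FRAC); // whether the line as a whole is wide
```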

Problem 2: Bands aggregate across column boundaries.

A Y-band is a group of text items at the same visual line height, within a few pixels of each other in the Y direction. In a two-column layout, items from both columns that happen to share the same visual row land in the same Y-band. The left-column item at y=192, x=54 and the right-column item at y=191, x=473 are in the same band because 192 minus 191 is 1 pixel, which is within the 5.4px Y tolerance.

The band's total X span runs from the left item's start to the right item's end: 473 plus the right item's width, minus 54, which is easily 400+ pixels. The band looks wide even though neither of its items individually is.
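The banding step itself is not shown in this post. As a rough sketch, assuming each item carries a vy field for its viewport Y position (a hypothetical name, by analogy with vx) and a simple greedy grouping, it could look like the following. The function names groupIntoYBands and bandSpan are illustrative, and the item widths in the demo are made up.

```javascript
// Hypothetical grouping step: vy and the greedy strategy are assumptions.
function groupIntoYBands(textItems, yTolerance = 5.4) {
    // Sort top to bottom so each band can be built in one pass.
    const sorted = [...textItems].sort((a, b) => a.vy - b.vy);
    const bands = [];
    for (const item of sorted) {
        const band = bands[bands.length - 1];
        // Compare against the band's first item so a band cannot drift downward.
        if (band && Math.abs(item.vy - band.y) <= yTolerance) {
            band.items.push(item);
        } else {
            bands.push({ y: item.vy, items: [item] });
        }
    }
    return bands;
}

// Total X span of a band: leftmost start to rightmost end.
function bandSpan(band) {
    let minX = Infinity;
    let maxX = -Infinity;
    for (const item of band.items) {
        minX = Math.min(minX, item.vx);
        maxX = Math.max(maxX, item.vx + (item.vWidth || 0));
    }
    return maxX - minX;
}

// The two same-row items from the text, plus one item on a later line.
const demoItems = [
    { vx: 54, vy: 192, vWidth: 120 },  // left column
    { vx: 473, vy: 191, vWidth: 150 }, // right column, 1px higher
    { vx: 54, vy: 210, vWidth: 200 },  // next line, left column only
];
const demoBands = groupIntoYBands(demoItems);
const mixedSpan = bandSpan(demoBands[0]); // 473 + 150 - 54
```

The first band ends up holding both the left and the right item, and its span is far larger than either item's own width.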

Problem 3: Dense bands overpower sparse bands.

Even after the wide-band filter, item density is uneven: one band may hold 8 items in the left column while another band holds a single item in the right column. If you count items, the left column can get 8x more coverage weight than the right column. This distorts where the valleys appear.

The Math Behind the Fix

The key insight: you need to count how many distinct visual lines have content near each X pixel, not how many items.

Define bandCount[x] as the number of narrow Y-bands that have at least one item covering pixel x. "Narrow band" means a band whose total X span is less than 55% of the page width.

const bandCount = new Float32Array(pageWidth);
for (const band of narrowBands) {
    const seen = new Uint8Array(pageWidth);  // boolean, one per band
    for (const item of band.items) {
        const x1 = Math.max(0, Math.floor(item.vx));
        const x2 = Math.min(pageWidth - 1, Math.ceil(item.vx + item.vWidth));
        for (let x = x1; x <= x2; x++) seen[x] = 1;
    }
    for (let x = 0; x < pageWidth; x++) bandCount[x] += seen[x];
}

The critical difference from the item-count approach: seen[x] = 1, not seen[x]++. A band with 10 items covering the same range contributes exactly the same as a band with 1 item. Dense bands and sparse bands cast equal votes.

The gutter threshold:

A column gutter is an X range where bandCount[x] < N * 0.20, where N is the number of narrow bands.

Why 20%? In a two-column layout with balanced content:

Each column occupies roughly 40-50% of the bands
The gutter has near-zero bands (items from one column that extend slightly past their edge)
A threshold of 20% is low enough to capture the gutter but high enough to survive a few stray items that cross the column edge

Empirical validation on the HVAC manual page 4:

52 narrow bands
Threshold: 52 * 0.20 = 10.4
bandCount at x=200: 19 (dense left column)
bandCount at x=460: 1 (the gutter -- only one stray band reaches this far)
bandCount at x=473: 15 (right column starts here)

Gap found at x=444 to x=472, 28 pixels wide. Split midpoint at 458.
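The threshold search can be sketched as follows. This is an illustration under stated assumptions, not the production code: findGutter is a hypothetical name, minGap defaults to an assumed 10 pixels, and the profile below is synthetic, shaped after the figures quoted above rather than taken from the real page. The margins are trimmed first because they also fall below the threshold.

```javascript
// Hypothetical helper: finds the widest interior run where
// bandCount[x] < narrowBandTotal * frac. minGap is an assumed parameter.
function findGutter(bandCount, narrowBandTotal, minGap = 10, frac = 0.20) {
    const threshold = narrowBandTotal * frac;

    // Trim the page margins: they are below threshold but are not gutters.
    let lo = 0;
    let hi = bandCount.length - 1;
    while (lo <= hi && bandCount[lo] < threshold) lo++;
    while (hi >= lo && bandCount[hi] < threshold) hi--;

    let best = null;
    let runStart = -1;
    for (let x = lo; x <= hi; x++) {
        if (bandCount[x] < threshold) {
            if (runStart === -1) runStart = x;
        } else if (runStart !== -1) {
            const end = x - 1;
            const pixels = end - runStart + 1;
            if (pixels >= minGap && (best === null || pixels > best.pixels)) {
                best = { start: runStart, end, pixels, split: Math.round((runStart + end) / 2) };
            }
            runStart = -1;
        }
    }
    return best;
}

// Synthetic profile shaped after the figures quoted above (not real page data).
const profile = new Float32Array(918);
profile.fill(19, 54, 444);  // dense left column, x=54..443
profile.fill(1, 444, 473);  // one stray band across the gutter, x=444..472
profile.fill(15, 473, 866); // right column, x=473..865

const gutter = findGutter(profile, 52); // threshold = 52 * 0.20 = 10.4
```

With this profile the helper returns start 444, end 472, and a split at 458. The single stray band does not hide the gutter, because 1 is well below 10.4.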

The Wide-Band Filter

Before the gap detection, wide bands (Y-bands with total X span > 55% of page width) are separated from narrow bands. Their items are marked as fullWidthIndices.

const WIDE_BAND_FRAC = 0.55;
const fullWidthIndices = new Set();
const narrowBands = [];
for (const band of bands) {
    let minX = Infinity, maxX = -Infinity;
    for (const item of band.items) {
        if (item.vx < minX) minX = item.vx;
        const re = item.vx + (item.vWidth || 0);
        if (re > maxX) maxX = re;
    }
    if (maxX - minX > pageWidth * WIDE_BAND_FRAC) {
        for (const item of band.items) fullWidthIndices.add(item.idx);
    } else {
        narrowBands.push(band);
    }
}

Items in fullWidthIndices are classified as columnIndex: -1 (full-width zone dividers). They get emitted outside the column wrapper in the HTML assembler.

Why 55%? It needs to be:

Low enough to catch lines where both columns contribute to the same visual row (combined span ~50%)
High enough not to catch lines from just one column (on the symmetric page above, a single column spans 386 of 918 pixels, about 42%; a 60/40 split, whose left column spans 60% of the page, is already beyond what this threshold handles)

55% threads that needle for standard two-column engineering documents. A 60/40 layout would have the left column taking up to x=550 out of x=918, which is 59.9%. With a 55% threshold, the left column items would be incorrectly marked as full-width on some lines. This is a known limitation of the current implementation.

What Gets Tagged and What Does Not

Every region in the output now carries a columnIndex:

| Region type | columnIndex |
|---|---|
| Table, bbox.w >= 65% of page | -1 (full-width) |
| Table, bbox.w < 65% of page | patched to column N from bbox center |
| Image | -1 (full-width, default) |
| Text from wide band | -1 (zone divider) |
| Text from narrow band, vx < split | 0 (left column) |
| Text from narrow band, vx >= split | 1 (right column) |
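The rules in the table can be read as a single classification function. The sketch below is one possible shape for it, not the actual implementation: the function name columnIndexFor and the fields type, bbox.x, and fromWideBand are hypothetical, and "bbox center" is assumed to mean the horizontal center compared against the split position.

```javascript
const FULL_WIDTH = -1;
const WIDE_TABLE_FRAC = 0.65; // table threshold from the table above

// Hypothetical classifier: region.type, region.bbox.x and region.fromWideBand
// are assumed field names, not taken from the original code.
function columnIndexFor(region, splitX, pageWidth) {
    switch (region.type) {
        case "table": {
            if (region.bbox.w >= pageWidth * WIDE_TABLE_FRAC) return FULL_WIDTH;
            // Narrow table: assign by the horizontal center of its bounding box.
            const centerX = region.bbox.x + region.bbox.w / 2;
            return centerX < splitX ? 0 : 1;
        }
        case "image":
            return FULL_WIDTH; // images default to full-width
        case "text":
            if (region.fromWideBand) return FULL_WIDTH; // zone divider
            return region.vx < splitX ? 0 : 1;
        default:
            return FULL_WIDTH; // assumption: unknown region types are not split
    }
}

// Example with the split at x=458 on a 918-pixel page.
const ciWideTable = columnIndexFor({ type: "table", bbox: { x: 60, w: 700 } }, 458, 918);
const ciRightTable = columnIndexFor({ type: "table", bbox: { x: 500, w: 300 } }, 458, 918);
const ciLeftText = columnIndexFor({ type: "text", vx: 57, fromWideBand: false }, 458, 918);
const ciRightText = columnIndexFor({ type: "text", vx: 490, fromWideBand: false }, 458, 918);
const ciWideText = columnIndexFor({ type: "text", vx: 57, fromWideBand: true }, 458, 918);
const ciImage = columnIndexFor({ type: "image" }, 458, 918);
```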

The region sort order (by yCenter) is preserved. The assembler reads this sequence and outputs:

ci=-1  →  emit full-width
ci=0   →  start buffering left column
ci=1   →  start buffering right column
ci=-1  →  flush both column buffers into a pdf-row div, then emit full-width
ci=0   →  start new column zone...
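One way to turn that sequence into markup is a small buffering loop. The sketch below is an assumption about how such a step could look, not the contents of pageAssembler.js: assembleRows is a hypothetical name and each region is assumed to carry its rendered markup in an html field.

```javascript
// Hypothetical assembler step: region.html is an assumed field holding the
// already-rendered markup for that region.
function assembleRows(regions) {
    const out = [];
    let left = [];
    let right = [];

    // Emit the buffered columns as one row, if anything is buffered.
    const flush = () => {
        if (left.length === 0 && right.length === 0) return;
        out.push(
            '<div class="pdf-row">' +
            '<div class="pdf-col">' + left.join("") + "</div>" +
            '<div class="pdf-col">' + right.join("") + "</div>" +
            "</div>"
        );
        left = [];
        right = [];
    };

    for (const region of regions) {
        if (region.columnIndex === -1) {
            flush();               // a full-width region closes the current zone
            out.push(region.html);
        } else if (region.columnIndex === 0) {
            left.push(region.html);
        } else {
            right.push(region.html);
        }
    }
    flush(); // close a column zone that runs to the end of the page
    return out.join("\n");
}

// Regions already sorted by yCenter, as described above.
const assembled = assembleRows([
    { columnIndex: -1, html: "<h1>Title</h1>" },
    { columnIndex: 0, html: "<p>Left</p>" },
    { columnIndex: 1, html: "<p>Right</p>" },
    { columnIndex: -1, html: "<p>Warning</p>" },
]);
```

Buffering both columns until the next full-width region keeps each column's items together in source order, so the left column reads top to bottom before the right column begins.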

Tradeoffs and Known Limitations

The 20% threshold is empirical. It was validated on one HVAC installation manual. A page that is mostly full-width text with only a few column items has a low narrow-band count, which makes the 20% threshold very strict. A page with 5 narrow bands has a threshold of 1.0, so bandCount must be exactly 0 across the gutter: a single stray band crossing it is enough to stop the range from being classified as a gap.

The 55% wide-band filter breaks for asymmetric layouts. A page where the left column takes up 60% of the width will have left-column bands spanning 60%, above the 55% threshold, and those bands would be incorrectly excluded from the narrow set. The gutter might then be detected at the wrong position or not at all.

Same-row items from both columns always form mixed bands. This is unavoidable with the Y-band approach, and the per-band counting handles it correctly for gap detection. If a band contains an item at x=57 (left) and an item at x=490 (right) and their combined span is below 55%, the band stays in the narrow set and each item marks only its own X range, so the gutter between them remains clear.

What is not yet done: the HTML assembler (pageAssembler.js) still emits regions in flat order. The Y-zone segmentation step, which reads the columnIndex sequence and wraps column zones in <div class="pdf-row"> with <div class="pdf-col"> children, is the next implementation step.
