Why Zero-Coverage Gap Detection Fails for Multi-Column PDFs (And What to Do Instead)
The Problem
You have a two-column PDF page. Every text item has a precise X position in viewport space. You want to find the empty strip between the two columns, the gutter, so you can split the page into a left side and a right side before building the HTML.
The straightforward approach: build a coverage array across the page width. For every text item, mark every pixel it spans. Find a range with zero coverage. That is the gutter.
This works for toy examples. It fails immediately on real engineering documents.
The Naive Attempt
const coverage = new Float32Array(pageWidth);
for (const item of textItems) {
  const x1 = Math.floor(item.vx);
  const x2 = Math.ceil(item.vx + item.vWidth);
  for (let x = x1; x <= x2; x++) coverage[x]++;
}
// find first range where coverage[x] === 0 for minGap pixels
For a page with clean two-column text and nothing else, this works. The left column covers x=54 to x=440. The right column covers x=473 to x=865. The gap at x=440 to x=473 has zero coverage. Split at x=456.
On a real HVAC installation manual with warning boxes, spanning headings, and instructional paragraphs, the coverage array has no gaps at all. Every pixel from x=54 to x=900 is covered by something.
Why It Broke
Problem 1: One line spans the full width.
A WARNING paragraph contains 37 text items. The first item is at x=57. The last item is at x=764. Together they span most of the page width, so the gutter at x=440 to x=473 gets filled by the warning text that straddles it.
First fix attempt: filter out items wider than 55% of the page before building coverage. Items with vWidth >= pageWidth * 0.55 are treated as full-width content and excluded.
This does not fix it, because the warning items are each narrow. A single word is maybe 60 pixels wide. The items are not wide. The line they form is wide.
Problem 2: Bands aggregate across column boundaries.
A Y-band is a group of text items at the same visual line height, within a few pixels of each other in the Y direction. In a two-column layout, items from both columns that happen to share the same visual row land in the same Y-band. The left-column item at y=192, x=54 and the right-column item at y=191, x=473 are in the same band because 192 minus 191 is 1 pixel, which is within the 5.4px Y tolerance.
The band's total X span runs from the left item's start to the right item's end: 473 plus the right item's width, minus 54. That is at least 419 pixels, so the band looks wide even though neither of its items individually is.
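The band grouping described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the `vy` field, the function names `groupIntoYBands` and `bandSpan`, and the choice to compare each item against the first item of the current band are all assumptions. Only `vx`, `vWidth`, and the 5.4px tolerance come from the text.

```javascript
// Hypothetical sketch: group text items into Y-bands.
// Assumes each item has vx, vWidth, and a vy field (vy is an assumed name).
function groupIntoYBands(items, yTolerance = 5.4) {
  // Sort top to bottom so items on the same visual line are adjacent.
  const sorted = [...items].sort((a, b) => a.vy - b.vy);
  const bands = [];
  for (const item of sorted) {
    const band = bands[bands.length - 1];
    // Compare against the band's anchor Y (its first item) so that a long
    // chain of small Y steps cannot drift into one oversized band.
    if (band && Math.abs(item.vy - band.y) <= yTolerance) {
      band.items.push(item);
    } else {
      bands.push({ y: item.vy, items: [item] });
    }
  }
  return bands;
}

// Total X span of a band: leftmost start to rightmost end.
function bandSpan(band) {
  let minX = Infinity;
  let maxX = -Infinity;
  for (const item of band.items) {
    minX = Math.min(minX, item.vx);
    maxX = Math.max(maxX, item.vx + (item.vWidth || 0));
  }
  return maxX - minX;
}
```

With a left-column item at y=192, x=54 and a right-column item at y=191, x=473, both land in one band, and the band's span is far wider than either item.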
Problem 3: Dense bands overpower sparse bands.
Even after the wide-band filter, item density varies from band to band: one band may have 8 items in the left column while another has a single item in the right column. If you count items, dense bands contribute far more coverage weight than sparse ones, which distorts where the valleys appear.
The Math Behind the Fix
The key insight: you need to count how many distinct visual lines have content near each X pixel, not how many items.
Define bandCount[x] as the number of narrow Y-bands that have at least one item covering pixel x. "Narrow band" means a band whose total X span is less than 55% of the page width.
const bandCount = new Float32Array(pageWidth);
for (const band of narrowBands) {
  const seen = new Uint8Array(pageWidth); // boolean, one per band
  for (const item of band.items) {
    const x1 = Math.max(0, Math.floor(item.vx));
    const x2 = Math.min(pageWidth - 1, Math.ceil(item.vx + item.vWidth));
    for (let x = x1; x <= x2; x++) seen[x] = 1;
  }
  for (let x = 0; x < pageWidth; x++) bandCount[x] += seen[x];
}
The critical difference from the item-count approach: seen[x] = 1 not seen[x]++. A band with 10 items covering the same range contributes exactly the same as a band with 1 item. Dense bands and sparse bands are equal votes.
The gutter threshold:
A column gutter is an X range where bandCount[x] < N * 0.20, where N is the number of narrow bands.
Why 20%? In a two-column layout with balanced content:
- Each column occupies roughly 40-50% of the bands
- The gutter has near-zero bands (items from one column that extend slightly past their edge)
- A threshold of 20% is low enough to capture the gutter but high enough to survive a few stray items that cross the column edge
Worked example:
- 52 narrow bands
- Threshold: 52 * 0.20 = 10.4
- bandCount at x=200: 19 (dense left column)
- bandCount at x=460: 1 (the gutter -- only one stray band reaches this far)
- bandCount at x=473: 15 (right column starts here)
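One way to turn the threshold rule into a scan is sketched below. It is an illustration under stated assumptions, not the original implementation: the function name `findGutters`, the `minGap` parameter and its default of 8 pixels, the margin trimming, and the midpoint split are all choices made for this example.

```javascript
// Hypothetical sketch: find X ranges where bandCount stays below
// thresholdFrac * narrowBandCount for at least minGap pixels.
function findGutters(bandCount, narrowBandCount, thresholdFrac = 0.20, minGap = 8) {
  const threshold = narrowBandCount * thresholdFrac;

  // Trim the page margins: they are also low-coverage, but they are not
  // gutters. Only scan between the first and last well-covered pixel.
  let first = 0;
  while (first < bandCount.length && bandCount[first] < threshold) first++;
  let last = bandCount.length - 1;
  while (last >= 0 && bandCount[last] < threshold) last--;

  const gutters = [];
  let runStart = -1;
  for (let x = first; x <= last; x++) {
    if (bandCount[x] < threshold) {
      if (runStart === -1) runStart = x; // a low-coverage run begins
    } else {
      if (runStart !== -1 && x - runStart >= minGap) {
        const end = x - 1;
        gutters.push({ start: runStart, end, split: Math.floor((runStart + end) / 2) });
      }
      runStart = -1;
    }
  }
  return gutters;
}
```

Because the scan ends on a well-covered pixel, every low-coverage run inside the trimmed range is closed by the `else` branch, so no trailing flush is needed.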
The Wide-Band Filter
Before the gap detection, wide bands (Y-bands with total X span > 55% of page width) are separated from narrow bands. Their items are marked as fullWidthIndices.
const WIDE_BAND_FRAC = 0.55;
const fullWidthIndices = new Set();
const narrowBands = [];
for (const band of bands) {
  let minX = Infinity, maxX = -Infinity;
  for (const item of band.items) {
    if (item.vx < minX) minX = item.vx;
    const re = item.vx + (item.vWidth || 0);
    if (re > maxX) maxX = re;
  }
  if (maxX - minX > pageWidth * WIDE_BAND_FRAC) {
    for (const item of band.items) fullWidthIndices.add(item.idx);
  } else {
    narrowBands.push(band);
  }
}
Items in fullWidthIndices are classified as columnIndex: -1 (full-width zone dividers). They get emitted outside the column wrapper in the HTML assembler.
Why 55%? It needs to be:
- Low enough to catch lines where both columns contribute to the same visual row (combined span ~50%)
- High enough not to catch bands that sit entirely inside one column of a mildly asymmetric layout (a column spanning 50% of the page would be incorrectly excluded at a threshold below 50%). A 60/40 split still exceeds 55%; see the limitations below.
What Gets Tagged and What Does Not
Every region in the output now carries a columnIndex:
| Region type | columnIndex |
|---|---|
| Table, bbox.w >= 65% of page | -1 (full-width) |
| Table, bbox.w < 65% of page | patched to column N from bbox center |
| Image | -1 (full-width, default) |
| Text from wide band | -1 (zone divider) |
| Text from narrow band, vx < split | 0 (left column) |
| Text from narrow band, vx >= split | 1 (right column) |
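The tagging rules in the table could be expressed as a single function along these lines. This is a sketch for a two-column page with one split position. The region shape (`type`, `bbox.x`, `idx`, `vx`), the name `columnIndexFor`, and the fallback for unknown types are assumptions; the 65% table cutoff and the `fullWidthIndices` set come from the text.

```javascript
// Hypothetical sketch of the tagging rules for a two-column page.
// splitX is the detected gutter split; fullWidthIndices is the Set built
// by the wide-band filter.
function columnIndexFor(region, pageWidth, splitX, fullWidthIndices) {
  switch (region.type) {
    case "image":
      return -1; // images default to full-width
    case "table": {
      if (region.bbox.w >= pageWidth * 0.65) return -1;
      // Narrower tables are assigned by the X center of their bounding box.
      const centerX = region.bbox.x + region.bbox.w / 2;
      return centerX < splitX ? 0 : 1;
    }
    case "text":
      if (fullWidthIndices.has(region.idx)) return -1; // wide-band text
      return region.vx < splitX ? 0 : 1;
    default:
      return -1; // assumption: unknown region types are treated as full-width
  }
}
```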
The region sort order (by yCenter) is preserved. The assembler is meant to read this sequence and emit output as follows:
ci=-1 → emit full-width
ci=0 → start buffering left column
ci=1 → start buffering right column
ci=-1 → flush both column buffers into a pdf-row div, then emit full-width
ci=0 → start new column zone...
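The sequence above can be sketched as a small state machine. This is not the code in pageAssembler.js; it only illustrates the buffering and flushing behavior. It assumes each region already carries a rendered `html` string (a hypothetical field) alongside `columnIndex`, and that there are exactly two columns.

```javascript
// Hypothetical sketch: wrap runs of column regions in a pdf-row div.
function assembleZones(regions) {
  const out = [];
  let left = [];
  let right = [];

  // Emit the buffered column zone, if any, and reset the buffers.
  const flush = () => {
    if (left.length === 0 && right.length === 0) return;
    out.push(
      '<div class="pdf-row">' +
        '<div class="pdf-col">' + left.join("") + "</div>" +
        '<div class="pdf-col">' + right.join("") + "</div>" +
      "</div>"
    );
    left = [];
    right = [];
  };

  for (const region of regions) {
    if (region.columnIndex === -1) {
      flush();               // a full-width region closes the open zone
      out.push(region.html);
    } else if (region.columnIndex === 0) {
      left.push(region.html);
    } else {
      right.push(region.html);
    }
  }
  flush(); // close a zone that runs to the end of the page
  return out.join("\n");
}
```

Within one zone the left and right buffers fill independently, so the yCenter ordering is preserved inside each column even when left and right regions interleave in the input.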
Tradeoffs and Known Limitations
The 20% threshold is empirical. It was validated on one HVAC installation manual. A document with very few column-specific bands (say, a page that is mostly full-width text with just a few column items) would have a low total narrow band count, making the 20% threshold very strict. A page with 5 narrow bands has a threshold of 1.0, and because the test is bandCount[x] < 1.0, even a single stray band covering the gutter would prevent it from being classified as a gap.
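To make the arithmetic concrete, the helper below computes how many stray bands a gutter pixel can carry and still pass the strict `bandCount[x] < N * 0.20` test. The function name `maxStrayBands` is made up for this illustration.

```javascript
// Hypothetical helper: the largest integer bandCount that is still
// strictly below the gutter threshold for a given narrow-band count.
function maxStrayBands(narrowBandCount, thresholdFrac = 0.20) {
  const threshold = narrowBandCount * thresholdFrac;
  // Largest integer strictly less than threshold, floored at zero.
  return Math.max(0, Math.ceil(threshold) - 1);
}

console.log(maxStrayBands(52)); // 10 (threshold 10.4)
console.log(maxStrayBands(5));  // 0  (threshold 1.0)
```

With only 5 narrow bands, the gutter must be completely empty to be detected, so sparse pages may need a floor on the threshold or a different rule.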
The 55% wide-band filter breaks for asymmetric layouts. A page where the left column takes up 60% of the width will have left-column bands spanning 60%, above the 55% threshold, and those bands would be incorrectly excluded from the narrow set. The gutter might then be detected at the wrong position or not at all.
Same-row items from both columns always form mixed bands. This is unavoidable with the Y-band approach. The per-band counting handles it correctly for the gap detection, but the band still exists as a "mixed" entity. If the band contains an item at x=57 (left) and an item at x=490 (right) and their combined span is below 55%, both items go into the narrow band set and into the coverage of their respective column ranges, with the gutter in between still clear.
What is not yet done: the HTML assembler (pageAssembler.js) still emits regions in flat order. The Y-zone segmentation step, which reads the columnIndex sequence and wraps column zones in <div class="pdf-row"> with <div class="pdf-col"> children, is the next implementation step.