Engineering Journal
Pdf Processor

Stop Auto-Detecting PDF Columns. Use a Zone Model Instead.

2026-05-15

TLDR: Coverage-histogram gutter detection fails when BOX regions claim text first, when wide bands span both columns, or when page frames match the BOX classifier. After three targeted patches the heuristic was still fragile. The zone model replaces it: each page section stores data-zones JSON with Y-ranges and column counts. A toolbar lets users cycle any zone's column count from 1 to 4. X-based assignment (Math.floor(rx / pageWidth * cols)) replaces histogram guesswork.

Repo: tools/pdf-processor

Why the Histogram Kept Breaking

_detectPageColumns builds an X-coverage histogram across all unclaimed text items and looks for zero-coverage gaps. Three separate failure modes required three separate patches:

  1. Page-frame BOX consumed all text before column detection ran
  2. Wide-band items (spanning both columns at the same Y) hid the gutter
  3. Right-column BOX regions claimed all right-side text, leaving only left-column items in the sample

Each fix added complexity. After all three, edge cases remained. The root problem: the heuristic is trying to infer intent from geometry that was partially consumed before it ran.
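
To make failure modes 2 and 3 concrete, here is a minimal sketch of coverage-histogram gutter detection. It is not the actual _detectPageColumns code: the function name findGutters, the { x0, x1 } item shape, and the 2-unit bin size are assumptions made for illustration.

```javascript
// Hypothetical sketch, not the real _detectPageColumns.
// items: [{ x0, x1 }] horizontal extents of unclaimed text items, in page units.
function findGutters(items, pageWidth, binSize = 2) {
    const bins = new Array(Math.ceil(pageWidth / binSize)).fill(0);
    for (const { x0, x1 } of items) {
        const first = Math.max(0, Math.floor(x0 / binSize));
        const last = Math.min(bins.length - 1, Math.floor(x1 / binSize));
        for (let b = first; b <= last; b++) bins[b]++;
    }

    // A gutter is a run of empty bins with covered bins on both sides.
    // Empty runs touching the page edge are margins and are ignored.
    const gutters = [];
    let start = -1;
    let seenCoverage = false;
    for (let b = 0; b < bins.length; b++) {
        if (bins[b] === 0) {
            if (seenCoverage && start === -1) start = b;
        } else {
            if (start !== -1) gutters.push({ x0: start * binSize, x1: b * binSize });
            start = -1;
            seenCoverage = true;
        }
    }
    return gutters;
}

const twoColumns = [{ x0: 10, x1: 44 }, { x0: 56, x1: 90 }];
console.log(findGutters(twoColumns, 100));                           // one gutter, x 46 to 56
console.log(findGutters([...twoColumns, { x0: 10, x1: 90 }], 100));  // wide band: no gutter
console.log(findGutters([twoColumns[0]], 100));                      // right column claimed: no gutter
```

A single wide item, or the loss of one column's items to a BOX region, is enough to erase the zero-coverage gap, which is why patching the histogram never fully converged.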

The Zone Model

Instead of auto-detecting columns, store the layout explicitly on each page section:

<section class="pdf-page-content"
         data-page="3"
         data-page-width="918"
         data-zones='[{"y0":0,"y1":180,"cols":2},{"y0":180,"y1":320,"cols":1},{"y0":320,"y1":99999,"cols":2}]'>

Each zone is a Y-range with a column count. The assembler groups regions by zone, assigns column index by X position, and renders them in the correct column container. No histogram. No gutter detection.
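
The grouping step might look like the sketch below. The helper names (parseZones, zoneIndexFor, groupByZone) are hypothetical, and two assumptions go beyond the post: zones are treated as half-open ranges (y0 <= y < y1) and as contiguous, so a Y value outside every range is clamped to the first or last zone.

```javascript
// Hypothetical helpers; the assembler's real function names are not shown in the post.
function parseZones(json) {
    // Sort by y0 so a zone's index matches its top-to-bottom order.
    return JSON.parse(json).slice().sort((a, b) => a.y0 - b.y0);
}

// Assumption: zones are half-open [y0, y1) and contiguous.
function zoneIndexFor(zones, y) {
    const i = zones.findIndex(z => y >= z.y0 && y < z.y1);
    if (i !== -1) return i;
    // Out of range: clamp to the first or last zone.
    return y < zones[0].y0 ? 0 : zones.length - 1;
}

// regions: [{ ry, ... }] where ry is the region's Y center.
function groupByZone(zones, regions) {
    const buckets = zones.map(zone => ({ zone, regions: [] }));
    for (const r of regions) {
        buckets[zoneIndexFor(zones, r.ry)].regions.push(r);
    }
    return buckets;
}

const zones = parseZones(
    '[{"y0":0,"y1":180,"cols":2},{"y0":180,"y1":320,"cols":1},{"y0":320,"y1":99999,"cols":2}]'
);
const buckets = groupByZone(zones, [{ ry: 50 }, { ry: 180 }, { ry: 250 }, { ry: 400 }]);
console.log(buckets.map(b => b.regions.length)); // [ 1, 2, 1 ]
```

With half-open ranges, a region sitting exactly on a boundary (ry = 180 here) belongs to the lower zone rather than to both.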


Auto-Detection as a Default

The first render uses _detectAutoZones to convert the classifier's columnIndex assignments into zone boundaries:

function _detectAutoZones(regions, numCols) {
    // Nothing to group: avoid reading sorted[0] on an empty list.
    if (regions.length === 0) return [];

    const sorted = [...regions].sort((a, b) => a.yCenter - b.yCenter);

    // Group consecutive regions that share the same full-width status.
    // columnIndex === -1 marks a full-width region.
    const groups = [];
    let cur = { isFullWidth: sorted[0].columnIndex === -1, list: [sorted[0]] };
    for (let i = 1; i < sorted.length; i++) {
        const fw = sorted[i].columnIndex === -1;
        if (fw === cur.isFullWidth) {
            cur.list.push(sorted[i]);
        } else {
            groups.push(cur);
            cur = { isFullWidth: fw, list: [sorted[i]] };
        }
    }
    groups.push(cur);

    // Each boundary is the midpoint between the last region of one group
    // and the first region of the next. The outer edges are 0 and 99999.
    return groups.map((g, i) => {
        const prev = groups[i - 1], next = groups[i + 1];
        const y0 = prev ? Math.round((prev.list.at(-1).yCenter + g.list[0].yCenter) / 2) : 0;
        const y1 = next ? Math.round((g.list.at(-1).yCenter + next.list[0].yCenter) / 2) : 99999;
        return { y0, y1, cols: g.isFullWidth ? 1 : numCols };
    });
}

Consecutive full-width regions → 1-col zone. Consecutive columnar regions → N-col zone. Zone boundaries are midpoints between adjacent groups.


The Zone Toolbar

Each .pdf-region element gets data-ry (Y center) and data-rx (X position) attributes. The zone toolbar reads these to restructure the DOM without re-running extraction.

Cycle column count: Clicking a zone chip cycles its cols value from 1 to 4 and back. applyZones() collects all .pdf-region elements in the zone, groups them by column using Math.floor(rx / pageWidth * cols), and rebuilds the DOM:

const ci = Math.min(Math.floor(r.rx / pageWidth * cols), cols - 1);
colGroups[ci].push(r);

No viewport math. No histogram. Pure division.
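
A slightly fuller sketch of the two pieces described above follows. The function names groupByColumn and nextCols are hypothetical. The lower clamp for a negative rx and the wrap from 4 back to 1 are assumptions; the post only says the value cycles "from 1 to 4 and back".

```javascript
// Hypothetical helper: bucket regions into `cols` equal-width columns by X.
// regions: [{ rx, ry, ... }] read from data-rx / data-ry.
function groupByColumn(regions, pageWidth, cols) {
    const colGroups = Array.from({ length: cols }, () => []);
    for (const r of regions) {
        const raw = Math.floor(r.rx / pageWidth * cols);
        // Clamp so rx === pageWidth (or a negative rx) still lands in a valid column.
        const ci = Math.max(0, Math.min(raw, cols - 1));
        colGroups[ci].push(r);
    }
    // Keep reading order within each column.
    for (const g of colGroups) g.sort((a, b) => a.ry - b.ry);
    return colGroups;
}

// Assumed cycle order: 1 -> 2 -> 3 -> 4 -> 1.
function nextCols(cols, max = 4) {
    return cols % max + 1;
}

const sample = [{ rx: 480, ry: 30 }, { rx: 40, ry: 60 }, { rx: 40, ry: 20 }, { rx: 918, ry: 90 }];
console.log(groupByColumn(sample, 918, 2).map(g => g.length)); // [ 2, 2 ]
console.log(nextCols(4)); // 1
```

Because the column index depends only on rx, pageWidth, and cols, changing a zone's column count means re-bucketing the same regions, with no second extraction pass.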

Split a zone: + Zone finds the largest Y gap across all zones and splits there. One zone becomes two, each inheriting the original column count. The user can then cycle each independently.
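
One way the split could work is sketched below. This is an interpretation, not the shipped code: it assumes the "Y gap" is measured between consecutive region Y centers (data-ry) that fall inside the same zone, and that the split point is the midpoint of the widest such gap. The name splitAtLargestGap is hypothetical.

```javascript
// Hypothetical sketch of "+ Zone".
// zones: [{ y0, y1, cols }] sorted by y0; ys: Y centers of all regions on the page.
function splitAtLargestGap(zones, ys) {
    const sorted = [...ys].sort((a, b) => a - b);
    let best = null;
    for (let i = 1; i < sorted.length; i++) {
        const a = sorted[i - 1], b = sorted[i];
        // Only consider gaps whose two neighbours sit in the same zone.
        const zi = zones.findIndex(z => a >= z.y0 && b < z.y1);
        if (zi === -1) continue;
        const mid = Math.round((a + b) / 2);
        // The split point must fall strictly inside the zone.
        if (mid <= zones[zi].y0 || mid >= zones[zi].y1) continue;
        if (!best || b - a > best.gap) best = { gap: b - a, mid, zi };
    }
    if (!best) return zones.map(z => ({ ...z })); // nothing to split

    // Replace the chosen zone with two halves that inherit its column count.
    return zones.flatMap((z, i) =>
        i === best.zi ? [{ ...z, y1: best.mid }, { ...z, y0: best.mid }] : [{ ...z }]
    );
}

console.log(splitAtLargestGap([{ y0: 0, y1: 99999, cols: 2 }], [40, 90, 140, 400, 450]));
// two zones: 0 to 270 and 270 to 99999, both cols: 2
```

Returning a new array rather than mutating the input keeps the previous zone list available, for example to support undo.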


For 2-1-2 and 2-4-2 Patterns

A page with two-column prose, a full-width warning block, and two-column steps again produces three zones by default: cols:2, cols:1, cols:2. The user does not need to change anything, because auto-detection derives these zones directly from the classifier output.

For a 2-4-2 pattern (two-column top, four-column comparison grid, two-column bottom), the user clicks + Zone to split at the grid boundary, then clicks the middle zone chip to cycle from 2 to 4.


Self-Contained and Downloadable

data-zones is stored on the section element. The downloaded HTML preserves the zone layout without any sidecar files. Any future viewer that reads the attribute can re-apply the column structure.
