Bipartite Column Detection Deepdive
How to Detect PDF Columns Without Lying to Yourself
TLDR: PDF column detection fails when it treats spatial gaps as structural evidence. The correct model is a graph-theoretic separator: find an X value that partitions text bands into two populations with no crossing members. This post walks through the math, the failure modes of histogram approaches, and the final interval-merge + three-gate implementation.
The Problem Space
A PDF page is a flat list of positioned text items. Each item has an X position and a width. To detect multi-column layout, you need to find vertical splits where the left column's content and the right column's content don't overlap horizontally.
That sounds easy. In practice, every naive approach fails on at least one of these:
- Bullet-indented text (lone
•glyphs create real coverage gaps on the left) - Financial table column headers (narrow, right-side-only, top of page)
- Wide left margins (single-column prose with large indent)
- Pages that are 2-column in one section and single-column in another
Why Histogram Valley Scanning Fails
The standard approach builds a per-pixel coverage count: for each X in [0, pageWidth], count how many text bands cover that pixel. Find ranges where the count drops below some threshold (say, 20% of total bands). Call those ranges column gutters.
const bandCount = new Float32Array(pageWidth);
for (const band of narrowBands) {
for (const tm of band.items) {
const x1 = Math.floor(tm.vx);
const x2 = Math.ceil(tm.vx + tm.vWidth);
for (let x = x1; x <= x2; x++) bandCount[x]++;
}
}
const threshold = narrowBands.length * 0.20;
// find ranges where bandCount[x] < threshold
The failure is that this asks a magnitude question: "is coverage low here?" Low coverage is a necessary condition for a column gutter but not a sufficient one. A bullet indent creates real low coverage. A right-side financial column header creates real low coverage. A wide left margin creates real low coverage. None of these are column gutters.
The fix requires asking a structural question: "does this X value partition the band population?"
The Bipartite Partition Model
For a candidate split at position X with gutter half-width tol:
leftBands = bands where maxX <= X - tol // end before the gutter starts
rightBands = bands where minX >= X + tol // start after the gutter ends
crossBands = everything else
The key direction: tol shrinks the allowed space inward. A band is "committed left" only if it ends at least tol pixels before the split. If we used <= X + tol, a band ending 10px past the split center would still count as committed left -- while a band starting at X - 10px counts as committed right. Those bands overlap by 20px. The original v1 algorithm made exactly this mistake.
A split is valid if all three gates pass.
Gate 1: Population on Both Sides
if (leftOnly.length < 3 || rightOnly.length < 3) continue;
Simple count gate. A lone bullet character produces exactly 0 right-side bands. A single financial column header cluster produces 1-2 left-side bands. Both fail immediately without any magnitude measurement.
Gate 2: Local Commitment Ratio
const coexistTop = Math.max(Math.min(...leftOnly.map(b => b.y)), Math.min(...rightOnly.map(b => b.y)));
const coexistBottom = Math.min(Math.max(...leftOnly.map(b => b.y)), Math.max(...rightOnly.map(b => b.y)));
if (coexistBottom < coexistTop) continue; // populations are vertically disjoint
const localBands = narrowBands.filter(b => b.y >= coexistTop && b.y <= coexistBottom);
if ((leftOnly.length + rightOnly.length) / localBands.length < 0.40) continue;
At least 40% of bands in the coexistence zone must commit cleanly to one side.
The critical design choice is the denominator. Using total narrowBands.length fails when a page is single-column for the top 70% and 2-column for the bottom 30%: the committed bands are 30% of total, below the 40% gate.
The correct denominator is localBands -- bands within the Y-range intersection of the left and right populations. The coexistence span is [max(leftMinY, rightMinY), min(leftMaxY, rightMaxY)] -- where both populations are simultaneously present vertically.
This also handles the outlier contamination problem. A stray single-column line at Y=800 satisfies the X predicate (maxX <= X - tol) and would be added to leftOnly. With a union Y-span, this extends localTop to Y=800, pulling all single-column text below the 2-column section into localBands. The commitment ratio collapses. With intersection Y-span, if the right-side population ends at Y=400, coexistBottom = 400. The Y=800 outlier is outside the coexistence zone and never counted.
If coexistBottom < coexistTop, the two populations are vertically disjoint -- they appear in sequential sections of the page, not side-by-side columns. This is an immediate reject.
Gate 3: Vertical Persistence Relative to Content Span
const contentTop = Math.min(...narrowBands.map(b => b.y));
const contentBottom = Math.max(...narrowBands.map(b => b.y));
const contentSpan = contentBottom - contentTop || 1;
const persistThresh = contentTop + contentSpan * 0.20;
const leftConfined = leftOnly.every(b => b.y <= persistThresh); const rightConfined = rightOnly.every(b => b.y <= persistThresh); if (leftConfined && rightConfined) continue;
Both populations must not both be confined to the top 20% of content height.
This kills financial table column headers: "Three Months Ended", "December 31", "2024", "2025" all appear in the top 5-10% of the page's content span. They pass Gates 1 and 2 but both left-side and right-side header populations are within the top 20% of content height.
Using content span rather than viewport height is essential. A short 2-column article in the top half of a page has all its bands in the top 50% of the viewport. A viewport-relative 20% threshold would correctly flag headers but would also kill the valid 2-column layout, since both populations fall inside vpHeight * 0.20. Content-relative persistence measures confinement against the actual height of the content, not the page frame.
Candidate Generation via Interval Merge
Before evaluating gates, we need candidates -- X values worth testing. The wrong approaches:
O(N²) pair loop: For each pair of (leftBand, rightBand), compute gapCenter = (leftBand.maxX + rightBand.minX) / 2. Produces too many candidates for ragged margins; grid-snap deduplication can collapse two real gutters.
Two-pointer walk on sorted maxX/minX lists: Fails on overlapping intervals. Given bands [100, 300] and [200, 400], both lists interleave and there's no way to know that minX=200 belongs to a band extending to 400, which overlaps the band ending at 300.
Correct: interval merge.
const sorted = [...narrowBands].sort((a, b) => a.minX - b.minX);
const spans = [];
for (const band of sorted) {
if (spans.length && band.minX <= spans.at(-1).hi) {
spans.at(-1).hi = Math.max(spans.at(-1).hi, band.maxX);
} else {
spans.push({ lo: band.minX, hi: band.maxX });
}
}
const rawCandidates = [];
for (let i = 0; i + 1 < spans.length; i++) {
const gap = spans[i + 1].lo - spans[i].hi;
if (gap >= scale.colGapMinPx) {
const center = (spans[i].hi + spans[i + 1].lo) / 2;
if (center >= vpWidth 0.10 && center <= vpWidth 0.90) {
rawCandidates.push(center);
}
}
}
Treat each band as an interval. Merge overlapping intervals into disjoint coverage spans. Find gaps between consecutive spans. Each gap produces one candidate -- the midpoint of the gap.
[100, 300] and [200, 400] merge into [100, 400]. No phantom gap at x=300. The [200, 400] overlap is absorbed correctly during merge.
O(N log N). One candidate per real physical gap. No grid-snapping needed.
The Cascade Effect
Fixing column detection fixes stream table detection for free.
detectStreamTables runs on unclaimedMeta -- text items not already assigned to any region. False column splits cause _splitByColumns to run, which assigns every text item to PARAGRAPH regions at columnIndex=0,1,2,3. By the time stream detection runs, unclaimedMeta.length = 0.
On the AMZN PDF, the financial tables on pages 6-14 have exactly the right properties for stream detection: average item length 6.5 chars (threshold 20 ✓), average items per band 5.9 (threshold 8 ✓), column alignment tight enough to pass fill rate. They would all be detected. But they never get the chance because false column splits already claimed them.
No changes to streamDetector.js are needed. The fix is entirely in _detectPageColumns.
Constants Summary
| Constant | Value | What It Measures | |---|---|---| | WIDE_BAND_FRAC | 0.55 | Band spanning >55% of page width is full-width, excluded from column analysis | | MIN_SIDE | 3 | Minimum committed bands per side | | MIN_COMMITMENT | 0.40 | Fraction of coexistence-zone bands that must commit | | PERSIST_FRAC | 0.20 | Top fraction of content span considered "confined" | | tol | max(4, colGapMinPx * 0.5) | Inward gutter half-width |
None of these are pixel-valued thresholds. They are counts, fractions, and scale-derived values. The algorithm adapts to any page size, zoom level, or font size without configuration.