Bipartite Column Detection Postmortem
Postmortem: Three Rewrites to Get Column Detection Right
TLDR: The PDF column detector had a fundamental design flaw — it asked the wrong question. Two full rewrites of the algorithm and four pressure-test failures later, we landed on a structural predicate that actually works.
What We Started With
The original _detectPageColumns built a pixel-wide coverage histogram: for each X position on the page, count how many narrow text bands cross that pixel. Find valleys where fewer than 20% of bands have coverage. Call those valleys column gutters.
It works on the whiteboard. It fails on real PDFs.
The AMZN Q4 2025 earnings release exposed three distinct failure modes on 12 out of 14 pages. Every single financial table on pages 6-14 was split into 3-4 false columns. Pages 1-3 with bullet lists were split down the middle. The downstream stream detector never fired once — because false splits had already claimed every text item as a PARAGRAPH region, leaving unclaimedMeta.length = 0.
One algorithm failure cascaded into two complete feature failures.
What We Tried First (v1)
We replaced the histogram valley scan with a bipartite partition approach. The idea: instead of asking "is there a coverage gap?", ask "does this X value partition bands into two populations with no crossing members?"
The v1 gates were:
leftBands = narrowBands where maxX <= X + tolrightBands = narrowBands where minX >= X - tol- Gate 1: 3+ bands each side
- Gate 2: 40% of all narrow bands committed
- Gate 3: not both populations confined to top 20% of viewport height
Tolerance was backwards. Defining left as maxX <= X + tol means a band ending 10px past the split center still counts as "entirely left." Meanwhile a band starting at X - 10px counts as "entirely right." Those two bands physically overlap by 20px, yet the math classifies them as a clean partition. We were expanding the allowed space instead of shrinking it.
Gate 3 used viewport height, not content height. A page with a short 2-column article in the top half and blank space below has all its bands in the top 50% of the viewport. With a fixed 20% viewport threshold, both populations test as "confined" and Gate 3 kills the valid split.
Gate 2 used global band count as denominator. If the bottom 30% of a page is 2-column and the top 70% is single-column, the committed bands represent only 30% of total narrow bands. Well below the 40% threshold. The 2-column section is permanently invisible.
Candidate generation was O(N²) with grid-snap deduplication. For pages with ragged column margins, every left-band/right-band pair produces a slightly different gapCenter. The grid snap meant two real gutters that happened to land on the same snap point would collapse into one candidate.
We kept the structural insight but had to rebuild the mechanics.
What We Threw Away
The entire histogram approach is gone. bandCount, Float32Array, the valley scan loop, MARGIN_FLOOR — all deleted. The 20% threshold has no replacement because it was the wrong unit of measurement.
The O(N²) pair loop is gone. The two-pointer walk on sorted maxX/minX lists is gone too — it breaks on overlapping intervals and we caught that before implementing it.
The v1 tolerance direction is reversed. Every line of code that said <= X + tol for left and >= X - tol for right is wrong and removed.
What Survived
The core insight survived: column detection is a structural predicate, not a histogram threshold. A valid column split partitions bands into two populations. The question is binary.
The three-gate structure survived. The gates themselves were redesigned but the concept — population count, commitment ratio, vertical persistence — is correct.
WIDE_BAND_FRAC = 0.55 survived unchanged. Bands spanning more than 55% of page width are full-width. That threshold held up across all test cases.
What the Final Design Looks Like
Candidates via interval merge. Each narrow band is an interval [minX, maxX]. Sort by minX, merge overlapping intervals into disjoint coverage spans, find gaps between adjacent spans. One candidate per real physical gap. Handles overlaps correctly. O(N log N). No grid-snapping.
Tolerance shrinks inward. Left means maxX <= X - tol. Right means minX >= X + tol. A band must end before the gutter starts to count as committed left. No overlapping bands can both be committed.
Gate 2 uses coexistence span. localBands is scoped to the Y-range intersection of left and right populations, not their union. A stray outlier that extends one population's Y range doesn't flood the denominator. If the two populations are vertically disjoint, the split is rejected immediately — sequential sections are not columns.
Gate 3 uses content span. persistThresh = contentTop + contentSpan * 0.20. Confinement is measured against the actual height of the content, not the page frame. A short article at the top of a page has its own content span.
Lessons
A histogram is a heuristic about magnitude. It measures how much something is present, not whether something is structurally true. Column detection is a structural question — either the page content partitions or it doesn't. The moment we reached for a threshold ("less than 20%"), we accepted fragility.
The cascade failure was instructive. False column splits don't just produce bad column layout — they consume all text items into PARAGRAPH regions, which starves every downstream detector that depends on unclaimedMeta. One bad gate can silently disable an entire subsystem with no error thrown.
Pressure testing against real PDFs before design review would have caught all four v1 bugs before the first line of production code. The test script existed. We should have run it against the draft algorithm on paper first.