Engineering Journal
Pdf Processor
Pdf Processor

Bipartite Column Detection Errorfix

2026-05-30

Four Bugs in One Algorithm Design

TLDR: The v1 bipartite column detection design had four compounding bugs before a single line was written in production. All four were caught in design review against the AMZN Q4 2025 PDF. Each one would have caused a silent failure in a different edge case.


Bug 1: Tolerance Direction Reversed

Symptom: Two bands that physically overlap get classified as belonging to opposite sides of a split.

Root Cause: The v1 definition used tol to expand the allowed space:

const leftOnly  = narrowBands.filter(b => b.maxX  <= X + tol);
const rightOnly = narrowBands.filter(b => b.minX  >= X - tol);

A band ending at X + 9 satisfies maxX <= X + 10 -- counted as committed left. A band starting at X - 9 satisfies minX >= X - 10 -- counted as committed right. Those two bands overlap by 18 pixels and share the gutter zone. The algorithm counts them as a clean partition.

Fix: Tolerance must shrink the allowed space inward:

const leftOnly  = narrowBands.filter(b => b.maxX  <= X - tol);
const rightOnly = narrowBands.filter(b => b.minX  >= X + tol);

A band is "committed left" only if it ends at least tol pixels before the split center. No overlap is possible between committed populations.

Guard: If leftOnly and rightOnly can ever share a band, your tolerance direction is wrong.


Bug 2: Gate 3 Used Viewport Height, Not Content Height

Symptom: A short 2-column article at the top of a page has its valid split rejected.

Root Cause: Gate 3 compared band Y values against vpHeight * PERSIST_FRAC:

const persistThresh = viewport.height * 0.20;
const leftConfined  = leftOnly.every(b => b.y <= persistThresh);
const rightConfined = rightOnly.every(b => b.y <= persistThresh);
if (leftConfined && rightConfined) continue;

If the page has content only in the top 50% and blank space below, all bands live in the top 50% of the viewport. The intended target was financial table column headers (which live in the top 5-10% of content). But a viewport-relative 20% threshold incorrectly catches any page where content doesn't fill the full viewport.

Fix: Measure confinement against the actual content span:

const contentTop    = Math.min(...narrowBands.map(b => b.y));
const contentBottom = Math.max(...narrowBands.map(b => b.y));
const contentSpan   = contentBottom - contentTop || 1;
const persistThresh = contentTop + contentSpan * 0.20;

Now "confined to top 20%" means the top 20% of whatever height the content actually occupies. A short article has its own content span. Financial headers at the top 5% of a full-page document still fail Gate 3.

Lesson: Any threshold that references the page frame rather than the content is implicitly assuming the content fills the page.


Bug 3: Gate 2 Denominator Was Global Band Count

Symptom: A genuine 2-column section in the bottom third of a page has its split rejected.

Root Cause: Gate 2 divided by total narrow bands:

if ((leftOnly.length + rightOnly.length) / narrowBands.length < 0.40) continue;

If the top 70% of a page is single-column and the bottom 30% is 2-column, the committed bands from the 2-column section represent only 30% of the total. 0.30 < 0.40. The valid split is rejected.

Compounding this: union Y-span as the denominator scope would be contaminated by outliers. A stray left-aligned line at Y=800 (a "Sincerely," at the bottom of a letter) satisfies maxX <= X - tol and gets added to leftOnly. The union Y-span stretches to Y=800. All single-column text in Y=400-800 floods localBands. The ratio collapses.

Fix: Use the coexistence Y-span as denominator scope -- the Y-range intersection of the left and right populations:

const coexistTop    = Math.max(Math.min(...leftOnly.map(b => b.y)), Math.min(...rightOnly.map(b => b.y)));
const coexistBottom = Math.min(Math.max(...leftOnly.map(b => b.y)), Math.max(...rightOnly.map(b => b.y)));
if (coexistBottom < coexistTop) continue;  // populations never co-exist vertically
const localBands = narrowBands.filter(b => b.y >= coexistTop && b.y <= coexistBottom);
if ((leftOnly.length + rightOnly.length) / localBands.length < 0.40) continue;

Coexistence span = where both left and right populations are simultaneously present. An outlier at Y=800 is outside coexistBottom if the right-side population ends at Y=400. It never stretches the interval.

Guard: If coexistBottom < coexistTop, the populations are vertically disjoint -- sequential sections, not columns. Reject before computing any ratio.


Bug 4: Two-Pointer Walk Breaks on Overlapping Intervals

Symptom: A real column gutter goes undetected because overlapping band intervals confuse the candidate scan.

Root Cause: The proposed O(N log N) candidate generation sorted maxX and minX values into separate lists and walked them with two pointers. This breaks immediately on overlapping intervals.

Given bands [100, 300] and [200, 400]:

When the pointer reaches maxX = 300, minX = 200 has already advanced past it. The walk sees maxX=300 and the next minX after it is... there isn't one. The bands [100,300] and [200,400] form one overlapping coverage blob. But the two-pointer walk has no mechanism to detect that minX=200 belongs to an interval that reaches past maxX=300. It would either miss the overlap or find a phantom gap.

Fix: Interval merge before gap search:

const sorted = [...narrowBands].sort((a, b) => a.minX - b.minX);
const spans = [];
for (const band of sorted) {
    if (spans.length && band.minX <= spans.at(-1).hi) {
        spans.at(-1).hi = Math.max(spans.at(-1).hi, band.maxX);
    } else {
        spans.push({ lo: band.minX, hi: band.maxX });
    }
}
// gaps between spans are candidates

[100, 300] and [200, 400] merge into [100, 400]. No phantom gap at x=300. The gap search runs on disjoint spans only. One candidate per real physical gap.

Lesson: Two independent sorted lists cannot represent interval overlap information. If you need to reason about gaps in a set of intervals, merge them first.


The Common Thread

All four bugs share the same root: the algorithm was reasoning about geometry using local properties (a single band's X position, a single band's Y position) without accounting for the global structure (the full population's Y span, the intervals' actual coverage overlap).

Threshold direction, reference frame, denominator scope, interval overlap -- each one looks correct in isolation and fails when the input has the edge case the local view can't see.

Read this post in the full Engineering Journal →