2026-07-11

PDF Column Detection Is a Graph Problem, Not a Signal Processing Problem

TLDR: Every PDF extractor I've seen treats column detection as a signal processing problem: find the valley in the histogram. That's the wrong framing. Column detection is a graph separator problem. Until the field switches frames, PDF extractors will keep shipping magic threshold lists and calling it done.

The Industry Default Is Broken By Design

The standard approach to PDF column detection goes like this: project all text onto the X axis, build a coverage histogram, find valleys. Column gutter = low coverage zone.

This is a signal processing framing. You have a 1D signal (coverage by X position), you find features in that signal (low valleys), and you call those features column gutters.
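For reference, here is a minimal sketch of the histogram approach being criticized. It is not taken from any particular extractor; the `Band` tuple, the bin count, and the `threshold` parameter are assumptions made for illustration.

```python
from typing import List, Tuple

Band = Tuple[float, float]  # (min_x, max_x) of one text band; hypothetical representation


def coverage_histogram(bands: List[Band], page_width: float, bins: int = 100) -> List[int]:
    """Count how many bands cover each of `bins` equal-width X slices."""
    hist = [0] * bins
    bin_width = page_width / bins
    for min_x, max_x in bands:
        first = max(0, int(min_x / bin_width))
        last = min(bins - 1, int(max_x / bin_width))
        for i in range(first, last + 1):
            hist[i] += 1
    return hist


def find_valleys(hist: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Return (start_bin, end_bin) runs where coverage is at or below `threshold`."""
    valleys = []
    start = None
    for i, count in enumerate(hist):
        if count <= threshold:
            if start is None:
                start = i
        elif start is not None:
            valleys.append((start, i - 1))
            start = None
    if start is not None:
        valleys.append((start, len(hist) - 1))
    return valleys
```

On a clean two-column page this returns the gutter, but it also returns both page margins as valleys. Everything after that point is a decision about which valleys to ignore, and that is where the threshold lists come from.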

Signal processing is the right tool when your input is a continuous measurement with noise, such as audio, sensor data, or images. It is the wrong tool when your input is a set of geometric objects with discrete relationships. Text bands are not measurements. They are structural elements. The question "is there a column here?" is not "is this value below threshold?" It is "does this X value separate this population into two independent groups?"

That is a graph theory question. Specifically, it is asking whether X is a separator of the interval graph formed by the text bands.

Why Histogram Valleys Are Specifically Wrong for PDFs

The histogram approach encodes three assumptions that PDFs routinely violate:

Assumption 1: Low coverage implies a column gutter. Reality: Bullet indentation creates low coverage on the left. Financial table column headers create low coverage on the right. Wide left margins create low coverage in the middle. None of these are column gutters. Low coverage is a necessary condition, not a sufficient one.

Assumption 2: Coverage is uniform within a column. Reality: Text lines have varying widths. The right edge of the left column is ragged. The left edge of the right column is ragged. The histogram shows intermittent coverage in the gutter zone, not a clean valley, even for real multi-column layouts. Threshold tuning is an arms race against the content.

Assumption 3: The gutter is in a predictable location relative to the page width. Reality: Academic papers have wide gutters at 50% of the page width. Financial reports have narrow columns at 60%. Newsletters have three columns. MARGIN_FLOOR tricks (don't look before 10% or after 90%) are bandages on a fundamentally wrong model.

The Right Frame: Graph Separator

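A text band is an interval [minX, maxX]. The interval graph has a node for each band and an edge between two nodes if their intervals overlap. A candidate split at X sorts the bands into three groups: left-only (maxX < X), right-only (minX > X), and crossing (the interval contains X). The crossing bands are the separator: remove them and the graph falls into a left part and a right part with no edges between them, because a left-only interval and a right-only interval cannot overlap. The split is a good column boundary when that separator is small relative to the two sides it divides.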

That is the exact question the bipartite band partition asks. Not "is coverage low here?" but "do the bands commit cleanly to two sides?"
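The partition itself takes only a few lines. The sketch below assumes a hypothetical `Band` type holding just the X interval; `overlaps` is the edge test of the interval graph and `partition_at` sorts bands into left-only, right-only, and crossing for a candidate X.

```python
from typing import List, NamedTuple, Tuple


class Band(NamedTuple):
    min_x: float
    max_x: float


def overlaps(a: Band, b: Band) -> bool:
    """Edge test for the interval graph: do the two X intervals intersect?"""
    return a.min_x <= b.max_x and b.min_x <= a.max_x


def partition_at(bands: List[Band], x: float) -> Tuple[List[Band], List[Band], List[Band]]:
    """Split bands into (left-only, right-only, crossing) relative to x."""
    left = [b for b in bands if b.max_x < x]
    right = [b for b in bands if b.min_x > x]
    crossing = [b for b in bands if b.min_x <= x <= b.max_x]
    return left, right, crossing


# Two columns plus one full-width heading that crosses the candidate split.
bands = [Band(50, 280), Band(60, 270), Band(320, 550), Band(330, 540), Band(50, 550)]
left, right, crossing = partition_at(bands, 300)
```

No left-only band can share an edge with a right-only band, so the only thing connecting the two sides is the crossing set. How large that set is allowed to be is a question for the validation gates, not for the partition.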

The three validation gates flow naturally from this framing:

Gate 1 (population count): a separator that isolates 1 band on one side is a degenerate cut, not a column boundary
Gate 2 (commitment ratio): if most bands cross the candidate separator, it is not a real structural boundary
Gate 3 (vertical persistence): a separator that only holds in the top 5% of the content height is a local anomaly, not a page-level layout feature

None of these gates has a pixel-valued threshold. They are counts and ratios over the band population: structural predicates, not measurements. The algorithm is self-normalizing: it does not need to know the page dimensions, the font size, the zoom level, or the language of the document.
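One way the three gates could be written down is sketched below. The post does not give the cutoffs, so `min_side`, `min_commit_ratio`, and `min_persistence` are assumed values, and measuring persistence as the Y range shared by both sides divided by the total content height is also an assumption, one reasonable reading of Gate 3 rather than the only one.

```python
from typing import List, NamedTuple


class Band(NamedTuple):
    min_x: float
    max_x: float
    y: float  # vertical position of the band (hypothetical field)


def passes_gates(
    bands: List[Band],
    x: float,
    min_side: int = 2,              # assumed: a side with fewer bands is a degenerate cut
    min_commit_ratio: float = 0.5,  # assumed: share of bands that must commit to one side
    min_persistence: float = 0.5,   # assumed: share of content height both sides must span
) -> bool:
    left = [b for b in bands if b.max_x < x]
    right = [b for b in bands if b.min_x > x]

    # Gate 1 (population count): both sides need a real population.
    if min(len(left), len(right)) < max(1, min_side):
        return False

    # Gate 2 (commitment ratio): most bands must sit fully on one side.
    if (len(left) + len(right)) / len(bands) < min_commit_ratio:
        return False

    # Gate 3 (vertical persistence): Y range where both sides coexist,
    # relative to the full content height. Works for either Y direction.
    ys = [b.y for b in bands]
    content_height = max(ys) - min(ys)
    if content_height <= 0:
        return False
    shared_start = max(min(b.y for b in left), min(b.y for b in right))
    shared_end = min(max(b.y for b in left), max(b.y for b in right))
    shared = max(0.0, shared_end - shared_start)
    return shared / content_height >= min_persistence
```

Every input to these checks is either a count or a ratio of two quantities in the same units, so scaling the page or changing the zoom level leaves the result unchanged.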

The Cascade Problem Nobody Talks About

Bad column detection doesn't just produce bad column layout. It poisons every downstream detector.

Most PDF extractors run column detection first, then assign text items to regions, then run table detection on unclaimed items. If column detection produces false splits, all text items get claimed by paragraph regions. The table detector sees an empty input. Every borderless table on the page is silently missed.

This is not a PDF-specific problem. It is a pipeline contamination problem. Any upstream heuristic that overclaims input will starve downstream detectors. The AMZN Q4 2025 earnings release had a 0% stream table detection rate across 14 pages. Not because the stream detector was wrong. Because the column detector had already consumed every text item before stream detection ran.

Fixing the upstream gate fixed the downstream detector as a free consequence. Not by changing the stream detector at all, just by stopping the false claims.
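The starvation effect is easy to reproduce in a toy pipeline. Everything in the sketch below is hypothetical: `Item`, the two stand-in column detectors, and a table detector that simply needs at least four unclaimed items to find anything. Only the ordering (columns claim first, tables see the remainder) comes from the description above.

```python
from typing import Callable, List, NamedTuple, Set


class Item(NamedTuple):
    id: int
    x: float
    y: float


def run_pipeline(
    items: List[Item],
    column_claims: Callable[[List[Item]], Set[int]],
    table_detector: Callable[[List[Item]], int],
) -> int:
    """Run column detection first, then pass only unclaimed items to table detection."""
    claimed = column_claims(items)
    unclaimed = [it for it in items if it.id not in claimed]
    return table_detector(unclaimed)


def overclaiming_columns(items: List[Item]) -> Set[int]:
    """Stand-in for a false split: every item is claimed by a paragraph region."""
    return {it.id for it in items}


def gated_columns(items: List[Item]) -> Set[int]:
    """Stand-in for a detector that rejects the false split and claims nothing."""
    return set()


def toy_table_detector(items: List[Item]) -> int:
    """Toy rule: report one table if at least four items are left to look at."""
    return 1 if len(items) >= 4 else 0


# A 3x2 grid of cells from a borderless table.
grid = [Item(i, x, y) for i, (x, y) in enumerate((x, y) for y in (0.0, 12.0, 24.0) for x in (100.0, 400.0))]
tables_before = run_pipeline(grid, overclaiming_columns, toy_table_detector)
tables_after = run_pipeline(grid, gated_columns, toy_table_detector)
```

The table detector is the same function in both runs. The only thing that changes is what the upstream stage leaves for it.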

What the Industry Should Do Instead

Stop treating PDF layout analysis as signal processing. Start treating it as computational geometry with structural predicates.

Column detection: interval graph separator (this post).
Table detection: column-anchor alignment clustering (stream tables) or segment topology (lattice tables).
Region classification: spatial containment + semantic context from surrounding structure.

Each of these is a geometric or topological question, not a threshold question. The answers are deterministic given the input geometry. No training data, no model weights, no threshold tuning.

The reason the industry reaches for signal processing is that it is fast to implement and works on the easy cases. The reason the industry ships threshold lists is that the easy-case solution fails on real documents and the path of least resistance is to add more thresholds.

The cost of that path is predictability. You cannot reason about what a threshold-based system will do on a document you haven't tested. You can reason about what a structural predicate will do: either the geometry satisfies the predicate or it doesn't.

Determinism is not a nice-to-have for document extraction. It is the entire value proposition. If users cannot predict what the extractor will do, they cannot trust the output. And if they cannot trust the output, they will not use the tool.
