Pdf Processor

Detecting Borderless Tables in PDFs Without a Neural Net

2026-05-22

TLDR: streamDetector.js finds borderless tables by grouping text rows into Y-bands, clustering X-positions into column anchors, and scoring alignment regularity. Confidence threshold 0.45 blocks false positives. No ML, no server.

Repo: tools/pdf-processor

The Problem

Lattice-based table detection finds tables by their ruling lines. Horizontal and vertical vector segments form a grid, and grid intersections become cell corners.

This misses an entire class: tables with no lines at all. Specification sheets, datasheets, and technical correspondence are full of them. A column of component names next to a column of values. No border, no separator, just aligned text.

The lattice path emits those rows as paragraphs.

The Naive Attempt

First instinct: anything with two or more text items on the same Y is a row. More than two rows with consistent X alignment equals a table.

This fires on paragraph left margins, bullet list indents, and two-column prose. The detector needs more signal than shared Y-position.

Why It Broke

Problem 1: X-alignment alone is not enough.

A paragraph's lines all start at the same X. The detector was flagging paragraph blocks as tables because the left margin of every line is a "column anchor."

A real table has multiple X-positions per row that repeat across rows. A paragraph has one anchor but only that one stays consistent.

Problem 2: Row spacing is not uniform.

A spec sheet table has a header row, a thin separator gap, then data rows. A fixed-gap threshold splits the header into its own group, which then fails the minimum-band count check. The whole table becomes invisible.

Problem 3: Footer bands corrupt the score.

The confidence score includes a rowSpacingScore measuring how regular inter-band gaps are. Footer lines in the same Y range add 200px outlier gaps. Regular-looking data rows score near-zero confidence and get dropped.

The Math Behind the Fix

Step 1 — Y-band grouping

Group text items whose Y positions are within bodyFontVp * 0.4 of each other. This relative tolerance scales with the document's body font size, measured in viewport pixels.

Each band is a visual row: a horizontal slice of the page at the same reading line.

Step 2 — Adaptive gap splitting

Split bands into candidate table groups using adaptive gap detection, not a fixed threshold.

function _groupBandsByAdaptiveGap(bands) {
    const gaps = [];
    for (let i = 1; i < bands.length; i++) {
        gaps.push(bands[i].minY - bands[i - 1].maxY);
    }
    const medianGap = median(gaps);
    const splitThreshold = Math.max(medianGap * 2.5, 20);
    // split wherever gap > splitThreshold
}

The threshold derives from the data itself. A spec sheet with 43px inter-row gaps sets splitThreshold = 107px. A section break at 183px splits correctly. All data rows stay in one group.

If every group fails the minimum band count (3), the fallback treats all bands as one group. This handles single-section tables.

Step 3 — Column anchor clustering

For each candidate group, find X positions that appear in at least 2 distinct bands within a colTol radius (bodyFontVp * 0.8).

const anchors = [];
for (const x of xPositions) {
    const existing = anchors.find(a => Math.abs(a.x - x) < colTol);
    if (existing) {
        existing.count++;
        existing.x = runningMean(existing.x, x);
    } else {
        anchors.push({ x, count: 1 });
    }
}
const columnAnchors = anchors.filter(a => a.count >= 2);

A paragraph has one anchor (its left margin) with high count but low column diversity. A real table has 2 to N anchors, each backed by items from multiple bands.

Step 4 — Confidence scoring

Two metrics are combined:

Column alignment score: How tightly do items cluster around their anchors? Compute the standard deviation of distances from each item to its nearest anchor, normalized by colTol. Low std dev equals tight alignment equals high score.

Row spacing score: How regular is the inter-band gap? Compute std dev of row gaps only for bands that align to at least one column anchor. Footer bands and section labels that don't align to any anchor are excluded.

const participatingBands = bands.filter(b =>
    b.items.some(item =>
        anchors.some(a => Math.abs(item.vx - a.x) < colTol)
    )
);
const bandGaps = participatingBands
    .map((b, i, arr) => i === 0 ? null : b.minY - arr[i - 1].maxY)
    .filter(Boolean);
const rowSpacingScore = 1 - (stdDev(bandGaps) / mean(bandGaps));

Final confidence: (colAlignScore + rowSpacingScore) / 2. Threshold: 0.45.

The Overlap Guard

Lattice detection and stream detection can both fire on the same region. Lattice always wins.

Before emitting a stream table, check how much of its bounding box is already covered by an existing lattice table region. If coverage exceeds 80%, suppress the stream result.

const coverage = computeOverlapFraction(streamCandidate.bbox, latticeRegion.bbox);
if (coverage > 0.8) continue;

The right boundary of each stream column extends to actual item right edges, not just anchor position plus tolerance. A multi-word cell like "Maximum Operating Temperature" lives 180px to the right of its anchor X. Without this, the column bbox truncates the cell content.

What tableBuilder Gets

streamDetector emits a synthetic lattice object:

{
    hLines: [],
    vLines: [],
    border: false,
    detectionMethod: 'stream',
    confidence: 0.61
}

Empty hLines/vLines tells tableBuilder to produce 1×1 un-spanned cells. No changes needed in the builder. Stream tables emit the same <table> structure as lattice tables.

The isBorderless flag (hLines.length === 0 && vLines.length === 0) also gates the colspan/rowspan loops. Without it, the expansion loop runs to the grid edge and collapses 87+ items into a single <th colspan=N rowspan=M>.

Tradeoffs

Confidence threshold 0.45 is empirical. Calibrated against one sparktoro spec sheet. Documents with irregular column spacing or mixed-alignment rows will need tuning.

Single-column tables are skipped. Stream detection requires at least 2 column anchors. A formatted single-column list will not trigger either detection path.

Right boundary is approximate. Column width uses the maximum right edge of items in that column. A very wide outlier inflates the column. A percentile-based approach would be more robust.

The sparktoro validation case in numbers: median inter-row gap = 43px, adaptive threshold = 107px, section break at 183px splits correctly, all 12 data rows stay in one group, confidence = 0.61, table detected cleanly.

Read this post in the full Engineering Journal →