Pdf Processor

The Slat Table Blind Spot: Rethinking Table Extraction in PDFs

2026-05-11

TLDR

Financial balance sheets frequently encode tables using only horizontal dividing lines ("slats") with zero vertical lines and non-aligned header text. These tables fail both vector lattice reconstructors ($\text{vLines} = 0$) and stream detectors (sparse header column anchors). Implementing a partial-grid path that clusters $X$-coordinates strictly within horizontal slat boundaries dynamically infers vertical boundaries, while rightward empty cell consumption handles colspan automatically.

Extraction Mode	Grid Requirement	Slat Table Extraction Result
Strict Vector Lattice	$\ge 3$ Horizontal AND $\ge 3$ Vertical lines	Fails ($\text{vLines} = 0$, dropped to prose)
Borderless Stream Detector	Multi-row vertically aligned $X$-anchors	Fails (Sparse single-row headers)
Hybrid Slat Inferrer	Horizontal slats + localized text $X$-clustering	100% Correct Grid & `colspan`

Technical slat table boundary & spanning engine

// Infer missing vertical boundary lines inside horizontal slat Y-bounds
export function inferSlatTableColumns(sortedTextItems, topSlatY, bottomSlatY) {
  // Filter text items strictly falling within horizontal slat bounds
  const slatTextItems = sortedTextItems.filter(item => item.y >= topSlatY && item.y <= bottomSlatY);
// Group items by X-proximity to identify natural column clusters   const xClusters = clusterItemsByXProximity(slatTextItems);
// Construct virtual vertical boundaries in the gaps between X-clusters   const virtualColBounds = [slatTextItems[0].xMin];   for (let i = 1; i < xClusters.length; i++) {     const gapCenterX = (xClusters[i - 1].xMax + xClusters[i].xMin) / 2;     virtualColBounds.push(gapCenterX);   }
return virtualColBounds; }
// Expand HTML table cell colspan across contiguous empty grid cells export function computeSlatCellColspan(cellGrid, rowIndex, colIndex, totalCols) {   let colspan = 1;   while (colIndex + colspan < totalCols && cellGrid[rowIndex][colIndex + colspan].isEmpty) {     colspan++;   }   return colspan; }

Rule of thumb: Combine horizontal vector line boundaries with localized text $X$-clustering to infer virtual vertical grid lines for un-bordered slat tables.

Read this post in the full Engineering Journal →