The Slat Table Blind Spot: Rethinking Table Extraction in PDFs
When building a deterministic PDF extraction pipeline, you quickly learn that visual grids come in many flavors.
Our engine uses a LatticeReconstructor to find physical grids (intersecting horizontal and vertical lines) and a StreamDetector to find borderless tables (using spatial clustering of text anchors). This worked perfectly for 95% of documents.
But during a test on standard financial reports, we hit a massive blind spot: Slat Tables.
The Problem: Parallel Lines and Sparse Headers
A "slat table" is a table that relies entirely on horizontal dividing lines (slats) but has absolutely zero vertical lines. You see these constantly in SEC filings and corporate balance sheets.
                         Three Months Ended
                            December 31,
                         2024        2025
Revenue                  $50,000     $55,000
Cost of Goods Sold       $20,000     $22,000
Gross Profit             $30,000     $33,000
When our engine processed this, the table completely vanished. Why?
- The Lattice Reconstructor rejected it: Our lattice path requires at least 3 horizontal lines AND 3 vertical lines before it will build intersecting cells. Since vLines.length === 0, it fell through.
- The Stream Detector rejected it: Borderless extraction looks for "column anchors", meaning text items that align vertically across multiple rows. But look at the header: "2024" and "2025" only appear in a single row! There are no vertically stacked anchors to trigger the threshold, so the region fell through and was classified as standard paragraphs.
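To make the double rejection concrete, here is a minimal sketch of the two gates. The function name classifyRegion, the constant names, the anchor shape, and the 2-row anchor minimum are all hypothetical; only the "3 horizontal and 3 vertical lines" rule and the idea of vertically stacked anchors come from the description above.

```javascript
// Hypothetical thresholds. Only the 3-and-3 lattice rule is stated in this post;
// the 2-row anchor minimum is an assumption for illustration.
const MIN_LATTICE_LINES = 3;
const MIN_ANCHOR_ROWS = 2;

// hLines / vLines: arrays of detected rule segments (contents are irrelevant here).
// anchors: array of { x, rows }, where rows is how many rows share that X alignment.
function classifyRegion(hLines, vLines, anchors) {
  // Gate 1: a physical grid needs enough lines in BOTH directions.
  if (hLines.length >= MIN_LATTICE_LINES && vLines.length >= MIN_LATTICE_LINES) {
    return "lattice";
  }
  // Gate 2: a borderless table needs at least one anchor stacked across rows.
  if (anchors.some((a) => a.rows >= MIN_ANCHOR_ROWS)) {
    return "stream";
  }
  // Neither gate fired: the region is treated as ordinary prose.
  return "paragraph";
}

// A slat table: four horizontal slats, no vertical lines, single-row header anchors.
const slatResult = classifyRegion(
  [{ y: 100 }, { y: 120 }, { y: 140 }, { y: 160 }],
  [],
  [{ x: 300, rows: 1 }, { x: 400, rows: 1 }]
);
console.log(slatResult); // "paragraph"
```

Under these assumptions the slat table satisfies neither gate, which is the failure described above.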
The Naive Attempt
Initially, I thought about just lowering the thresholds. What if the LatticeReconstructor only required 2 lines? What if the StreamDetector only required 1 anchor?
This immediately broke everything else. Lowering the lattice threshold caused random underlines and decorative paragraph separators to instantly collapse into fake "tables". Lowering the stream threshold caused multi-column prose to be shredded into table grids.
The thresholds were protecting the engine. The problem wasn't the thresholds; it was the concept of a slat table. A slat table isn't a grid, and it isn't borderless. It's a hybrid.
The Hybrid Approach: Text-Driven Boundary Inference
To fix this, we needed to pass the text data back into the geometric line reconstructor.
We rewrote the LatticeReconstructor to handle "partial grids". If it detects strong horizontal lines but no vertical lines, it now enters a specialized slat-table path:
- Constrain the Space: We use the horizontal slats to define the yMin and yMax of the bounding box.
- Filter the Text: We grab every text item that falls inside this boundary.
- Infer the Perpendiculars: We run a localized spatial X-clustering pass only on the text items inside the slats.
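// Localized X-clustering to infer missing columns
const clusters = [];
for (const item of sortedTextItems) {
  // ... group items by X-proximity ...
}

// Draw a virtual vertical line in the gap between each pair of adjacent clusters.
// Assumes each cluster records the minLeft and maxRight of its text items.
const cols = [leftEdge];
for (let i = 1; i < clusters.length; i++) {
  const maxRight = clusters[i - 1].maxRight; // right edge of the previous cluster
  const minLeft = clusters[i].minLeft; // left edge of the current cluster
  const gap = minLeft - maxRight;
  if (gap > 0) {
    cols.push((maxRight + minLeft) / 2); // Virtual column boundary
  }
}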
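For readers who want to run the three steps end to end, here is a self-contained sketch. It is not the engine's actual code: the function name inferSlatColumns, the GAP_TOLERANCE constant, and the input shapes (a slat as { y }, a text item as { x, y, width }) are assumptions made for illustration.

```javascript
// Assumed maximum horizontal gap (in PDF units) between items of the same column.
const GAP_TOLERANCE = 10;

// slats: [{ y }], textItems: [{ x, y, width }] (hypothetical shapes).
// Returns the X positions of the inferred column boundaries, left to right.
function inferSlatColumns(slats, textItems) {
  // 1. Constrain the space: the outermost slats bound the table vertically.
  const ys = slats.map((s) => s.y);
  const yMin = Math.min(...ys);
  const yMax = Math.max(...ys);

  // 2. Filter the text: keep only items inside the slat boundary.
  const inside = textItems.filter((t) => t.y >= yMin && t.y <= yMax);
  if (inside.length === 0) return [];

  // 3. Infer the perpendiculars: sort by left edge, then merge items into a
  //    cluster while the gap to the cluster's right edge stays within tolerance.
  const sorted = [...inside].sort((a, b) => a.x - b.x);
  const clusters = [];
  for (const item of sorted) {
    const left = item.x;
    const right = item.x + item.width;
    const last = clusters[clusters.length - 1];
    if (last && left - last.maxRight <= GAP_TOLERANCE) {
      last.maxRight = Math.max(last.maxRight, right);
    } else {
      clusters.push({ minLeft: left, maxRight: right });
    }
  }

  // Boundaries: left edge, the midpoint of each inter-cluster gap, right edge.
  const cols = [clusters[0].minLeft];
  for (let i = 1; i < clusters.length; i++) {
    cols.push((clusters[i - 1].maxRight + clusters[i].minLeft) / 2);
  }
  cols.push(clusters[clusters.length - 1].maxRight);
  return cols;
}

const demoCols = inferSlatColumns(
  [{ y: 100 }, { y: 200 }],
  [
    { x: 10, y: 120, width: 60 },  // "Revenue"
    { x: 200, y: 120, width: 40 }, // "$50,000"
    { x: 300, y: 120, width: 40 }, // "$55,000"
    { x: 10, y: 150, width: 90 },  // "Cost of Goods Sold"
    { x: 200, y: 150, width: 40 },
    { x: 300, y: 150, width: 40 },
    { x: 150, y: 50, width: 30 },  // outside the slats, ignored
  ]
);
console.log(demoCols); // [ 10, 150, 270, 340 ]
```

One limitation of this sketch: a wide header that physically overlaps two data columns would bridge them into a single cluster. Handling that case (for example, by clustering only the body rows) is left out here.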
Modeling Visual Spanning
Once the grid was reconstructed, we hit the final hurdle. For a slat table, how do you handle colspan when there are no physical vertical lines separating the header cells?
In a borderless header, the empty space beside a label reads visually as part of that label, as if the text were spanning across it. So we updated our HTML tableBuilder to model this visual spanning directly.
If the table lacks vertical lines, any cell containing text will automatically increment its colspan by consuming contiguous empty cells to its right:
// If there are no vertical lines, consume subsequent empty cells
while (c + colspan < numCols && cells[r][c + colspan].length === 0) {
  colspan++;
}
This simple algorithmic tweak means that a centered title like "Three Months Ended" automatically expands its colspan across the entire width of the data below it, outputting semantically perfect HTML without a single physical vertical line present in the source PDF.
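The consumption loop can be placed in context with a small row builder. This is a sketch under stated assumptions, not the actual tableBuilder: the names buildRowHtml and escapeHtml, the hasVerticalLines flag, and the choice to emit plain td elements are all hypothetical. As in the snippet above, each cell is an array of text strings, and an empty array means an empty cell.

```javascript
// Minimal escaping so cell text cannot break the generated markup.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// row: array of cells, each cell an array of strings (hypothetical shape).
// hasVerticalLines: when true, no spanning is inferred.
function buildRowHtml(row, hasVerticalLines) {
  const numCols = row.length;
  let html = "<tr>";
  let c = 0;
  while (c < numCols) {
    let colspan = 1;
    // Only a cell that contains text may absorb the empty cells to its right.
    if (!hasVerticalLines && row[c].length > 0) {
      while (c + colspan < numCols && row[c + colspan].length === 0) {
        colspan++;
      }
    }
    const attr = colspan > 1 ? ` colspan="${colspan}"` : "";
    html += `<td${attr}>${escapeHtml(row[c].join(" "))}</td>`;
    c += colspan; // skip the cells that were consumed
  }
  return html + "</tr>";
}

const headerRow = [[], ["Three Months Ended"], []];
console.log(buildRowHtml(headerRow, false));
// <tr><td></td><td colspan="2">Three Months Ended</td></tr>
console.log(buildRowHtml(headerRow, true));
// <tr><td></td><td>Three Months Ended</td><td></td></tr>
```

Note that this sketch only consumes empty cells to the right of a text cell; empty cells to its left are emitted as-is.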
The Result
By treating slat tables as a first-class structural entity and using text to mathematically infer missing geometric boundaries, the engine now extracts complex financial reports with 100% fidelity.
Deterministic extraction isn't about AI guessing the layout; it's about explicitly modeling the visual rules that humans use to read documents. Slat tables rely on the interplay between lines and text spacing—and now, so does our engine.