Engineering Journal
Pdf Processor

How to Classify PDF Tables by Their Borders (and Why It Matters): LATTICE vs STREAM

2026-05-15

TLDR: LATTICE_TABLE has physical ruling lines; STREAM_TABLE has none. tableBuilder.isBorderless skips colspan/rowspan expansion for stream tables. Each type gets a different wrapper class so CSS can render bordered vs borderless table semantics correctly.

Repo: tools/pdf-processor

The Two Kinds of Table

Lattice tables draw their structure. You can see the lines in the PDF. LatticeReconstructor finds horizontal and vertical path segments, clusters them into rows and columns, and produces a cell grid.
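The clustering step might look roughly like the sketch below. It is an illustration under assumptions, not LatticeReconstructor's actual code: the segment shape ({ x1, y1, x2, y2 }), the tolerance of 2 units, and the names clusterCoords and buildGrid are all hypothetical.

```javascript
// Hypothetical sketch: segment shape, tolerance, and function names are assumptions.

// Merge nearly equal coordinates (within `tol` of the previous one) into single grid lines.
function clusterCoords(values, tol = 2) {
  const sorted = [...values].sort((a, b) => a - b);
  const clusters = [];
  for (const v of sorted) {
    const last = clusters[clusters.length - 1];
    if (last && v - last.max <= tol) {
      last.sum += v;
      last.count += 1;
      last.max = v;
    } else {
      clusters.push({ sum: v, count: 1, max: v });
    }
  }
  // Represent each cluster by the mean of its members.
  return clusters.map((c) => c.sum / c.count);
}

// Split segments into horizontal and vertical, then derive row and column boundaries.
function buildGrid(segments, tol = 2) {
  const horizontal = segments.filter((s) => Math.abs(s.y1 - s.y2) <= tol);
  const vertical = segments.filter((s) => Math.abs(s.x1 - s.x2) <= tol);
  const hLines = clusterCoords(horizontal.map((s) => (s.y1 + s.y2) / 2), tol);
  const vLines = clusterCoords(vertical.map((s) => (s.x1 + s.x2) / 2), tol);
  return {
    hLines,
    vLines,
    rows: Math.max(hLines.length - 1, 0),
    cols: Math.max(vLines.length - 1, 0),
  };
}

// Three horizontal and three vertical rules, slightly skewed as real paths often are.
const grid = buildGrid([
  { x1: 0, y1: 0, x2: 100, y2: 0 },
  { x1: 0, y1: 20, x2: 100, y2: 20.5 },
  { x1: 0, y1: 40, x2: 100, y2: 40 },
  { x1: 0, y1: 0, x2: 0, y2: 40 },
  { x1: 50, y1: 0, x2: 50.4, y2: 40 },
  { x1: 100, y1: 0, x2: 100, y2: 40 },
]);
console.log(grid.rows, grid.cols); // 2 2
```

Three clustered lines in each direction bound a 2 x 2 cell grid, which is why rows and cols are one less than the line counts.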

Stream tables imply their structure. Columns exist only because text items share consistent X positions across multiple rows. No ruling lines. StreamDetector finds these by Y-band grouping and X-position clustering.
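A minimal sketch of Y-band grouping followed by X-position clustering, under stated assumptions: the item shape ({ text, x, y }), the tolerances, the minRows threshold, and the function names are hypothetical and not taken from StreamDetector.

```javascript
// Hypothetical sketch: item shape, tolerances, minRows, and function names are assumptions.

// Group text items into rows: an item joins the current band if its Y is
// within `yTol` of the band's first item.
function groupIntoBands(items, yTol = 3) {
  const sorted = [...items].sort((a, b) => a.y - b.y);
  const bands = [];
  for (const item of sorted) {
    const band = bands[bands.length - 1];
    if (band && Math.abs(item.y - band.y) <= yTol) band.items.push(item);
    else bands.push({ y: item.y, items: [item] });
  }
  return bands;
}

// Keep only X positions where items start in at least `minRows` different bands.
function findColumns(bands, xTol = 4, minRows = 2) {
  const columns = []; // each entry: { x, rows: Set of band indexes }
  bands.forEach((band, rowIndex) => {
    for (const item of band.items) {
      const col = columns.find((c) => Math.abs(c.x - item.x) <= xTol);
      if (col) col.rows.add(rowIndex);
      else columns.push({ x: item.x, rows: new Set([rowIndex]) });
    }
  });
  return columns
    .filter((c) => c.rows.size >= minRows)
    .map((c) => c.x)
    .sort((a, b) => a - b);
}

const items = [
  { text: "Voltage", x: 72, y: 100 },
  { text: "12 V", x: 200, y: 100.5 },
  { text: "Current", x: 72.5, y: 115 },
  { text: "2 A", x: 201, y: 115 },
  { text: "Footnote", x: 140, y: 130 },
];
const bands = groupIntoBands(items);
const columnXs = findColumns(bands);
console.log(bands.length, columnXs); // 3 [ 72, 200 ]
```

The footnote forms its own band but starts at an X position no other row shares, so it does not create a column.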

Both detectors output the same kind of synthetic lattice object, because both feed tableBuilder, the shared downstream renderer.


The Naive Approach

Render everything the same way. Both types go through tableBuilder, which expands cells according to the line grid.

For stream tables this is catastrophic. StreamDetector outputs hLines: [] and vLines: []. Without ruling lines, tableBuilder's colspan/rowspan expansion loops run from each cell to the grid edge, collapsing all content into one giant spanning cell.


The isBorderless Flag

const isBorderless = hLines.length === 0 && vLines.length === 0;

When true, tableBuilder skips all expansion logic. Every cell gets colspan=1, rowspan=1. The grid resolves to individual cells positioned by their text Y and X coordinates.
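To make the failure and the fix concrete, here is a simplified model of span expansion. The grid model (ruling lines stored as boundary indexes) and the names expandToLines and computeSpan are assumptions for illustration; only the isBorderless expression comes from tableBuilder.

```javascript
// Hypothetical sketch: the grid model and function names are assumptions.
// hLines / vLines hold the indexes of row / column boundaries that carry a ruling line.

// Naive expansion: grow right and down until a ruling line or the grid edge.
function expandToLines(row, col, grid) {
  const { rows, cols, hLines, vLines } = grid;
  let colspan = 1;
  while (col + colspan < cols && !vLines.includes(col + colspan)) colspan++;
  let rowspan = 1;
  while (row + rowspan < rows && !hLines.includes(row + rowspan)) rowspan++;
  return { colspan, rowspan };
}

// Guarded version: with no ruling lines at all, every cell stays 1 x 1.
function computeSpan(row, col, grid) {
  const { hLines, vLines } = grid;
  const isBorderless = hLines.length === 0 && vLines.length === 0;
  if (isBorderless) return { colspan: 1, rowspan: 1 };
  return expandToLines(row, col, grid);
}

const streamGrid = { rows: 2, cols: 3, hLines: [], vLines: [] };
const latticeGrid = { rows: 2, cols: 3, hLines: [1], vLines: [1] };

console.log(expandToLines(0, 0, streamGrid)); // { colspan: 3, rowspan: 2 }
console.log(computeSpan(0, 0, streamGrid));   // { colspan: 1, rowspan: 1 }
console.log(computeSpan(0, 1, latticeGrid));  // { colspan: 2, rowspan: 1 }
```

Without the guard, the first cell of the stream grid swallows the whole 2 x 3 table. With it, stream cells stay atomic while lattice cells still merge across missing rules.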


RegionType Split

contextClassifier now emits two distinct types, LATTICE_TABLE and STREAM_TABLE.

pageAssembler wraps them in different classes:
<!-- Lattice: bordered data table -->
<div class="pdf-table-wrap pdf-table--lattice">...</div>

<!-- Stream: alignment-only table -->
<div class="pdf-table-wrap pdf-table--borderless">...</div>

The CSS difference:

.pdf-table--lattice td    { border: 1px solid #ccc; padding: 4px 8px; }
.pdf-table--borderless td { padding: 4px 12px 4px 0; }

A spec sheet's label-value pairs look like a structured list when rendered borderless. A data table with visible rules looks like a grid. The distinction produces the correct reading experience for each table type.
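The wrapping step could be as small as a lookup from region type to class name. In this sketch the function name wrapTable and the CLASS_BY_TYPE map are hypothetical; the type names and class names are the ones used in this post.

```javascript
// Hypothetical sketch: wrapTable and CLASS_BY_TYPE are assumptions, not pageAssembler's API.
const CLASS_BY_TYPE = new Map([
  ["LATTICE_TABLE", "pdf-table--lattice"],
  ["STREAM_TABLE", "pdf-table--borderless"],
]);

// Wrap already-rendered table HTML in the class that matches its region type.
function wrapTable(regionType, tableHtml) {
  const modifier = CLASS_BY_TYPE.get(regionType);
  if (!modifier) throw new Error(`Not a table region type: ${regionType}`);
  return `<div class="pdf-table-wrap ${modifier}">${tableHtml}</div>`;
}

console.log(wrapTable("STREAM_TABLE", "<table></table>"));
// <div class="pdf-table-wrap pdf-table--borderless"><table></table></div>
```

Throwing on an unknown type keeps a misclassified region from silently inheriting either table style.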


The Overlap Guard

Both detectors can fire on the same region. Lattice always wins:

// If more than 80% of the candidate stream items are already claimed by a lattice, skip
if (latticeItemCoverage > 0.8) continue;

Physical lines are unambiguous evidence. If LatticeReconstructor already found a grid, StreamDetector should not reclassify it.
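One way the coverage figure might be computed is sketched below. The bounding-box shape, the candidate shape, and the helper names are assumptions; only the 0.8 threshold and the latticeItemCoverage name come from the snippet above.

```javascript
// Hypothetical sketch: box and candidate shapes and helper names are assumptions.
const LATTICE_COVERAGE_THRESHOLD = 0.8;

// True when a text item's anchor point falls inside a lattice bounding box.
function isInside(item, box) {
  return (
    item.x >= box.x && item.x <= box.x + box.width &&
    item.y >= box.y && item.y <= box.y + box.height
  );
}

// Fraction of a candidate's items that some lattice already claims.
function latticeCoverage(items, latticeBoxes) {
  if (items.length === 0) return 0;
  const claimed = items.filter((it) => latticeBoxes.some((b) => isInside(it, b)));
  return claimed.length / items.length;
}

// Drop stream candidates that mostly overlap an existing lattice.
function filterStreamCandidates(candidates, latticeBoxes) {
  const kept = [];
  for (const candidate of candidates) {
    const latticeItemCoverage = latticeCoverage(candidate.items, latticeBoxes);
    if (latticeItemCoverage > LATTICE_COVERAGE_THRESHOLD) continue;
    kept.push(candidate);
  }
  return kept;
}

const latticeBoxes = [{ x: 0, y: 0, width: 100, height: 50 }];
const candidates = [
  { id: "inside-lattice", items: [{ x: 10, y: 10 }, { x: 60, y: 10 }, { x: 10, y: 30 }, { x: 60, y: 30 }] },
  { id: "spec-sheet", items: [{ x: 10, y: 40 }, { x: 10, y: 200 }, { x: 60, y: 200 }, { x: 10, y: 215 }] },
];
console.log(filterStreamCandidates(candidates, latticeBoxes).map((c) => c.id)); // [ 'spec-sheet' ]
```

The second candidate shares one item with the lattice (25% coverage), so it survives as a stream table.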


Why This Matters for Semantic Export

A document exported to HTML for a CMS or documentation system needs correct table semantics. A bordered table in a PDF is a data table. A borderless column alignment is a definition list or a key-value pair block. Treating both as <table border=1> corrupts the semantics of the exported content.

The LATTICE/STREAM split preserves that distinction without requiring the user to manually reclassify tables after export.
