Page Assembly Deepdive
Under the Hood: Building a Correct PDF Page Assembler
TLDR: Turning a list of classified PDF regions into correct HTML requires three things that most implementations skip: proportional column widths from measured gutter positions, zone boundaries derived from content top edges (not midpoints), and a zone content classifier that picks the right CSS layout pattern per zone. Each of these has a specific failure mode if you skip it.
The Problem With 1fr 1fr
Every PDF I have seen that was not a single-column article has asymmetric columns. Datasheets have a 30/70 sidebar-plus-body split. Academic papers have a 48/52 split with a fixed gutter. Technical manuals have a 25/75 margin-plus-content split.
CSS Grid 1fr 1fr produces equal columns regardless of the actual split. The gutter X position is available -- _detectPageColumns finds it and returns it as a number. The original code threw it away. The new return shape preserves it:
return {
  splits: validSplits.map(sx => ({
    x: sx,
    leftFraction: sx / vpWidth,
    rightFraction: 1 - (sx / vpWidth)
  })),
  fullWidthIndices
};
In pageAssembler, this drives a CSS custom property on the zone:
if (cols === 2 && columnSplits.length > 0) {
  const leftFraction = columnSplits[0].leftFraction || 0.5;
  styleAttr = `style="--left-col: ${leftFraction.toFixed(4)};"`;
}
And the CSS rule:
.pdf-zone--cols-2 {
  display: grid;
  grid-template-columns: calc(var(--left-col, 0.5) * 100%) 1fr;
  column-gap: 20px;
}
A 30/70 PDF now renders calc(0.3 * 100%) 1fr. The right column fills the remainder. No hardcoded fractions, no equal-column assumption.
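To make the arithmetic concrete, here is a minimal sketch of the path from a measured gutter position to the inline style. The helper names toColumnSplits and zoneStyleAttr and the 612-unit page width are assumptions for illustration. Only the return shape and the --left-col property come from the code above.

```javascript
// Hypothetical helper: turn measured gutter X positions into fractions of page width.
function toColumnSplits(validSplits, vpWidth) {
  return validSplits.map(sx => ({
    x: sx,
    leftFraction: sx / vpWidth,
    rightFraction: 1 - sx / vpWidth
  }));
}

// Hypothetical helper: build the inline style for a two-column zone.
// Returns an empty string when the zone is not a two-column zone with a known split.
function zoneStyleAttr(cols, columnSplits) {
  if (cols !== 2 || columnSplits.length === 0) return '';
  const leftFraction = columnSplits[0].leftFraction || 0.5;
  return `style="--left-col: ${leftFraction.toFixed(4)};"`;
}

// A 612-unit-wide page with its gutter at x = 183.6 is a 30/70 split.
const exampleSplits = toColumnSplits([183.6], 612);
console.log(zoneStyleAttr(2, exampleSplits)); // style="--left-col: 0.3000;"
```

Because the fraction travels as a custom property, the stylesheet rule stays generic and each zone carries only its own measurement.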
Zone Boundaries: Why midpoints fail
The original zone boundary code:
const y0 = prev
  ? Math.round((prev.list[prev.list.length - 1].yCenter + g.list[0].yCenter) / 2)
  : 0;
yCenter is the midpoint of a region's text content bounding box -- not the region's top edge. When two consecutive zone groups have regions with similar Y positions (a heading at Y=300 immediately above a columnar section starting at Y=302), the midpoint is Y=301.
The zone membership filter was ry >= zone.y0 && ry < zone.y1, where ry = Math.round(region.yCenter). A region with yCenter=300.4 rounds to 300, satisfies 300 < 301, and lands in the first zone. A region with yCenter=300.6 rounds to 301, which equals zone.y1, so it fails the strict ry < zone.y1 test and gets orphaned from the zone it visually belongs to.
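The failure is easy to reproduce in isolation. The sketch below uses hypothetical region objects that carry only yCenter, and reuses the midpoint and filter expressions quoted above. Two regions on the same visual line end up on opposite sides of the boundary.

```javascript
// Midpoint boundary between a heading (yCenter 300) and a columnar section (yCenter 302).
const boundary = Math.round((300 + 302) / 2); // 301
const firstZone = { y0: 0, y1: boundary };

// The original membership test, based on the rounded yCenter.
const inZone = (region, zone) => {
  const ry = Math.round(region.yCenter);
  return ry >= zone.y0 && ry < zone.y1;
};

// Hypothetical regions 0.2 units apart, both part of the heading line.
const regionA = { yCenter: 300.4 };
const regionB = { yCenter: 300.6 };

console.log(inZone(regionA, firstZone)); // true
console.log(inZone(regionB, firstZone)); // false
```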
The fix: use region top edges everywhere.
// _detectAutoZones
const y0 = i === 0 ? 0 : (lead.bbox ? Math.floor(lead.bbox.y) : Math.floor(lead.yCenter));

// assemblePage zone filter
const rBboxY = Math.round(region.bbox?.y ?? region.yCenter);
// ...
const zoneRegions = rendered.filter(r => r.rBboxY >= zone.y0 && r.rBboxY < zone.y1);
A region belongs to the zone it starts in. The top edge of the region is unambiguous. No midpoint arithmetic, no sub-pixel rounding sensitivity.
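A self-contained sketch of the top-edge approach, under some assumptions: groups arrive ordered top to bottom, each zone ends where the next begins, and the last zone ends at the page height. The names topEdge, buildZones and zoneIndexFor and the 792-unit page height are hypothetical; the rounding follows the snippets above.

```javascript
// Top edge of a region, falling back to yCenter when no bbox exists.
const topEdge = region => region.bbox?.y ?? region.yCenter;

// Build half-open zones [y0, y1). Each zone after the first starts at the
// floored top edge of its lead region; pageHeight is an assumed input.
function buildZones(groups, pageHeight) {
  const starts = groups.map((g, i) => (i === 0 ? 0 : Math.floor(topEdge(g.list[0]))));
  return starts.map((y0, i) => ({
    y0,
    y1: i + 1 < starts.length ? starts[i + 1] : pageHeight
  }));
}

// Index of the zone a region starts in, or -1 if no zone contains it.
function zoneIndexFor(region, zones) {
  const y = Math.round(topEdge(region));
  return zones.findIndex(z => y >= z.y0 && y < z.y1);
}

// Two heading fragments with near-identical top edges, then a columnar lead.
const heading = { bbox: { y: 280.4 }, yCenter: 290.4 };
const headingTail = { bbox: { y: 280.6 }, yCenter: 290.6 };
const columnLead = { bbox: { y: 301.7 }, yCenter: 310 };

const zones = buildZones([{ list: [heading, headingTail] }, { list: [columnLead] }], 792);
console.log(zones); // [ { y0: 0, y1: 301 }, { y0: 301, y1: 792 } ]
console.log([heading, headingTail, columnLead].map(r => zoneIndexFor(r, zones))); // [ 0, 0, 1 ]
```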
The Three-Column fullWidthIndices Bug
The original post-correction rescue logic:
const entirelyLeft = columnSplits.every(sx => itemEnd <= sx + tol);
const entirelyRight = columnSplits.every(sx => tm.vx >= sx - tol);
if (entirelyLeft || entirelyRight) fullWidthIndices.delete(idx);
This works for two columns. An item in a two-column layout is either entirely left of the split or entirely right of it. In a three-column layout, an item in the middle column is entirely right of splits[0] but entirely left of splits[1]. It fails both conditions. It stays in fullWidthIndices. It renders as a full-width zone break inside a three-column section.
The fix:
const colBoundaries = [-Infinity, ...columnSplits, Infinity];
const fitsInOneColumn = colBoundaries.slice(0, -1).some((lo, ci) => {
  const hi = colBoundaries[ci + 1];
  return tm.vx >= lo - tol && itemEnd <= hi + tol;
});
if (fitsInOneColumn) fullWidthIndices.delete(idx);
colBoundaries creates N+1 boundary values for N columns: [-Infinity, split0, split1, Infinity] for three columns. fitsInOneColumn checks whether the item fits within any single column's X range. An item in the middle column of a three-column layout fits within [splits[0], splits[1]] and is correctly rescued. An item that genuinely spans two columns does not fit within any single boundary pair and stays full-width.
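The difference between the two checks shows up as soon as there is a second split. The sketch below wraps both versions in standalone functions (the function names, split positions and tolerance are illustrative assumptions) and runs them on a middle-column item and on an item that spans two columns.

```javascript
// Original two-column logic: the item must clear every split on one side.
function rescuedByOriginal(vx, itemEnd, columnSplits, tol) {
  const entirelyLeft = columnSplits.every(sx => itemEnd <= sx + tol);
  const entirelyRight = columnSplits.every(sx => vx >= sx - tol);
  return entirelyLeft || entirelyRight;
}

// Fixed logic: the item must fit between one adjacent pair of boundaries.
function fitsInOneColumn(vx, itemEnd, columnSplits, tol) {
  const colBoundaries = [-Infinity, ...columnSplits, Infinity];
  return colBoundaries.slice(0, -1).some((lo, ci) => {
    const hi = colBoundaries[ci + 1];
    return vx >= lo - tol && itemEnd <= hi + tol;
  });
}

const threeColSplits = [200, 400]; // two splits, three columns
const tol = 2;
const middle = { vx: 210, itemEnd: 390 };  // sits in the middle column
const spanning = { vx: 50, itemEnd: 390 }; // crosses the first split

console.log(rescuedByOriginal(middle.vx, middle.itemEnd, threeColSplits, tol)); // false
console.log(fitsInOneColumn(middle.vx, middle.itemEnd, threeColSplits, tol));   // true
console.log(fitsInOneColumn(spanning.vx, spanning.itemEnd, threeColSplits, tol)); // false
```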
The Zone Content Classifier
The zone content classifier runs inside _detectAutoZones after the groups are formed. It reads the region sequence inside each multi-column zone and assigns a layout pattern class:
// CARD_GRID: 3+ headings at approximately the same Y
const headings = g.list.filter(r => r.type === RegionType.HEADING);
if (headings.length >= 3) {
  const yBuckets = [];
  for (const h of headings) {
    const bucket = yBuckets.find(b => Math.abs(b.y - h.yCenter) < 15);
    if (bucket) { bucket.count++; ... }
    else yBuckets.push({ y: h.yCenter, count: 1 });
  }
  if (yBuckets.some(b => b.count >= 3)) isCardGrid = true;
}

// FEATURE_LAYOUT: 2-col zone where one column is all visual, the other has text
if (zoneCols === 2) {
  const col0AllVisual = col0.every(r => r.type === RegionType.HEADING || r.type === RegionType.IMAGE);
  const col1HasText = col1.some(r => r.type === RegionType.PARAGRAPH || r.type === RegionType.LIST);
  if (col0AllVisual && col1HasText) isFeature = true;
}
CARD_GRID gets layout-card-grid, which overrides grid-template-columns with repeat(auto-fit, minmax(200px, 1fr)): the column count is no longer fixed, and cards wrap onto new rows as the container narrows, much as they would with flex-wrap. FEATURE_LAYOUT gets layout-feature with align-items: start, so the taller column does not stretch the shorter one.
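The two rules can be exercised as one standalone function. This is a sketch under assumptions rather than the actual classifier: the RegionType values, the classifyZone name and the per-region col index are hypothetical, and whatever the elided part of the bucket branch does is not reproduced. It also adds a guard so an empty first column does not count as "all visual", since every() returns true on an empty array.

```javascript
// Hypothetical stand-in for the RegionType enum.
const RegionType = { HEADING: 'heading', IMAGE: 'image', PARAGRAPH: 'paragraph', LIST: 'list' };

// Classify a multi-column zone. Assumes each region has type, yCenter and a
// 0-based col index; returns a CSS class name or null.
function classifyZone(regions, zoneCols, yTolerance = 15) {
  // CARD_GRID: three or more headings sharing roughly the same Y.
  const yBuckets = [];
  for (const h of regions.filter(r => r.type === RegionType.HEADING)) {
    const bucket = yBuckets.find(b => Math.abs(b.y - h.yCenter) < yTolerance);
    if (bucket) bucket.count++;
    else yBuckets.push({ y: h.yCenter, count: 1 });
  }
  if (yBuckets.some(b => b.count >= 3)) return 'layout-card-grid';

  // FEATURE_LAYOUT: first column purely visual, second column carries text.
  if (zoneCols === 2) {
    const col0 = regions.filter(r => r.col === 0);
    const col1 = regions.filter(r => r.col === 1);
    const isVisual = r => r.type === RegionType.HEADING || r.type === RegionType.IMAGE;
    const isText = r => r.type === RegionType.PARAGRAPH || r.type === RegionType.LIST;
    if (col0.length > 0 && col0.every(isVisual) && col1.some(isText)) return 'layout-feature';
  }
  return null;
}

const cards = [
  { type: RegionType.HEADING, yCenter: 100, col: 0 },
  { type: RegionType.HEADING, yCenter: 104, col: 1 },
  { type: RegionType.HEADING, yCenter: 98, col: 2 },
  { type: RegionType.PARAGRAPH, yCenter: 140, col: 0 }
];
const feature = [
  { type: RegionType.IMAGE, yCenter: 200, col: 0 },
  { type: RegionType.PARAGRAPH, yCenter: 180, col: 1 },
  { type: RegionType.LIST, yCenter: 240, col: 1 }
];
console.log(classifyZone(cards, 3));   // layout-card-grid
console.log(classifyZone(feature, 2)); // layout-feature
```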
Table Cell Content: From esc() to rebuildText
The original tableBuilder.js cell assignment:
cells[bestR][bestC].push({ text: item.str.trim(), x: sx });
And cell render:
const cellContent = esc(allItems.map(i => i.text).join(' ').trim());
Two problems. First, { text, x } drops the bold, italic, and underline styling carried on the item. Second, esc(join(' ')) collapses multi-line cell content into a single string, so line breaks are lost along with the formatting.
The fix stores the full item plus position:
cells[bestR][bestC].push({ ...item, _x: sx, _y: sy });
Sort by Y then X before rendering:
cells[r][c].sort((a, b) => a._y !== b._y ? a._y - b._y : a._x - b._x);
And render through rebuildText with inline-html format:
const cellContent = rebuildText(allItems, 0, { format: 'inline-html' }) || ' ';
inline-html emits <strong>, <em>, <u> for styled runs and joins multi-line content with <br> instead of a space. A cell with a bold header and two lines of body text now renders <th><strong>Header</strong></th> and <td>Line 1<br>Line 2</td> respectively.
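The internals of rebuildText are not shown here, so the following is only a stand-in that mimics the described output. Assumptions: items carry str plus boolean bold, italic and underline flags (hypothetical names), _y grows downward, items within 2 units of the same _y share a line, and esc is a simple escaper written for this sketch.

```javascript
// Hypothetical HTML escaper standing in for esc().
const esc = s => s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

// Wrap one text item in inline tags based on its (assumed) style flags.
function styleRun(item) {
  let html = esc(item.str.trim());
  if (item.underline) html = `<u>${html}</u>`;
  if (item.italic) html = `<em>${html}</em>`;
  if (item.bold) html = `<strong>${html}</strong>`;
  return html;
}

// Stand-in renderer: sort by Y then X, group items into lines by Y proximity,
// join runs on a line with a space and lines with <br>.
function renderCell(items, lineTol = 2) {
  const sorted = [...items].sort((a, b) => a._y !== b._y ? a._y - b._y : a._x - b._x);
  const lines = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item._y) < lineTol) last.runs.push(styleRun(item));
    else lines.push({ y: item._y, runs: [styleRun(item)] });
  }
  return lines.map(l => l.runs.join(' ')).join('<br>');
}

const headerCell = [{ str: 'Header', bold: true, _x: 10, _y: 20 }];
const bodyCell = [
  { str: 'Line 2', _x: 10, _y: 52 },
  { str: 'Line 1', _x: 10, _y: 40 }
];
console.log(renderCell(headerCell)); // <strong>Header</strong>
console.log(renderCell(bodyCell));   // Line 1<br>Line 2
```

The body cell is deliberately given out of order to show that the Y-then-X sort, not insertion order, decides the line sequence.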
What This Feeds
The assembled output now drives a correct box model: proportional columns, accurate zone boundaries, layout-appropriate CSS, and styled table cells. The selection-mode drag-and-drop system works on the .pdf-zone and .pdf-region classes, which are unchanged. The export engines query by pageAssembler class names, which are also unchanged. The zone structure is more correct; everything downstream is unaffected.