Page Assembly Postmortem
Post-Mortem: Implementing the Page Assembly Refactor
TLDR: I wrote a 600-line architecture document for the page assembly refactor, the user implemented all nine steps in one session, and a ReferenceError on the first PDF load revealed a scoping mistake I had quietly designed in. Here is what actually happened.
What I Was Trying to Build
The page-assembly-design-philosophy document describes a nine-step refactor of pageAssembler.js and contextClassifier.js. The goals: proportional column widths instead of 1fr 1fr, correct zone boundaries, a zone content classifier (CARD_GRID, FEATURE_LAYOUT), table cells that preserve bold/italic, and a fix for the three-column fragmentation bug.
I wrote the spec across two sessions, stress-tested the architectural decisions, and believed it was ready to implement.
What Failed
The pageWidth ReferenceError
The most expensive failure was a scoping mistake buried in the layout classifier.
The architecture document specified a FEATURE_LAYOUT detection heuristic inside _detectAutoZones. That heuristic needs to compare region X positions against the page midpoint -- r.bbox.x < pageWidth / 2. I wrote that into the spec without noting that pageWidth is only defined inside assemblePage(), not in _detectAutoZones, which is a module-level function.
The result:
Error loading PDF 1: Error: pageWidth is not defined
at worker.onmessage (fileUpload.js:83:24)
The error fires on every PDF load, immediately, before any regions are classified. The stacktrace points at the worker message handler, not at _detectAutoZones. Without reading the file, the error is not self-explanatory.
The fix is one line: add pageWidth as a third parameter to _detectAutoZones and pass it at the call site. That is two characters of change after you know what the problem is. Finding the problem from the stacktrace took longer.
The underlying cause: the architecture document described behavior without specifying the execution boundary. "The zone classifier checks r.bbox.x < pageWidth / 2" is a description of what the code does, not where the code lives. I should have written "pass pageWidth to _detectAutoZones as a third argument" explicitly.
columnSplits Shape Change
The Step 1 change turned columnSplits from number[] to {x, leftFraction, rightFraction}[]. Three places in the existing code read columnSplits as numbers. The architecture document specified the shape change and noted that callers would need updating, but only for the two files explicitly in scope.
The actual callers that needed updating: _splitByColumns (uses splits as numbers internally -- already protected by the local .map(s => s.x) extraction), the fallback in classifyPage (needed .map(s => s.x) on the fallback result), and any future caller that destructures the array expecting numbers.
This one did not break because the user was careful about where the .map(s => s.x) extraction happened. But the architecture document understated the blast radius.
What Survived
The zone boundary fix (Step 6) survived intact. Using bbox.y instead of the yCenter midpoint is correct and the implementation matches the spec exactly.
The wide region promotion (Step 7) survived. The 0.65 threshold and the placement of the pass (after _detectPageColumns, before column bucket assignment) are correct.
The fitsInOneColumn fix (Step 8) survived. The colBoundaries = [-Infinity, ...columnSplits, Infinity] pattern correctly handles any number of splits.
The table cell content fix (Step 3.5) survived. Storing full item objects with _x and _y, sorting by Y then X, calling rebuildText with inline-html -- all three changes are in tableBuilder.js as specified.
The leftFraction CSS variable (--left-col) survived and now drives grid-template-columns: calc(var(--left-col, 0.5) * 100%) 1fr for proportional two-column layouts.
What I Learned
A spec that describes what code does without specifying what scope it runs in has a hidden error. The FEATURE_LAYOUT heuristic was correct. The function signature was not specified. Those are different things, and the second one is what breaks the implementation.
The fix discipline: every time a spec function references a variable by name, ask whether that variable is in scope at the call site of the function, not just at the site where it is used.
Files Changed
tools/pdf-processor/src/extraction/vector/pageAssembler.js--_detectAutoZonessignature, zone Y-boundary logic, layout classifier,--left-colvariable wiringtools/pdf-processor/src/extraction/vector/contextClassifier.js--_detectPageColumnsreturn shape,fitsInOneColumnpost-correction, wide region promotion passtools/pdf-processor/src/extraction/vector/tableBuilder.js-- full item objects in cell buckets, Y-then-X sort,rebuildTextinline-html cell content