Engineering Journal
Pdf Processor

Path Reconciler Postmortem

2026-05-30

Post-Mortem: Building a PDF Path Reconciler That Actually Works

TLDR: I wrote an architecture document, stress-tested it against 8 real-world PDF edge cases before touching a single file, threw away three core decisions, and ended up with something completely different from the original plan. Here is what failed and why.


What I Was Trying to Build

The PDF extraction pipeline needs to turn a raw PDF.js operator list into clean, typed path segments -- horizontal lines, vertical lines, diagonals, table frames. The architecture document described a pathReconciler.js module that would sit between ctmAdapter.js (which reads the operator list) and latticeReconstructor.js (which infers table grids from those segments).

The original plan had three load-bearing ideas:

  1. Use shapeId gating in LatticeReconstructor._mergeH to prevent compound path segments from merging across table cell boundaries.
  2. Use de Casteljau subdivision to get an approximate bounding box for Bezier curves.
  3. Partition dashes into runs by detecting consecutive same-color single-segment subpaths.

All three were wrong.


What Failed

The ShapeId Guard

The plan proposed tagging each segment with the ID of its source constructPath call. Then, inside LatticeReconstructor, when two collinear segments are candidates for merging, check that they share the same constructPathId. If they don't, skip the merge.

This sounds airtight. It is not.

JasperReports generates tables where every cell border is a separate path operation. A 4-column, 10-row table might have 70 separate constructPath calls, each drawing one border segment. Under the proposed rule, no two segments would ever merge: every border would remain an isolated line, and no connected grid would emerge.

The same failure mode appears in LibreOffice Writer exports and in any PDF generator that emits path operations per-cell rather than per-table.

The correct place to prevent phantom cross-path merges is in pathReconciler itself -- emit segments correctly classified per subpath, and let LatticeReconstructor merge freely. The gap-based test inside lattice reconstruction is the right guard; a shapeId whitelist is too restrictive.
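A minimal sketch of the contrast, not the actual LatticeReconstructor._mergeH code. The segment shape { y, x0, x1, pathId }, the function name mergeHorizontal, and the maxGap and yTol tolerances are all assumptions made for illustration.

```javascript
// Hypothetical sketch: merge collinear horizontal segments by gap alone,
// with an optional same-path guard to show why that guard is too strict.
function mergeHorizontal(segments, { maxGap = 1.0, yTol = 0.5, requireSamePath = false } = {}) {
  const band = (seg) => Math.round(seg.y / yTol); // coarse Y bucket (assumed)
  const sorted = [...segments].sort((a, b) => band(a) - band(b) || a.x0 - b.x0);
  const merged = [];
  for (const seg of sorted) {
    const last = merged[merged.length - 1];
    const collinear = last !== undefined && band(seg) === band(last);
    const closeEnough = collinear && seg.x0 - last.x1 <= maxGap;
    const pathOk = !requireSamePath || (last !== undefined && seg.pathId === last.pathId);
    if (closeEnough && pathOk) {
      last.x1 = Math.max(last.x1, seg.x1); // extend the running segment
    } else {
      merged.push({ ...seg }); // start a new run (copy, so input is untouched)
    }
  }
  return merged;
}

// One table rule drawn as four per-cell borders, each from its own path op.
const perCellBorders = [
  { y: 100, x0: 0, x1: 50, pathId: 1 },
  { y: 100, x0: 50, x1: 100, pathId: 2 },
  { y: 100, x0: 100, x1: 150, pathId: 3 },
  { y: 100, x0: 150, x1: 200, pathId: 4 },
];

const gapOnly = mergeHorizontal(perCellBorders);
const withIdGuard = mergeHorizontal(perCellBorders, { requireSamePath: true });
console.log(gapOnly.length, withIdGuard.length); // 1 4
```

With the gap test alone the four borders collapse into one rule from x = 0 to x = 200. With the same-path guard switched on, nothing merges.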

De Casteljau for Bounding Boxes

The plan said: subdivide each cubic Bezier until all control points fit within a bounding-box tolerance, then use the control point hull as the bbox.

De Casteljau is the right algorithm for curve rendering and nearest-point computation. It is the wrong algorithm for bounding box calculation.

The correct approach is analytical: on each axis, the extrema of a cubic Bezier occur at t = 0, t = 1, and the roots of the derivative B'(t) = 0 that fall inside (0, 1). The derivative of a cubic is a quadratic, so solving it takes one square root per axis and at most eight scalar evaluations of the curve. No subdivision loop. No tolerance parameter. No accumulated floating-point error from repeated midpoint splits. Exact, up to ordinary floating-point rounding.
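The analytical approach can be sketched as follows. This is not the curveBboxContrib function from the module. The names cubicAt, axisExtremaParams, and cubicBezierBounds are hypothetical, and the 1e-12 degenerate-case threshold is an assumption.

```javascript
// Evaluate one coordinate of a cubic Bezier at parameter t.
function cubicAt(p0, p1, p2, p3, t) {
  const u = 1 - t;
  return u * u * u * p0 + 3 * u * u * t * p1 + 3 * u * t * t * p2 + t * t * t * p3;
}

// Roots of B'(t) = 0 on one axis, restricted to the open interval (0, 1).
// B'(t) / 3 = a*t^2 + b*t + c
function axisExtremaParams(p0, p1, p2, p3) {
  const a = -p0 + 3 * p1 - 3 * p2 + p3;
  const b = 2 * (p0 - 2 * p1 + p2);
  const c = p1 - p0;
  const roots = [];
  if (Math.abs(a) < 1e-12) {
    // Derivative degenerates to a line (or a constant, which has no root).
    if (Math.abs(b) > 1e-12) roots.push(-c / b);
  } else {
    const disc = b * b - 4 * a * c;
    if (disc >= 0) {
      const s = Math.sqrt(disc);
      roots.push((-b + s) / (2 * a), (-b - s) / (2 * a));
    }
  }
  return roots.filter((t) => t > 0 && t < 1);
}

// Tight bounding box: endpoints plus the interior extrema on each axis.
function cubicBezierBounds([x0, y0], [x1, y1], [x2, y2], [x3, y3]) {
  const xs = [x0, x3, ...axisExtremaParams(x0, x1, x2, x3).map((t) => cubicAt(x0, x1, x2, x3, t))];
  const ys = [y0, y3, ...axisExtremaParams(y0, y1, y2, y3).map((t) => cubicAt(y0, y1, y2, y3, t))];
  return {
    minX: Math.min(...xs),
    maxX: Math.max(...xs),
    minY: Math.min(...ys),
    maxY: Math.max(...ys),
  };
}

// An arch whose control points reach y = 100 but whose curve peaks at y = 75.
const bounds = cubicBezierBounds([0, 0], [0, 100], [100, 100], [100, 0]);
console.log(bounds); // { minX: 0, maxX: 100, minY: 0, maxY: 75 }
```

The control-point hull would report maxY = 100 for this curve; the analytical box stops at the true peak.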

The de Casteljau approach also had a subtler failure: sub-threshold segments produced by the subdivision loop would then be filtered by a minLen guard downstream, so bounding boxes came out too small for curves with steep diagonals (callout tails, logo arcs).

Consecutive Dash Partitioning

The original dash merge algorithm walked the subpath list in order, accumulating a run as long as consecutive subpaths were collinear and same-color. When the run broke, it emitted a merged segment.

This fails for any PDF that intersperses horizontal dashes with other drawing ops. The dashes that form one logical dashed line are non-adjacent in the operator list. Consecutive partitioning misses them.

The fix: global partition. Group ALL single-segment FREE_PATH subpaths by (strokeColor, round(strokeWidth × 2), orientation, Y-band bucket). Sort each partition by position. Gap-test within partition. This correctly merges all dashes on the same logical line regardless of where they appear in the operator stream.

The strokeWidth bucket was also missing from the original partition key. Collinear lines at the same Y but different widths would have merged into a single segment -- a wide rule absorbing thin hairlines alongside it.
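A minimal sketch of the global partition, under stated assumptions: each single-segment FREE_PATH subpath is reduced to a flat object { x0, y0, x1, y1, strokeColor, strokeWidth }, inputs are axis-aligned, and the bandSize and maxGap defaults are invented for illustration. Only the partition key itself comes from the description above; for vertical dashes the sketch buckets by X instead of Y.

```javascript
// Hypothetical sketch of global dash partitioning and gap-based merging.
function mergeDashes(dashes, { bandSize = 1.0, maxGap = 6.0 } = {}) {
  const partitions = new Map();
  for (const d of dashes) {
    const horizontal = Math.abs(d.y1 - d.y0) <= Math.abs(d.x1 - d.x0);
    const along = horizontal ? [d.x0, d.x1] : [d.y0, d.y1];
    // Coordinate across the line: Y for horizontal dashes, X for vertical.
    const cross = horizontal ? (d.y0 + d.y1) / 2 : (d.x0 + d.x1) / 2;
    const orientation = horizontal ? "H" : "V";
    const key = [
      d.strokeColor,
      Math.round(d.strokeWidth * 2), // width bucket keeps hairlines apart from rules
      orientation,
      Math.round(cross / bandSize),  // band bucket
    ].join("|");
    if (!partitions.has(key)) partitions.set(key, []);
    partitions.get(key).push({
      orientation,
      cross,
      start: Math.min(...along),
      end: Math.max(...along),
      strokeColor: d.strokeColor,
      strokeWidth: d.strokeWidth,
    });
  }

  const merged = [];
  for (const runs of partitions.values()) {
    runs.sort((a, b) => a.start - b.start); // order by position, not stream order
    let current = { ...runs[0] };
    for (const run of runs.slice(1)) {
      if (run.start - current.end <= maxGap) {
        current.end = Math.max(current.end, run.end);
      } else {
        merged.push(current);
        current = { ...run };
      }
    }
    merged.push(current);
  }
  return merged;
}

// Three thin dashes at y = 50, interrupted in the stream by a thick rule.
const stream = [
  { x0: 0, y0: 50, x1: 4, y1: 50, strokeColor: "#000", strokeWidth: 0.5 },
  { x0: 0, y0: 50, x1: 20, y1: 50, strokeColor: "#000", strokeWidth: 2 },
  { x0: 8, y0: 50, x1: 12, y1: 50, strokeColor: "#000", strokeWidth: 0.5 },
  { x0: 16, y0: 50, x1: 20, y1: 50, strokeColor: "#000", strokeWidth: 0.5 },
];
const lines = mergeDashes(stream);
console.log(lines.length); // 2
```

The thin dashes merge into one line from 0 to 20 even though the thick rule sits between them in the stream, and the thick rule stays separate because its width bucket differs.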


What Survived

The SubpathRecord data structure survived intact: segs[] for line segments in PDF user-space, curves[] for raw cubic control points, closed, filled, strokeWidth, strokeColor, fillColor, constructPathId, ctm, id. Every field earned its place.
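For orientation, one possible shape for such a record. The field names are the ones listed above; every type, default, and the makeSubpathRecord factory itself are guesses made for illustration.

```javascript
// Hypothetical factory for a SubpathRecord-like object.
let nextSubpathId = 0;

function makeSubpathRecord(overrides = {}) {
  return {
    id: nextSubpathId++,      // unique per subpath (assumed numeric)
    constructPathId: -1,      // source constructPath op (assumed numeric index)
    segs: [],                 // line segments in PDF user space, e.g. { x0, y0, x1, y1 }
    curves: [],               // raw cubic control points, e.g. [p0, p1, p2, p3]
    closed: false,
    filled: false,
    strokeWidth: 1,
    strokeColor: null,
    fillColor: null,
    ctm: [1, 0, 0, 1, 0, 0],  // identity matrix in PDF [a b c d e f] order (assumed layout)
    ...overrides,
  };
}

const record = makeSubpathRecord({
  constructPathId: 7,
  closed: true,
  segs: [{ x0: 0, y0: 0, x1: 10, y1: 0 }],
});
console.log(Object.keys(record).length); // 10
```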

The classification scheme (RECT, ROUNDED_RECT, POLYGON, FREE_PATH) survived. The thin rect normalization rule (h < eps collapses to a centered H segment) survived. The reconcile/emit pattern survived.
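The thin rect rule might look something like this. The rect shape { x, y, w, h }, the function name normalizeThinRect, and the eps default are assumptions; only the rule "h < eps collapses to a centered H segment" comes from the design.

```javascript
// Hypothetical sketch: collapse a very thin rectangle into a horizontal
// segment running along its vertical center.
function normalizeThinRect(rect, eps = 1.0) {
  if (Math.abs(rect.h) >= eps) {
    return null; // not thin: leave it classified as RECT
  }
  return {
    kind: "H",
    x0: Math.min(rect.x, rect.x + rect.w), // tolerate negative widths
    x1: Math.max(rect.x, rect.x + rect.w),
    y: rect.y + rect.h / 2,                // centered on the rect's height
  };
}

const rule = normalizeThinRect({ x: 10, y: 20, w: 100, h: 0.5 });
const box = normalizeThinRect({ x: 10, y: 20, w: 100, h: 30 });
console.log(rule, box); // { kind: 'H', x0: 10, x1: 110, y: 20.25 } null
```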

The curveBboxContrib function I wrote as a replacement is exactly 22 lines and handles all curve shapes including the diagonal ops that would have caused problems under subdivision.


What I Learned

Read your own architecture documents like an adversary before you implement them. I wrote these stress tests before touching any code. The cost was one hour of analysis. The benefit was catching these failures up front instead of mid-implementation, when reverting is expensive.

The three failures share a pattern: I reached for a local, elegant-looking solution (shapeId guard at merge time, subdivision, consecutive run detection) when the correct solution required a global view (shape isolation belongs in the upstream emitter, analytical roots are global by definition, global partition covers all adjacency patterns).


Files That Changed
