Engineering Journal
Pdf Processor

Postmortem: The Monolith That Quietly Poisoned Its Own Pipeline

2026-05-31

TLDR

A monolithic classifier hid a step-ordering bug for weeks. Divider detection ran before text classification, so its containment check saw an incomplete region list and misclassified horizontal rules inside paragraphs as dividers. The bug stayed invisible until the refactor exposed it.

Repo: tools/pdf-processor

What We Built

contextClassifier.js started as a focused pipeline: detect columns, detect lattice tables, assemble regions. It was maybe 300 lines. Over several sessions, each new classifier was appended where it fit: underlines, images, headings, lists, stream tables, footers. Each addition made sense in isolation. The file ended at 1150 lines.


What Failed

Phantom hr.pdf-divider elements were appearing in extracted HTML. On documents with horizontal rules near paragraph text, divider detection was claiming segments that belonged inside paragraphs.

The root cause: detectDividers ran at step 9. Text classification (paragraphs, headings, lists) ran at step 11. At step 9, the regions array only contained image, lattice, stream, and box regions. The containment check "is this segment already inside a region?" saw no paragraph regions because none existed yet. Every long horizontal segment inside a paragraph tested clean and became a DIVIDER.
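
The failure mode can be reproduced in a few lines. This is a minimal sketch, not the original code: the region and segment shapes, the isInsideAnyRegion helper, and the minLength threshold are all assumptions made for illustration. Only the name detectDividers comes from the real classifier.

```javascript
// Hypothetical shapes: a region is { type, x0, y0, x1, y1 },
// a horizontal segment is { x0, x1, y }.
function isInsideAnyRegion(segment, regions) {
  return regions.some(
    (r) =>
      segment.x0 >= r.x0 &&
      segment.x1 <= r.x1 &&
      segment.y >= r.y0 &&
      segment.y <= r.y1
  );
}

// Assumed rule: a long horizontal segment that no known region contains
// is a divider. minLength is an illustrative threshold.
function detectDividers(segments, regions, minLength = 200) {
  return segments.filter(
    (s) => s.x1 - s.x0 >= minLength && !isInsideAnyRegion(s, regions)
  );
}

const rule = { x0: 72, x1: 540, y: 410 };
const image = { type: "image", x0: 60, y0: 100, x1: 300, y1: 250 };
const paragraph = { type: "paragraph", x0: 60, y0: 380, x1: 550, y1: 460 };

// Old order: paragraphs are not in the list yet, so the rule tests clean.
const before = detectDividers([rule], [image]);

// New order: the paragraph region exists, so the rule is claimed by it.
const after = detectDividers([rule], [image, paragraph]);

console.log(before.length, after.length); // 1 0
```

The function is identical in both calls. Only the contents of the regions array differ, which is why nothing inside detectDividers looked wrong on inspection.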

This had been shipping for multiple sessions. Documents with few paragraph-area rules looked fine. Documents with decorative horizontal rules inside body text had extra dividers. Nobody had tested the second case carefully enough to catch it.


What We Threw Away

The single-file structure. All 1150 lines of contextClassifier.js were the thing that needed to go. Not any individual function, not the algorithm, not the detection logic itself. The problem was a structure that let divider detection (step 9) run before the text classification it depended on (step 11) with no compile-time or runtime warning.

Also thrown away: _checkAnalysisTier, dead constants, and three duplicate regex patterns that lived in both the classifier and the assembler.


What Survived

The algorithms themselves. Column detection, underline detection via baseline proximity, lattice scoping, stream table bucketing, heading detection via font-size ratio, header/footer position heuristics. All of these were correct. They just needed to stop sharing a mutable namespace and start taking explicit inputs.

The spatialGraph.PageGraph structure also survived and was promoted: it is now constructed once and passed as an argument through the entire pipeline, replacing the ad-hoc proximity queries that each detector had been running independently.
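
A rough sketch of the "construct once, pass as an argument" shape follows. The real PageGraph API is not shown in this post, so PageGraphSketch, its near method, and both detector functions are hypothetical stand-ins. The only point is that the graph is built a single time and every detector receives it explicitly.

```javascript
// Stand-in for spatialGraph.PageGraph. The class name, constructor, and
// near() method are assumptions for illustration only.
class PageGraphSketch {
  constructor(items) {
    this.items = items;
    PageGraphSketch.builds += 1; // track how often the graph is constructed
  }

  // Items whose point lies within `radius` of `item` (naive linear scan).
  near(item, radius) {
    return this.items.filter(
      (o) => o !== item && Math.hypot(o.x - item.x, o.y - item.y) <= radius
    );
  }
}
PageGraphSketch.builds = 0;

// Hypothetical detectors: each takes the shared graph as an explicit input
// instead of running its own proximity query over raw page items.
function countIsolated(graph, radius) {
  return graph.items.filter((i) => graph.near(i, radius).length === 0).length;
}

function countClustered(graph, radius) {
  return graph.items.filter((i) => graph.near(i, radius).length > 0).length;
}

const items = [
  { x: 0, y: 0 },
  { x: 3, y: 4 },
  { x: 100, y: 100 },
];

const graph = new PageGraphSketch(items); // built once per page
const isolated = countIsolated(graph, 10);
const clustered = countClustered(graph, 10);

console.log(isolated, clustered, PageGraphSketch.builds); // 1 2 1
```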


What Replaced It

Eleven single-responsibility modules in src/extraction/vector/classifiers/, each exporting one classify function. A 440-line orchestrator that calls them in explicit order. The regions array is still shared and mutable by reference, but the ordering is now visible in the orchestrator's call sequence rather than implicit in variable scope.

The divider detector was moved to step 11.5, after text classification. Phantom dividers stopped appearing.
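
To show what "ordering visible in the call sequence" looks like, here is a sketch of an orchestrator reduced to two steps. The module objects, the PIPELINE array, and the page shape are hypothetical. The real orchestrator is 440 lines and calls eleven modules, and it may not use an array at all.

```javascript
// True if the horizontal segment lies entirely inside the region's box.
function contains(region, seg) {
  return (
    seg.x0 >= region.x0 &&
    seg.x1 <= region.x1 &&
    seg.y >= region.y0 &&
    seg.y <= region.y1
  );
}

// Hypothetical stand-ins for two classifier modules, each exposing
// one classify function that mutates the shared regions array.
const textClassifier = {
  name: "text",
  classify(page, regions) {
    for (const b of page.textBlocks) {
      regions.push({ type: "paragraph", x0: b.x0, y0: b.y0, x1: b.x1, y1: b.y1 });
    }
  },
};

const dividerClassifier = {
  name: "dividers",
  classify(page, regions) {
    // Snapshot so dividers pushed in this loop do not affect later checks.
    const existing = regions.slice();
    for (const seg of page.segments) {
      if (!existing.some((r) => contains(r, seg))) {
        regions.push({ type: "divider", x0: seg.x0, y0: seg.y, x1: seg.x1, y1: seg.y });
      }
    }
  },
};

// The order is data: text classification is listed before dividers.
const PIPELINE = [textClassifier, dividerClassifier];

function classifyPage(page, pipeline = PIPELINE) {
  const regions = []; // still shared and mutated by reference
  for (const step of pipeline) step.classify(page, regions);
  return regions;
}

const page = {
  textBlocks: [{ x0: 60, y0: 380, x1: 550, y1: 460 }],
  segments: [
    { x0: 72, x1: 540, y: 410 }, // rule inside the paragraph
    { x0: 72, x1: 540, y: 600 }, // genuine divider below it
  ],
};

const regionTypes = classifyPage(page).map((r) => r.type);
const swappedTypes = classifyPage(page, [dividerClassifier, textClassifier]).map(
  (r) => r.type
);

console.log(regionTypes);  // [ 'paragraph', 'divider' ]
console.log(swappedTypes); // [ 'divider', 'divider', 'paragraph' ]
```

Swapping the two entries reproduces the phantom divider, and the swap is a one-line diff that a reviewer can see. The shared array is no safer than before. The difference is that the dependency between steps now sits in one readable place.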

See the deepdive post for the full before/after code.


The Lesson

Step-ordering bugs in a monolith can stay invisible until the monolith is split. The split forces you to write out the order explicitly, and the explicit order makes it obvious when something is in the wrong place.
