PDF Processor

Postmortem: The Ordering Bug That Only a Refactor Could Find

2026-05-31

TLDR

A pipeline stage ran before the data it depended on was ready. The bug had been shipping for weeks. It was invisible because the ordering was implicit in a 1150-line file. The refactor did not find the bug by inspecting the algorithm. It found the bug by forcing the order to be written down.

Repo: tools/pdf-processor

The Assumption That Seemed Reasonable

Growing a pipeline file by appending is a rational strategy. When you add a new classification stage, you put it where it fits in the existing flow. You run the existing tests. Everything still passes. The file is now 50 lines longer.

Repeat this a dozen times over several months. The file is now 1150 lines. Each individual addition was correct. The whole is a maintenance problem.

The implicit assumption: because each addition passed the tests at the time it was added, the ordering is correct. This assumption is false. Tests can pass with a wrong ordering if the failure mode is document-dependent and the test documents do not exercise the failing case.

When It Failed

Extracted HTML was producing unexpected <hr> elements inside paragraph text. On simple documents with no horizontal rules near body text, everything looked clean. On documents with decorative rules inside body copy, spurious dividers appeared mid-paragraph.

The containment check inside divider detection was correct: "is this segment already inside a known region? If yes, skip it." It never skipped these segments because, at the time it ran, paragraph regions did not exist yet. The region list contained only image, table, and box entries. The paragraph that should have claimed the area had not been processed yet, so the decorative rule was emitted as a divider.

The ordering: divider detection ran at step 9. Text classification ran at step 11.
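The failure can be reproduced in miniature. The sketch below is not the pdf-processor code; the types, the contains helper, and the detectDividers signature are all assumptions made for illustration. It shows the same containment check giving two different answers depending only on whether the paragraph region has been registered yet.

```typescript
// Hypothetical types and names; the real pdf-processor code is not shown in this post.
type RegionKind = "image" | "table" | "box" | "paragraph";

interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Region extends Rect {
  kind: RegionKind;
}

// True when `inner` lies entirely within `outer`.
function contains(outer: Rect, inner: Rect): boolean {
  return (
    inner.x >= outer.x &&
    inner.y >= outer.y &&
    inner.x + inner.width <= outer.x + outer.width &&
    inner.y + inner.height <= outer.y + outer.height
  );
}

// Keeps only the segments that are not already inside a known region.
// The check is correct; its answer depends on what `regions` holds when it runs.
function detectDividers(segments: Rect[], regions: Region[]): Rect[] {
  return segments.filter((seg) => !regions.some((r) => contains(r, seg)));
}

// A decorative rule that sits inside a paragraph's bounding box.
const decorativeRule: Rect = { x: 60, y: 200, width: 100, height: 1 };
const paragraph: Region = { kind: "paragraph", x: 50, y: 100, width: 400, height: 300 };
const image: Region = { kind: "image", x: 50, y: 500, width: 200, height: 100 };

// Buggy order: dividers are detected before text classification adds the paragraph.
const dividersBeforeText = detectDividers([decorativeRule], [image]);

// Fixed order: the paragraph region exists, so the rule is skipped.
const dividersAfterText = detectDividers([decorativeRule], [image, paragraph]);

console.log(dividersBeforeText.length, dividersAfterText.length);
```

Nothing in detectDividers changes between the two calls. Only the contents of the region list do, which is why inspecting the algorithm alone would not have found the bug.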

What Was Actually Wrong

The ordering was not documented. There was no place in the codebase where you could read the pipeline sequence and verify it. The sequence existed only as the top-to-bottom arrangement of code inside a 1150-line function.

Nobody wrote // NOTE: divider detection must run after text classification. Nobody reviewed the ordering as a property of the system. Ordering bugs in this structure are invisible by default.

The refactor forced the order into the open. When each stage is a separate function call in an orchestrator, the sequence is the orchestrator's function body. You can read it in ten lines. You can see immediately that divider detection preceded text classification. The fix was moving one line.
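As a rough sketch of what "the sequence is the orchestrator's function body" can look like, assume each stage is a function that takes a shared context. The stage names, the PipelineContext shape, and the trace field are hypothetical; the real orchestrator has more stages and is not reproduced here.

```typescript
// Hypothetical context and stage names, for illustration only.
interface PipelineContext {
  regions: string[]; // shared, mutable list of region kinds found so far
  trace: string[];   // records the order in which stages ran
  dividersSawParagraphs: boolean;
}

type Stage = (ctx: PipelineContext) => void;

const detectImages: Stage = (ctx) => {
  ctx.regions.push("image");
  ctx.trace.push("detectImages");
};

const detectTables: Stage = (ctx) => {
  ctx.regions.push("table");
  ctx.trace.push("detectTables");
};

const classifyText: Stage = (ctx) => {
  ctx.regions.push("paragraph");
  ctx.trace.push("classifyText");
};

const detectDividers: Stage = (ctx) => {
  // Records whether paragraph regions were visible when this stage ran.
  ctx.dividersSawParagraphs = ctx.regions.includes("paragraph");
  ctx.trace.push("detectDividers");
};

// The pipeline order is this function body, top to bottom.
function runPipeline(): PipelineContext {
  const ctx: PipelineContext = { regions: [], trace: [], dividersSawParagraphs: false };
  detectImages(ctx);
  detectTables(ctx);
  classifyText(ctx);   // registers paragraph regions
  detectDividers(ctx); // must run after classifyText
  return ctx;
}

const result = runPipeline();
console.log(result.trace.join(" -> "));
```

In a layout like this, swapping the last two calls is a two-line diff that a reviewer can question, instead of a property buried in the middle of a long function.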

What Got Deleted

The single-file structure. No individual algorithm was deleted, and no detection logic. All 1150 lines of the classifier were split across 11 focused modules. What was removed is the structure that permitted implicit ordering.

Also deleted: three duplicate regex patterns that existed in both the classifier and the assembler, several dead constants, and a tier-check that had been left in place after the gate was moved to the export layer.

What Replaced It

Eleven modules, each exporting one function. A 440-line orchestrator that calls them in an explicit sequence. The region array is still mutable and shared by reference for performance. The ordering is now written out as function calls in the orchestrator body, making it auditable and making ordering changes visible as code diffs.

Divider detection now runs after text classification. Phantom dividers stopped appearing.

See the deepdive post for the full pipeline structure and code.

The Lesson

Implicit ordering is a silent invariant. Silent invariants do not appear in diffs, do not show up in tests unless a test document exercises the failure case, and do not warn you when someone breaks them. Make ordering explicit. An orchestrator that calls stages in sequence is documentation. A 1150-line file is not.
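One way to go a step further is to make the invariant checkable as well as readable. This is not something the refactor described here did. It is a minimal sketch under assumed names (StageSpec, requires, provides, findOrderingViolations) in which each stage declares what it needs and what it produces, and a small function reports any stage that runs before its inputs exist.

```typescript
// Hypothetical declaration format; not part of the pdf-processor repo.
interface StageSpec {
  name: string;
  requires: string[]; // data that must exist before this stage runs
  provides: string[]; // data this stage makes available to later stages
}

// Walks the sequence in order and reports every unmet requirement.
function findOrderingViolations(sequence: StageSpec[]): string[] {
  const available = new Set<string>();
  const violations: string[] = [];
  for (const stage of sequence) {
    for (const need of stage.requires) {
      if (!available.has(need)) {
        violations.push(`${stage.name} needs ${need}`);
      }
    }
    for (const output of stage.provides) {
      available.add(output);
    }
  }
  return violations;
}

const classifyTextSpec: StageSpec = {
  name: "classifyText",
  requires: [],
  provides: ["paragraph-regions"],
};

const detectDividersSpec: StageSpec = {
  name: "detectDividers",
  requires: ["paragraph-regions"],
  provides: ["dividers"],
};

// The order that shipped the bug, and the order after the fix.
const buggyViolations = findOrderingViolations([detectDividersSpec, classifyTextSpec]);
const fixedViolations = findOrderingViolations([classifyTextSpec, detectDividersSpec]);

console.log(buggyViolations, fixedViolations);
```

A check like this could run in a unit test, so a wrong ordering fails regardless of which test documents are in the suite. It only catches dependencies that someone has declared, so it complements the explicit orchestrator and does not replace it.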
