
When a Module Grows Past Readability: Finding the Seams in a 1150-Line Pipeline

2026-05-31

TLDR

A pipeline stage that handles 11 distinct classification jobs is not one module. It is 11 modules that have not been separated yet. When you split along actual domain boundaries instead of line counts, ordering bugs that were invisible in the monolith become visible immediately, because the order is now written down explicitly.

Repo: tools/pdf-processor

The Problem Class

Any developer who has worked on a data pipeline long enough has a file like this. It starts at 300 lines and grows by appending. Each addition makes sense in isolation. The file ends at 1150 lines and nobody can fully reason about it.

The standard response is to extract utilities. Pull the helper functions into a shared file, leave the main flow intact. This produces a 1100-line file with cleaner helpers. The problem was never the helpers.

The actual problem: each classification domain is interleaved with every other domain. There is no clear boundary between "detect columns" and "use column knowledge to detect tables." Each domain reads state written by the domains above it in ways documented only by the position of the code. The file is not a module. It is 11 modules sharing one scope.

Finding the Seams

The right split boundary is not a line count or a complexity metric. It is a domain boundary: the point where one set of inputs produces one set of outputs that another domain depends on.

In a document layout pipeline the domains are distinct: column detection, underline detection, image region detection, table detection (lattice and stream), box detection, divider detection, heading detection, list detection, header/footer detection, region type assignment. Each produces typed outputs. Each can be tested with a document and a viewport alone.

The test for a real seam: can you write a unit test for this piece that does not require loading the surrounding code? If yes, it is a module. If no, it has entangled dependencies that the split will surface.
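As a minimal sketch of that test: the `detectColumns` below is a hypothetical stand-in (the real detector is not shown in this post), and the item shape `{ x, width }` and viewport shape `{ width, height }` are assumptions. The point is only that the test needs a handful of items and a viewport, nothing else.

```javascript
// Hypothetical stand-in for a column detector; the real one is not shown here.
// Assumed item shape: { x, width }. Assumed viewport shape: { width, height }.
function detectColumns(items, viewport, minGap = viewport.width * 0.05) {
  // Sort horizontal extents, then merge extents that overlap or nearly touch.
  const spans = items
    .map((it) => ({ left: it.x, right: it.x + it.width }))
    .sort((a, b) => a.left - b.left);

  const columns = [];
  for (const span of spans) {
    const last = columns[columns.length - 1];
    if (last && span.left - last.right < minGap) {
      last.right = Math.max(last.right, span.right);
    } else {
      columns.push({ ...span });
    }
  }
  return columns;
}

// The seam test: no pipeline, no other classifier, just inputs and outputs.
const viewport = { width: 600, height: 800 };
const items = [
  { x: 50, width: 200 },
  { x: 60, width: 180 },
  { x: 350, width: 200 },
];
const columns = detectColumns(items, viewport);
console.log(columns.length); // 2
```

If a piece of the monolith cannot be exercised this way without importing half the file, that is the entanglement the split needs to resolve.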

Why Implicit Ordering Hides Bugs

The immediate benefit of the split was not cleaner code. It was that ordering constraints became visible.

In the monolith, divider detection ran at step 9. Text classification (paragraphs, headings, lists) ran at step 11. This was not documented anywhere. The containment check inside divider detection asked: "is this segment already inside a known region?" At step 9, the region list only contained image, lattice, stream, and box entries. Paragraph regions did not exist yet.

Every long horizontal segment inside a paragraph area passed the containment check cleanly and became a spurious divider. The bug had been shipping for multiple sessions. On documents with few horizontal rules near body text it was invisible. On documents with decorative rules in body text, dividers appeared mid-paragraph.

Once the pipeline was written out as an explicit sequence of function calls, the bug was obvious:

// before: implicit ordering, no way to audit it
function classifyPage(items, segments, viewport) {
  // 1150 lines: divider detection somewhere around line 600
  // text classification somewhere around line 900
  // no one can tell which runs first without reading both
}

// after: explicit sequence in the orchestrator
const columns   = detectColumns(items, viewport);
const lattice   = detectLattice(segments, pageGraph);
const images    = detectImages(items, viewport);
const boxes     = detectBoxes(segments, pageGraph);
const stream    = detectStreamTables(items, columns, pageGraph);
const text      = classifyText(items, columns, pageGraph);   // step 11
regions.push(...text);
const dividers  = detectDividers(segments, regions, pageGraph); // step 11.5

The fix was a one-line move in the orchestrator: the divider detection call now sits below text classification. The seam-finding is what made the bug findable.
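The ordering dependency can be reproduced in a few lines. This is a sketch, not the project's code: `isContained` and this `detectDividers` body are hypothetical, the `pageGraph` argument is omitted for brevity, regions are assumed to be plain `{ type, x, y, width, height }` boxes, and segments are assumed to be horizontal `{ x1, x2, y }` lines.

```javascript
// Hypothetical containment check: is a horizontal segment inside a region box?
function isContained(seg, region) {
  return (
    seg.x1 >= region.x &&
    seg.x2 <= region.x + region.width &&
    seg.y >= region.y &&
    seg.y <= region.y + region.height
  );
}

// Hypothetical divider detector: long horizontal segments outside every known region.
function detectDividers(segments, regions, minLength = 100) {
  return segments.filter(
    (seg) =>
      seg.x2 - seg.x1 >= minLength &&
      !regions.some((region) => isContained(seg, region))
  );
}

const paragraph = { type: "paragraph", x: 50, y: 100, width: 400, height: 200 };
const image = { type: "image", x: 50, y: 400, width: 400, height: 150 };
const segments = [
  { x1: 60, x2: 440, y: 180 }, // decorative rule inside the paragraph
  { x1: 50, x2: 450, y: 350 }, // real divider between paragraph and image
];

// Old order: dividers run before text classification has added paragraphs.
const tooEarly = detectDividers(segments, [image]);
// Fixed order: the region list is complete first.
const afterText = detectDividers(segments, [image, paragraph]);

console.log(tooEarly.length, afterText.length); // 2 1
```

The detector is identical in both calls. Only the contents of the region list at call time differ, which is why the bug could not be seen by reading the detector alone.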

The Module Interface

Each classifier module exports a single function. It takes explicit arguments and returns typed outputs. It reads nothing from module scope.

// underlineDetector.js
export function detectUnderlines(items, segments, viewport) {
  // ... returns UnderlineRegion[]
}

// headingDetector.js
export function classifyHeadings(textItems, columns, fontStats) {
  // ... returns HeadingRegion[]
}

The shared spatial index (PageGraph) is constructed once in the orchestrator after segment classification and passed as an argument to every classifier that needs proximity queries. Before the split, each classifier ran its own ad-hoc proximity queries against raw arrays. After the split, every classifier uses the same index, built once.
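The post does not show PageGraph's internals, so the following is only an illustrative sketch of the build-once, pass-everywhere pattern. The uniform-grid design, the `cellSize` parameter, and the `near` method are all assumptions, not the project's actual API.

```javascript
// Hypothetical spatial index: a uniform grid keyed by cell coordinates.
class PageGraph {
  constructor(boxes, cellSize = 50) {
    this.cellSize = cellSize;
    this.cells = new Map();
    for (const box of boxes) {
      const keys = this.keysFor(box.x, box.y, box.x + box.width, box.y + box.height);
      for (const key of keys) {
        if (!this.cells.has(key)) this.cells.set(key, []);
        this.cells.get(key).push(box);
      }
    }
  }

  // Yield the key of every grid cell that the rectangle touches.
  *keysFor(x1, y1, x2, y2) {
    const c = this.cellSize;
    for (let cx = Math.floor(x1 / c); cx <= Math.floor(x2 / c); cx++) {
      for (let cy = Math.floor(y1 / c); cy <= Math.floor(y2 / c); cy++) {
        yield `${cx},${cy}`;
      }
    }
  }

  // Return boxes stored in cells within `radius` of the point. This is a coarse
  // filter; callers that need exact distances should refine the result.
  near(x, y, radius) {
    const found = new Set();
    for (const key of this.keysFor(x - radius, y - radius, x + radius, y + radius)) {
      for (const box of this.cells.get(key) ?? []) found.add(box);
    }
    return [...found];
  }
}

// Built once in the orchestrator, then handed to each classifier as an argument.
const boxes = [
  { id: "a", x: 10, y: 10, width: 30, height: 10 },
  { id: "b", x: 400, y: 400, width: 30, height: 10 },
];
const pageGraph = new PageGraph(boxes);
console.log(pageGraph.near(20, 20, 15).map((b) => b.id)); // [ 'a' ]
```

Whatever the real structure looks like, the design point is the same: the index is an explicit argument, so a classifier's dependence on it shows up in its signature.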

Tradeoffs

The modules still run sequentially and share a mutable region array passed by reference. Each classifier can see what earlier classifiers produced without a copy. The tradeoff: ordering still matters. The improvement is that ordering is now explicit in the orchestrator's call sequence rather than implicit in the position of code hundreds of lines apart.

This is not a functional pipeline with immutable data. That design would carry a different performance cost. For a pipeline that runs in a browser worker on every page render, the mutable pass-by-reference model is the right choice. The explicit ordering is the safety property that was missing.

The One Thing to Watch For

Any classifier that depends on the region list being complete must be positioned after all classifiers that populate it. This is now visible in the orchestrator as a call-sequence constraint. Moving a classifier earlier is a visible code change to a visible sequence. In the old monolith it was invisible. Document this constraint with a comment at the call site, not just in the module itself.
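One way to follow that advice, sketched under assumptions: `createStageTracker` is a hypothetical helper, not part of the codebase, and the classifiers here are stubs so the example runs on its own. The comment states the constraint, and the optional guard fails loudly if someone reorders the calls.

```javascript
// Hypothetical helper: records which stages have run and checks prerequisites.
function createStageTracker() {
  const done = new Set();
  return {
    mark(stage) {
      done.add(stage);
    },
    requireDone(stage, ...prerequisites) {
      const missing = prerequisites.filter((p) => !done.has(p));
      if (missing.length > 0) {
        throw new Error(`${stage} ran before: ${missing.join(", ")}`);
      }
    },
  };
}

// Stub classifiers so the sketch is self-contained.
const classifyText = (items) => [{ type: "paragraph" }];
const detectDividers = (segments, regions) => [];

function classifyPage(items, segments) {
  const stages = createStageTracker();
  const regions = [];

  regions.push(...classifyText(items));
  stages.mark("text");

  // ORDERING: detectDividers checks containment against `regions`, so it must
  // run after every classifier that adds regions, including classifyText.
  stages.requireDone("dividers", "text");
  const dividers = detectDividers(segments, regions);

  return { regions, dividers };
}

console.log(classifyPage([], []).regions.length); // 1
```

The comment alone satisfies the advice above. The tracker is an extra that trades a few lines of bookkeeping for a runtime error instead of a silent misclassification.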
