PDF Processor

A 1000-Line File Is Not a Code Problem. It Is a Design Problem.

2026-05-31

TLDR

Long files are a symptom, not the disease. The disease is implicit step dependencies between functions that share the same mutable scope. Splitting the file forces you to write the dependencies down, and writing them down makes the bugs obvious.

Repo: tools/pdf-processor

What the Industry Does

The standard response to a long file is to extract helpers. Move utility functions to a shared module. Add a utils.js. Keep the main flow intact because "that's where the logic lives." This produces a slightly shorter main file with better-organized utilities and exactly the same coupling.

The industry also reaches for a class: wrap the whole thing in a Classifier with instance state, have every method read and write fields on "this", and treat shared state as a feature ("it's all in one place"). This produces an organized-looking object that is harder to test than the flat function, because now you need to instantiate the class and manage its lifetime.

Why It Is Wrong for This Problem

PDF classification is not a class with state. It is a pipeline with explicit data flow. Each step takes specific inputs and produces specific outputs. The steps have ordering constraints that exist in the domain, not in JavaScript.

When those steps share a mutable closure (or instance state), the ordering constraints become invisible. Calling detectDividers at step 9 and classifyText at step 11 looks just as valid to the reader as the reverse order, because both are simply functions called in sequence. The constraint "dividers must run after text" is not in any type, any test, or any interface. It shows up only in the output on specific documents, and you discover it only by testing the right document at the right moment.
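A minimal sketch of how that failure can look. Everything here is hypothetical: makeClassifier, the block shape, and the divider rule ("a horizontal rule counts as a divider only if a text region already sits above it") are assumptions made for illustration, not code from the repo.

```javascript
// Hypothetical reduction of the shared-scope shape. The names and the divider
// rule are assumptions for illustration, not the repo's actual code.
function makeClassifier(page) {
  const regions = []; // shared mutable state, visible to every step

  function classifyText() {
    for (const block of page.blocks) {
      if (block.kind === "text") regions.push({ type: "text", y: block.y });
    }
  }

  function detectDividers() {
    // Assumed rule: a rule line is a divider only if a text region already
    // sits above it. Nothing in the signature says this step reads regions.
    for (const block of page.blocks) {
      const hasTextAbove = regions.some((r) => r.type === "text" && r.y < block.y);
      if (block.kind === "rule" && hasTextAbove) {
        regions.push({ type: "divider", y: block.y });
      }
    }
  }

  return { regions, classifyText, detectDividers };
}

const page = { blocks: [{ kind: "text", y: 10 }, { kind: "rule", y: 20 }] };

const right = makeClassifier(page);
right.classifyText();
right.detectDividers();
console.log(right.regions.length); // 2: the text region plus the divider

const wrong = makeClassifier(page);
wrong.detectDividers(); // regions is still empty, so the rule line is skipped
wrong.classifyText();
console.log(wrong.regions.length); // 1: the divider is missing and nothing threw
```

Both call orders run without error. Only the output differs, and only on a page that happens to contain a rule line below some text.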

When you split into modules with explicit arguments, the constraint becomes structural. detectDividers takes regions as an argument. Its correctness depends on regions being fully populated. That dependency is now visible in the function signature. Moving the call site is a change you can see and reason about.
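The same idea with explicit arguments might look like the sketch below. The signatures and the divider rule (a rule line is a divider only if a text region sits above it) are assumptions for illustration; the point is only that the dependency appears as a parameter.

```javascript
// Hypothetical signatures, assumed for illustration.
function classifyText(blocks) {
  return blocks
    .filter((b) => b.kind === "text")
    .map((b) => ({ type: "text", y: b.y }));
}

// The second parameter is the dependency: this step cannot be called
// without handing it the text regions it relies on.
function detectDividers(blocks, textRegions) {
  return blocks
    .filter((b) => b.kind === "rule" && textRegions.some((r) => r.y < b.y))
    .map((b) => ({ type: "divider", y: b.y }));
}

const blocks = [{ kind: "text", y: 10 }, { kind: "rule", y: 20 }];
const textRegions = classifyText(blocks);
const dividers = detectDividers(blocks, textRegions);
console.log(dividers.length); // 1
```

Swapping the last two const declarations no longer produces a quietly wrong result. It reads textRegions before it is initialized, which JavaScript rejects with a ReferenceError.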

The Better Approach

One module per detection domain. Each module exports one function. Each function takes explicit arguments and returns an explicit result. The orchestrator owns the ordering. Step-ordering comments in the orchestrator are documentation, not decoration.
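A rough sketch of an orchestrator in that style. The two steps are stand-ins inlined so the example runs as one script; in a split layout each would be the single export of its own module. The function names, the block shape, and the divider rule are all assumptions.

```javascript
// Stand-ins for per-domain modules (hypothetical). In a split layout each of
// these would be the only export of its own file.
const classifyText = (blocks) =>
  blocks.filter((b) => b.kind === "text").map((b) => ({ type: "text", y: b.y }));

const detectDividers = (blocks, textRegions) =>
  blocks
    .filter((b) => b.kind === "rule" && textRegions.some((r) => r.y < b.y))
    .map((b) => ({ type: "divider", y: b.y }));

// The orchestrator is the only place that knows the order of the steps.
function classifyPage(page) {
  // Step 1: text. No dependencies on other steps.
  const textRegions = classifyText(page.blocks);

  // Step 2: dividers. MUST run after step 1: the (assumed) divider rule
  // needs to know where the text regions are.
  const dividers = detectDividers(page.blocks, textRegions);

  // Step 3: merge in reading order, top to bottom.
  return [...textRegions, ...dividers].sort((a, b) => a.y - b.y);
}

const result = classifyPage({
  blocks: [
    { kind: "rule", y: 5 },  // no text above it, so not a divider
    { kind: "text", y: 10 },
    { kind: "rule", y: 20 }, // text above it, so a divider
  ],
});
console.log(result.map((r) => r.type).join(", ")); // text, divider
```

The comments above each step carry the domain reason for the order, and the data flow between the const declarations backs them up.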

This is not a new idea. It is just a pipeline written as a pipeline rather than as a long procedure with implicit state.

The Tradeoff You Are Accepting

Eleven files to navigate instead of one. More import statements. More function call overhead (negligible for this workload). And one compromise that remains: the regions array is still mutable and still crosses module boundaries by reference, because copying it at every step would add overhead in a tight extraction loop.
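To make the by-reference compromise concrete, here is a small sketch contrasting an in-place step with a copying one. Both function names and the divider rule are hypothetical.

```javascript
// Hypothetical in-place step: appends to the caller's array and returns it.
function appendDividersInPlace(blocks, regions) {
  // Snapshot the text regions first so the loop does not read its own output.
  const textRegions = regions.filter((r) => r.type === "text");
  for (const b of blocks) {
    if (b.kind === "rule" && textRegions.some((r) => r.y < b.y)) {
      regions.push({ type: "divider", y: b.y }); // mutates the caller's array
    }
  }
  return regions;
}

// Copying alternative: same logic, but the caller's array is left untouched
// at the cost of one array copy per call.
function appendDividersCopy(blocks, regions) {
  return appendDividersInPlace(blocks, [...regions]);
}

const blocks = [{ kind: "rule", y: 20 }];
const shared = [{ type: "text", y: 10 }];

const copied = appendDividersCopy(blocks, shared);
console.log(copied === shared, shared.length); // false 1

const same = appendDividersInPlace(blocks, shared);
console.log(same === shared, shared.length); // true 2
```

With the in-place version the argument list still documents the dependency, but the caller has to know that the array it passed in comes back changed.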

Where the Common Pattern Is Correct

If you have genuinely shared state that multiple steps read and write in interleaved ways, a class or a shared closure is the right structure. A color picker that tracks mouse position, drag state, and canvas context across multiple event handlers needs shared state. A PDF page classifier that runs 11 sequential detection passes does not.
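For contrast, a sketch of state that is shared because the algorithm needs it. This drag tracker is hypothetical and leaves out the canvas context; the handlers fire in whatever order the user produces events, and each reads what the others wrote.

```javascript
// Hypothetical drag tracker, assumed for illustration. The closure is the
// right home for this state: no fixed pipeline order exists to thread it through.
function createDragTracker() {
  let dragging = false; // written by mousedown and mouseup, read by mousemove
  const path = [];      // appended to by mousedown and mousemove

  return {
    path,
    onMouseDown(e) {
      dragging = true;
      path.push({ x: e.x, y: e.y });
    },
    onMouseMove(e) {
      if (dragging) path.push({ x: e.x, y: e.y });
    },
    onMouseUp() {
      dragging = false;
    },
  };
}

const tracker = createDragTracker();
tracker.onMouseMove({ x: 0, y: 0 }); // ignored: no drag in progress
tracker.onMouseDown({ x: 1, y: 1 });
tracker.onMouseMove({ x: 2, y: 2 });
tracker.onMouseUp();
tracker.onMouseMove({ x: 3, y: 3 }); // ignored: drag has ended
console.log(tracker.path.length); // 2
```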

The question is always: is the shared scope there because the algorithm requires it, or because it was convenient when the first function was written and nobody cleaned it up?
