Splitting a 1150-Line Classifier into 11 Focused Modules
TLDR
contextClassifier.js grew to 1150 lines handling 11 detection jobs at once. Splitting it into 11 single-responsibility modules reduced the orchestrator to 440 lines, made each detector independently testable, and surfaced a step-ordering bug that had been silently producing phantom dividers wherever a long horizontal segment sat inside a paragraph area.
Repo: tools/pdf-processor
The Problem
The original contextClassifier.js classified page regions in one long function: column detection, underline detection, image region detection, lattice table detection, box detection, divider detection, heading detection, list detection, header/footer detection, stream table detection, and region type assignment all in sequence. Each section leaked state into the next. Changing the heading detector required reading 400 lines of preamble to understand what variables were already in scope.
The file had grown organically. At 1150 lines it was past the point where any single developer could hold the whole thing in working memory.
The Naive Attempt
The first instinct was to extract utilities: pull helpers like getBoundingBox and isBorderedRegion into a shared file, leaving the main flow intact. Wrong approach. The helpers were not the problem. The problem was that each detection domain was interleaved with every other domain. Extracting utilities produces a 1100-line file with cleaner helpers, not a maintainable system.
Why It Broke
The root issue was that contextClassifier.js was doing classification, orchestration, and spatial indexing in the same pass. There was no clear boundary between "detect columns" and "use column knowledge to detect tables." Detectors read state written by other detectors in ways documented only by the order of the code.
spatialGraph.PageGraph was being constructed but not threaded through to detectors. Each detector ran its own ad-hoc proximity queries against raw text arrays instead of using the index.
The Correct Model
Each detection domain became its own module in src/extraction/vector/classifiers/:
columnSplitDetector.js
underlineDetector.js
imageRegionDetector.js
latticeDetector.js
boxDetector.js
dividerDetector.js
headingDetector.js
listDetector.js
headerFooterDetector.js
streamTableDetector.js
regionTypes.js
Each detector module exports a single entry-point function (detectColumns, detectDividers, classifyHeading, and so on) and takes explicit arguments. No module reads from another module's scope. The orchestrator calls them in order and passes results through explicitly.
// before: implicit ordering, shared mutable scope
function classifyPage(items, segments, viewport) {
  // 1150 lines of everything mixed together
  // divider detection at step 9, before text classification at step 11
}

// after: explicit pipeline, each step isolated
import { detectColumns } from './classifiers/columnSplitDetector.js';
import { detectLattice } from './classifiers/latticeDetector.js';
import { detectDividers } from './classifiers/dividerDetector.js';

function classifyPage(items, segments, viewport, pageGraph) {
  const columns = detectColumns(items, viewport);
  const lattice = detectLattice(segments, pageGraph);
  // ... 9 more explicit steps (these build up the regions array) ...
  const dividers = detectDividers(segments, regions, pageGraph); // step 11.5
}
The PageGraph instance is constructed once after segment classification and passed into every classifier that needs it. Inline proximity queries across the old monolith were replaced with pageGraph._textIndex.query(), pageGraph._segIndex.query(), and pageGraph.isBorderedRegion().
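As a rough illustration of that threading, the sketch below builds a stand-in index once and hands the same instance to two detectors. ToyPageGraph, ToyIndex, the query(box) signature, and the { x0, y0, x1, y1 } box shape are all assumptions made for this example; the real PageGraph API is not shown in this write-up and may differ.

```javascript
// Hypothetical stand-in for spatialGraph.PageGraph; the real class may differ.
// Boxes are assumed to be { x0, y0, x1, y1 } with x0 <= x1 and y0 <= y1.
function intersects(a, b) {
  return a.x0 <= b.x1 && b.x0 <= a.x1 && a.y0 <= b.y1 && b.y0 <= a.y1;
}

class ToyIndex {
  constructor(entries) {
    this.entries = entries; // a real index would use a grid or R-tree
  }
  // Return every entry whose box intersects the query box.
  query(box) {
    return this.entries.filter((e) => intersects(e.box, box));
  }
}

class ToyPageGraph {
  constructor(textItems, segments) {
    this.textIndex = new ToyIndex(textItems);
    this.segIndex = new ToyIndex(segments);
  }
}

// Two hypothetical detectors that share one graph instead of re-scanning
// raw arrays with their own proximity logic.
function textNear(box, pageGraph) {
  return pageGraph.textIndex.query(box).map((e) => e.text);
}
function segmentsNear(box, pageGraph) {
  return pageGraph.segIndex.query(box).length;
}

// Built once, then passed to every detector that needs it.
const pageGraph = new ToyPageGraph(
  [
    { text: 'Title', box: { x0: 10, y0: 10, x1: 100, y1: 30 } },
    { text: 'Body', box: { x0: 10, y0: 200, x1: 300, y1: 220 } },
  ],
  [{ box: { x0: 10, y0: 32, x1: 100, y1: 32 } }],
);

const probe = { x0: 0, y0: 0, x1: 120, y1: 40 };
const nearbyText = textNear(probe, pageGraph);
const nearbySegments = segmentsNear(probe, pageGraph);
```

The point of the sketch is the calling convention, not the index itself: detectors receive the graph as an argument, so a test can hand them a small fake.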
The divider detector move was the most consequential fix. In the monolith it ran at step 9, before text classification at step 11. At step 9 the regions array only contained image, lattice, stream, and box regions. Any long horizontal segment inside a paragraph area passed the "not inside any region" containment check and became a spurious DIVIDER. After the refactor, divider detection runs at step 11.5 with the full region list.
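The ordering bug can be reproduced in miniature. The sketch below is not the real dividerDetector.js: detectDividersSketch, the containment rule, the MIN_DIVIDER_WIDTH threshold, and the region shapes are assumptions chosen to mirror the description above, and the pageGraph argument is left out for brevity.

```javascript
// Hypothetical shapes: regions and segments use { x0, y0, x1, y1 } boxes.
function contains(region, seg) {
  return (
    region.x0 <= seg.x0 && seg.x1 <= region.x1 &&
    region.y0 <= seg.y0 && seg.y1 <= region.y1
  );
}

// Assumed rule: a long horizontal segment is a divider only if no known
// region contains it. MIN_DIVIDER_WIDTH is an illustrative threshold.
const MIN_DIVIDER_WIDTH = 200;
function detectDividersSketch(segments, regions) {
  return segments.filter((seg) => {
    const isHorizontal = seg.y0 === seg.y1;
    const isLong = seg.x1 - seg.x0 >= MIN_DIVIDER_WIDTH;
    const isClaimed = regions.some((r) => contains(r, seg));
    return isHorizontal && isLong && !isClaimed;
  });
}

// A rule drawn inside a paragraph, plus a genuine divider between sections.
const ruleInParagraph = { x0: 50, y0: 120, x1: 400, y1: 120 };
const realDivider = { x0: 50, y0: 300, x1: 550, y1: 300 };
const segments = [ruleInParagraph, realDivider];

const earlyRegions = [{ type: 'IMAGE', x0: 50, y0: 400, x1: 550, y1: 600 }];
const fullRegions = [
  ...earlyRegions,
  { type: 'PARAGRAPH', x0: 40, y0: 100, x1: 560, y1: 200 },
];

// Old order (step 9): paragraph regions do not exist yet.
const dividersEarly = detectDividersSketch(segments, earlyRegions);
// New order (step 11.5): text regions are already in the list.
const dividersLate = detectDividersSketch(segments, fullRegions);
```

With the early region list both segments pass the check, so the in-paragraph rule becomes a phantom divider. With the full list only the genuine divider survives. The detector code is identical in both runs; only its input changed.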
Two more fixes came out of the refactor. The dropcap guard (single-character items should never be classified as headings) was wired through classifyHeading in headingDetector.js. BULLET_RE and ORDERED_RE regex patterns are now canonical in listDetector.js instead of duplicated in two places.
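A minimal sketch of both fixes follows. The regex bodies, the 1.2 font-size ratio, the item shape, and the return values are illustrative assumptions; the canonical BULLET_RE and ORDERED_RE in listDetector.js and the real classifyHeading signature are not shown here and may differ.

```javascript
// Illustrative patterns only; the canonical definitions live in
// listDetector.js and may not match these.
const BULLET_RE = /^\s*[•\-*]\s+/;
const ORDERED_RE = /^\s*(\d+|[a-zA-Z])[.)]\s+/;

// Single place that decides whether a line starts a list item.
function classifyListItem(text) {
  if (BULLET_RE.test(text)) return 'BULLET';
  if (ORDERED_RE.test(text)) return 'ORDERED';
  return null;
}

// Assumed heading rule: text noticeably larger than body text is a heading,
// except single-character items, which are treated as dropcaps.
function classifyHeading(item, bodyFontSize) {
  const text = item.text.trim();
  if (text.length <= 1) return null; // dropcap guard
  return item.fontSize >= bodyFontSize * 1.2 ? 'HEADING' : null;
}

const dropcap = classifyHeading({ text: 'T', fontSize: 48 }, 10);
const heading = classifyHeading({ text: 'Overview', fontSize: 14 }, 10);
```

Keeping the patterns in one module means a change to what counts as a bullet cannot drift between two copies.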
Tradeoffs
The 11 modules still run sequentially and share a mutable regions array passed by reference. The benefit is that each detector can see what earlier detectors claimed without an expensive copy. The cost is that ordering still matters, but the constraint is now explicit in the orchestrator's call sequence rather than buried in variable shadowing.
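The effect of sharing one array can be seen in a small sketch. The detector names, the region and item shapes, and runPipeline are hypothetical and exist only to show why the call order changes the result.

```javascript
// Hypothetical detectors: each appends its claims to the shared array and can
// read what earlier steps already claimed. No copies are made.
function detectBoxes(segments, regions) {
  for (const seg of segments) {
    if (seg.kind === 'rect') regions.push({ type: 'BOX', source: seg.id });
  }
}

function detectText(items, regions) {
  const boxed = new Set(
    regions.filter((r) => r.type === 'BOX').map((r) => r.source),
  );
  for (const item of items) {
    // Items inside an already-claimed box are labelled differently.
    const type = boxed.has(item.parentId) ? 'BOX_TEXT' : 'PARAGRAPH';
    regions.push({ type, source: item.id });
  }
}

function runPipeline(steps, segments, items) {
  const regions = []; // one array, passed by reference to every step
  for (const step of steps) step({ segments, items, regions });
  return regions.map((r) => r.type);
}

const boxesStep = ({ segments, regions }) => detectBoxes(segments, regions);
const textStep = ({ items, regions }) => detectText(items, regions);

const segments = [{ id: 's1', kind: 'rect' }];
const items = [{ id: 't1', parentId: 's1' }];

const inOrder = runPipeline([boxesStep, textStep], segments, items);
const swapped = runPipeline([textStep, boxesStep], segments, items);
```

Run in order, the text item is recognised as sitting inside a box. Swapped, the same item is labelled a plain paragraph because the box had not been claimed yet.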
What to Watch For
detectDividers receives the regions array at step 11.5, after all text classifiers have run. Moving divider detection earlier for any reason will cause phantom dividers to reappear in documents where paragraph text contains long horizontal rules. The step ordering comment in the orchestrator is load-bearing, not cosmetic.
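One way to back up that comment is a check that fails loudly if the order changes. The sketch below assumes the orchestrator's steps can be listed by name; PIPELINE_ORDER, TEXT_CLASSIFIERS, and assertDividersAfterText are hypothetical names, and the step list is illustrative rather than the real sequence.

```javascript
// Hypothetical step names; the real orchestrator's step list is not shown.
const PIPELINE_ORDER = [
  'columns', 'underlines', 'images', 'lattice', 'boxes',
  'headings', 'lists', 'headerFooter', 'streamTables',
  'dividers', // must stay after every text classifier
  'regionTypes',
];

const TEXT_CLASSIFIERS = ['headings', 'lists', 'headerFooter'];

// Throws if divider detection is scheduled before any text classifier.
function assertDividersAfterText(order) {
  const dividerIndex = order.indexOf('dividers');
  if (dividerIndex === -1) throw new Error('dividers step is missing');
  for (const name of TEXT_CLASSIFIERS) {
    const index = order.indexOf(name);
    if (index === -1 || index > dividerIndex) {
      throw new Error(`dividers must run after ${name}`);
    }
  }
  return true;
}

const orderIsValid = assertDividersAfterText(PIPELINE_ORDER);
```

A check like this could run in a unit test or at module load, so a reordering shows up as a failure instead of as phantom dividers in output.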