Engineering Journal
Pdf Processor

Phantom Dividers: When Step 9 Ran Before Step 11

2026-05-31

TLDR

Bug: spurious hr.pdf-divider elements in extracted HTML. Cause: detectDividers ran at step 9, before text regions existed at step 11. Fix: moved divider detection to step 11.5.

Repo: tools/pdf-processor

Symptom

Extracted HTML contained unexpected <hr class="pdf-divider"> elements inside what should have been paragraph text. On documents with decorative horizontal rules near body text, dividers were being inserted mid-paragraph.


Root Cause

detectDividers checked whether a horizontal segment was "already inside a region" before claiming it as a divider. The check was correct. The input was wrong.

At step 9, the regions array contained only image, lattice, stream, and box regions. Paragraph, heading, and list regions were not created until step 11. Any long horizontal segment inside a paragraph area therefore found no enclosing region and passed the containment check cleanly.

// step 9 -- regions only has image/lattice/stream/box at this point
function detectDividers(segments, regions, pageGraph) {
  return segments.hLines.filter(seg => {
    const inside = regions.some(r => containsBbox(r.bbox, seg));
    return !inside; // always true for paragraph-area segments
  });
}

// step 11 -- text classification runs here
// paragraphs, headings, lists are added to regions
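
To make the ordering dependency concrete, here is a minimal sketch that runs the same containment filter against a step-9 region list and against a full one. The post does not show containsBbox, so the version below is an assumption (axis-aligned boxes with x0, y0, x1, y1 fields), detectDividersSketch is a hypothetical reduction of detectDividers, and all coordinates are made up for illustration.

```javascript
// Assumed bbox shape: { x0, y0, x1, y1 } with x0 <= x1 and y0 <= y1.
// The real containsBbox is not shown in this post; this is a stand-in.
function containsBbox(outer, inner) {
  return (
    outer.x0 <= inner.x0 &&
    outer.y0 <= inner.y0 &&
    outer.x1 >= inner.x1 &&
    outer.y1 >= inner.y1
  );
}

// Hypothetical reduction of detectDividers: keep only the segments
// that no known region encloses.
function detectDividersSketch(hLines, regions) {
  return hLines.filter(seg => !regions.some(r => containsBbox(r.bbox, seg)));
}

// Made-up page: one horizontal rule sitting inside a paragraph's bbox.
const rule = { x0: 80, y0: 300, x1: 520, y1: 301 };
const imageRegion = {
  type: 'image',
  bbox: { x0: 50, y0: 50, x1: 550, y1: 200 },
};
const paragraphRegion = {
  type: 'paragraph',
  bbox: { x0: 72, y0: 250, x1: 540, y1: 400 },
};

// Step 9: text regions do not exist yet, so nothing encloses the rule.
const atStep9 = detectDividersSketch([rule], [imageRegion]);

// Step 11.5: the paragraph region is present and encloses the rule.
const atStep11_5 = detectDividersSketch([rule], [imageRegion, paragraphRegion]);

console.log(atStep9.length);    // 1 (phantom divider)
console.log(atStep11_5.length); // 0
```

The filter is identical in both calls; only the contents of the region list change the result.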

The fix was to move divider detection to step 11.5, after text classification populated the full region list.

// correct order in the orchestrator
const textRegions = classifyText(items, columns, pageGraph); // step 11
regions.push(...textRegions);

const dividers = detectDividers(segments, regions, pageGraph); // step 11.5
regions.push(...dividers);


Guard

Any classifier that depends on the region list being complete must run after all classifiers that populate it. Document this constraint in the orchestrator with a comment on the call site. Step ordering is now explicit in the orchestrator function body, so moving a call is a visible change to a visible sequence, not an implicit behavior change inside a 1150-line monolith.
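
A call-site comment relies on the next editor reading it. One possible complement, sketched here and not what the orchestrator currently does, is to let each step declare what it must run after and fail fast when the order is violated. The runSteps helper and the name/after/run step shape are hypothetical names introduced for this illustration.

```javascript
// Hypothetical helper: run steps in array order and throw as soon as a
// step declares a dependency that has not run yet.
function runSteps(steps, ctx) {
  const done = new Set();
  for (const step of steps) {
    for (const dep of step.after ?? []) {
      if (!done.has(dep)) {
        throw new Error(`${step.name} must run after ${dep}`);
      }
    }
    step.run(ctx);
    done.add(step.name);
  }
  return ctx;
}

// Stand-in steps; the real classifiers take more arguments.
const classifyTextStep = {
  name: 'classifyText',
  run: ctx => ctx.regions.push({ type: 'paragraph' }),
};
const detectDividersStep = {
  name: 'detectDividers',
  after: ['classifyText'], // needs the complete region list
  run: () => {},
};

// Correct order: runs without error.
const ok = runSteps([classifyTextStep, detectDividersStep], { regions: [] });

// Swapped order: the ordering bug surfaces as an exception.
let message = null;
try {
  runSteps([detectDividersStep, classifyTextStep], { regions: [] });
} catch (err) {
  message = err.message;
}

console.log(ok.regions.length); // 1
console.log(message); // "detectDividers must run after classifyText"
```

The trade-off is a small amount of ceremony per step in exchange for an ordering mistake that fails loudly instead of producing subtly wrong output.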


Lesson

A containment check is only as correct as the region list it queries. Verify that all dependent data structures are fully populated before running any classifier that reads them.
