Postmortem: When the PDF Pipeline Needed a Manual Override
TLDR
The automatic pipeline got most documents right but had no correction path for edge cases. Users could only adjust sliders and hope each change moved the result in the right direction. The region editor replaced slider-hunting with direct spatial correction.
Repo: tools/pdf-processor
What We Built
A PDF extraction pipeline with five tunable thresholds: row band tolerance, paragraph gap, column gap minimum, stream confidence, and minimum confidence filter. The sliders showed ghost overlays on the canvas while dragging so users could see the effect before committing. Re-extraction fired on button click with the current slider values.
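As a rough sketch of how the five thresholds and the preview-before-commit behaviour could be modelled: the field names, default values, and the `preview` helper below are illustrative assumptions, not the actual code in tools/pdf-processor.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    """The five global thresholds. Names and default values are placeholders."""
    row_band_tolerance: float = 2.0
    paragraph_gap: float = 12.0
    column_gap_min: float = 18.0
    stream_confidence: float = 0.6
    min_confidence: float = 0.3


def preview(committed: Thresholds, **dragged: float) -> Thresholds:
    """Return the values a ghost overlay would be drawn with during a drag.

    The committed thresholds are never mutated; they only change when the
    user clicks the re-extract button (not modelled here).
    """
    return replace(committed, **dragged)


committed = Thresholds()
ghost = preview(committed, paragraph_gap=20.0)  # slider is mid-drag
```

Keeping the committed values immutable is one way to make sure a drag only affects the overlay until re-extraction is requested.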
What Failed
Thresholds are page-global. A paragraph gap threshold that fixes a two-column academic paper may fragment a single-column report. A stream confidence value that correctly separates tables from prose on page 3 may incorrectly merge them on page 7.
The more fundamental problem: some regions are wrong not because of a threshold value but because the algorithm made the wrong structural choice. A text block next to a narrow table sometimes gets absorbed into the table because the column gap at that position happens to be small. No global threshold can fix that without breaking other pages.
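A toy illustration of why no single value can work, assuming the column detector splits two neighbours whenever the gap between them is at least the column gap minimum. The gap widths, the `Gap` type, and the helper names are invented for the example; the real detector is certainly more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    width: float        # horizontal whitespace between two neighbours
    should_split: bool  # what a human would say: separate regions or not?


def splits(gap: Gap, column_gap_min: float) -> bool:
    """Assumed rule: split when the gap is at least the threshold."""
    return gap.width >= column_gap_min


def thresholds_that_work(gaps, candidates):
    """Every candidate threshold that classifies all gaps correctly."""
    return [t for t in candidates
            if all(splits(g, t) == g.should_split for g in gaps)]


page = [
    Gap(width=6.0, should_split=True),   # text block beside a narrow table
    Gap(width=9.0, should_split=False),  # wider gap that is not a boundary
]
candidates = [x / 2 for x in range(0, 41)]  # 0.0 to 20.0 in steps of 0.5

working = thresholds_that_work(page, candidates)         # empty list
only_first = thresholds_that_work(page[:1], candidates)  # non-empty
only_second = thresholds_that_work(page[1:], candidates) # non-empty
```

Each gap has a workable range on its own (at most 6.0 for the first, above 9.0 for the second), but the two ranges do not intersect, so slider adjustments can only trade one error for the other.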
The correction workflow was a loop: adjust a slider, re-extract, inspect the canvas, adjust the slider again. Sometimes no setting was right for the whole page: the same threshold value left one region correct and another region on the same page wrong.
What We Threw Away
Slider-only correction. The ghost overlays stay. The sliders stay. But they are no longer the only path.
What Survived
The pipeline itself. Automatic detection is correct on most documents and most pages. The region editor is a correction layer on top of a working pipeline, not a replacement for it.
The ghost overlay system for sliders survived intact and complements the editor: sliders tune the global thresholds, and the editor corrects the remaining per-region errors.
What Replaced It
Direct spatial editing on the canvas. Select a region, drag its handles, press a key to reclassify it, press delete to exclude it from the next extraction. Place column split lines to override the automatic column detector on documents where the gutter detection fails.
The correction is local and direct. No global threshold is touched. Other regions are unaffected.
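The editing operations described above could be modelled roughly as follows. This is a minimal sketch under assumptions: the `Region` and `PageEdits` types, their fields, and the method names are hypothetical and do not reflect the editor's real data model.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Region:
    id: int
    kind: str                                # e.g. "text" or "table"
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1
    excluded: bool = False                   # skipped on the next extraction


@dataclass
class PageEdits:
    """Per-page manual corrections layered on the automatic output."""
    regions: dict[int, Region]
    column_splits: list[float] = field(default_factory=list)

    def resize(self, region_id: int, bbox) -> None:
        # Dragging a handle replaces only this region's bounding box.
        self.regions[region_id] = replace(self.regions[region_id], bbox=bbox)

    def reclassify(self, region_id: int, kind: str) -> None:
        self.regions[region_id] = replace(self.regions[region_id], kind=kind)

    def exclude(self, region_id: int) -> None:
        self.regions[region_id] = replace(self.regions[region_id], excluded=True)

    def add_column_split(self, x: float) -> None:
        # A manual split line overrides the automatic column detector.
        self.column_splits.append(x)
        self.column_splits.sort()


page = PageEdits(regions={
    1: Region(1, "table", (40, 100, 300, 260)),
    2: Region(2, "table", (310, 100, 560, 260)),  # really a text block
    3: Region(3, "text", (40, 700, 560, 720)),    # stray footer
})
untouched = page.regions[1]

page.reclassify(2, "text")
page.resize(2, (320, 100, 560, 260))
page.exclude(3)
page.add_column_split(305.0)
```

Every operation is keyed by a region id or a split position, so region 1 is the same object before and after the edits and no threshold value is involved.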
The Lesson
When a system has multiple tunable parameters but users need to correct a specific region, give them a direct manipulation interface for that region. Global parameters are the wrong abstraction for local errors.