Direct Manipulation Beats Global Thresholds for Spatial Correction
TLDR
When a pipeline gets a specific region wrong, adding a global threshold to fix it is the wrong abstraction. Direct manipulation: click the region, press a key, done. No threshold pollution, no regression risk on other regions.
Repo: tools/pdf-processor
What the Industry Does
Tunable parameters. Confidence thresholds, sensitivity sliders, min/max bounds. Most PDF extraction tools ship with a settings panel where you adjust global values and re-run. The implicit model is: if the algorithm is wrong, it is wrong because a threshold is miscalibrated, and the right global value will fix all instances of that error.
This model works for systematic errors. If the column gap threshold is too small and every document is splitting into three columns instead of two, a global fix is correct.
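To make the systematic case concrete, here is a minimal sketch of a gap-based column counter. It is not the pipeline's actual detector: countColumns, minColumnGap, and the sample coordinates are all assumptions chosen for illustration. It shows a threshold that is too small turning a wide word gap into a third column, and a single global change fixing it.

```javascript
// Hypothetical gap-based column counter, for illustration only.
// xRanges: [{ left, right }] horizontal extents of text items on one page.
function countColumns(xRanges, minColumnGap) {
  if (xRanges.length === 0) return 0;
  const sorted = [...xRanges].sort((a, b) => a.left - b.left);
  let columns = 1;
  let rightEdge = sorted[0].right;
  for (const range of sorted.slice(1)) {
    // A horizontal gap wider than the threshold starts a new column.
    if (range.left - rightEdge > minColumnGap) columns += 1;
    rightEdge = Math.max(rightEdge, range.right);
  }
  return columns;
}

// Two real columns, with a 12-unit word gap inside the first one.
const samplePage = [
  { left: 40, right: 150 },
  { left: 162, right: 280 },
  { left: 320, right: 560 },
];

console.log(countColumns(samplePage, 10)); // 3: the word gap is read as a column gap
console.log(countColumns(samplePage, 20)); // 2: one global change fixes it
```

If every document shows the same over-split, raising the threshold once is the right repair.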
Why It Is Wrong for This Problem
Regional extraction errors are often structural, not threshold-based. A text block gets absorbed into an adjacent table because the two elements share a column boundary at that specific page position. The column gap at that location is legitimately small. There is no global threshold value that correctly separates them here while leaving the rest of the document intact.
Adding a threshold to fix a structural error only shifts the problem: now you need to find the threshold range where this region is correct without breaking the other regions that depend on the same threshold. On a complex document, such a range often does not exist.
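The "no valid range" claim can be checked mechanically. The sketch below is an illustration, not part of the pipeline: feasibleThresholdRange, the constraint shape, and the gap values are assumptions. Each region states what it needs from a shared gap threshold, and the function intersects those requirements.

```javascript
// Each constraint describes what one region needs from a shared gap threshold.
// 'split': the local gap must exceed the threshold (threshold < gap).
// 'merge': the local gap must not exceed the threshold (threshold >= gap).
function feasibleThresholdRange(constraints) {
  let lower = 0; // inclusive lower bound
  let upper = Infinity; // exclusive upper bound
  for (const { gap, want } of constraints) {
    if (want === 'split') upper = Math.min(upper, gap);
    else lower = Math.max(lower, gap);
  }
  // An empty interval means no single threshold satisfies every region.
  return lower < upper ? { lower, upper } : null;
}

// A text block 6 units from a table must be split off,
// while a 9-unit word gap elsewhere must stay merged.
const conflicting = [
  { gap: 6, want: 'split' },
  { gap: 9, want: 'merge' },
];
console.log(feasibleThresholdRange(conflicting)); // null

// With a genuinely wide column gap the two requirements are compatible.
const compatible = [
  { gap: 30, want: 'split' },
  { gap: 9, want: 'merge' },
];
console.log(feasibleThresholdRange(compatible)); // { lower: 9, upper: 30 }
```

When the result is null, tuning cannot help: the region that must split has a smaller gap than a region that must stay merged.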
The Better Approach
Let the pipeline run automatically. On the minority of regions it gets wrong, give the user a direct handle: click the region, drag to resize, press a key to reclassify. The correction is local and scoped. Nothing else changes.
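A minimal sketch of what "local and scoped" can mean in code. The region shape ({ id, kind, bbox }), the correction shape, and applyCorrection are hypothetical names introduced for this example, not the tool's actual data model. The point is that a correction names one region, and every other region passes through untouched.

```javascript
// Hypothetical region shape: { id, kind, bbox: { x, y, w, h } }.
// Hypothetical correction shape: { regionId, type, kind? , bbox? }.
function applyCorrection(regions, correction) {
  return regions.map((region) => {
    // Regions the user did not touch are returned as the same object.
    if (region.id !== correction.regionId) return region;
    switch (correction.type) {
      case 'reclassify':
        return { ...region, kind: correction.kind };
      case 'resize':
        return { ...region, bbox: { ...correction.bbox } };
      default:
        return region; // unknown correction types are ignored
    }
  });
}

const detected = [
  { id: 'r1', kind: 'text', bbox: { x: 40, y: 80, w: 240, h: 60 } },
  { id: 'r2', kind: 'text', bbox: { x: 40, y: 150, w: 240, h: 120 } },
];

// The user clicks r2 and presses the key bound to "table".
const corrected = applyCorrection(detected, {
  regionId: 'r2',
  type: 'reclassify',
  kind: 'table',
});

console.log(corrected[1].kind); // 'table'
console.log(corrected[0] === detected[0]); // true: r1 is untouched
```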
For the column detector specifically: when automatic detection fails on a particular document, let the user draw the split lines where they know they should be. The pipeline skips detection entirely for that document and uses the user's lines as ground truth:
if (manualSplits && manualSplits.length > 0) {
  // Manual splits exist: bypass all automatic column detection.
  return buildSplitsFromManual(manualSplits, vpWidth);
}
// Otherwise, run the bipartite scan as normal.
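The snippet calls buildSplitsFromManual without showing its body. The following is a guess at what such a function might do, not the repo's implementation. It assumes manualSplits is an array of x positions in viewport units, that vpWidth is the viewport width, and that the result is a list of column ranges covering the full width.

```javascript
// Hypothetical sketch: only the call site appears in the original snippet.
// Assumes manualSplits holds x positions in viewport units.
function buildSplitsFromManual(manualSplits, vpWidth) {
  const xs = [...new Set(manualSplits)] // drop duplicate lines
    .filter((x) => x > 0 && x < vpWidth) // ignore lines outside the viewport
    .sort((a, b) => a - b);
  const edges = [0, ...xs, vpWidth];
  const columns = [];
  for (let i = 0; i < edges.length - 1; i += 1) {
    columns.push({ left: edges[i], right: edges[i + 1] });
  }
  return columns;
}

// Two user-drawn lines (one drawn twice) on a 600-unit-wide viewport.
console.log(buildSplitsFromManual([400, 200, 400], 600));
// [ { left: 0, right: 200 }, { left: 200, right: 400 }, { left: 400, right: 600 } ]
```

Because the user's lines are treated as ground truth, the sketch does no validation against the detected text. It only normalizes the input.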
The Tradeoff You Are Accepting
Manual corrections are session-local. They do not persist across page reloads. If persistence matters for your use case, you need a correction store, which adds complexity. The current implementation accepts the tradeoff: the pipeline should be good enough that re-correcting is rare, not routine.
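For readers who do need persistence, here is a minimal sketch of what a correction store could look like. It is not part of the current implementation. CorrectionStore, the docId-plus-page key, and the correction shape are assumptions, and the sketch only serializes to a JSON string, leaving open where that string is written.

```javascript
// Hypothetical correction store, keyed by document id and page number.
class CorrectionStore {
  constructor(entries = []) {
    this.byPage = new Map(entries); // key -> array of corrections
  }

  static keyFor(docId, pageNumber) {
    return `${docId}#${pageNumber}`;
  }

  add(docId, pageNumber, correction) {
    const key = CorrectionStore.keyFor(docId, pageNumber);
    const existing = this.byPage.get(key) ?? [];
    this.byPage.set(key, [...existing, correction]);
  }

  get(docId, pageNumber) {
    return this.byPage.get(CorrectionStore.keyFor(docId, pageNumber)) ?? [];
  }

  // Map entries are [key, value] pairs, which survive a JSON round trip.
  serialize() {
    return JSON.stringify([...this.byPage]);
  }

  static deserialize(json) {
    return new CorrectionStore(JSON.parse(json));
  }
}

const store = new CorrectionStore();
store.add('report.pdf', 3, { regionId: 'r7', type: 'reclassify', kind: 'table' });

// Simulate a page reload: write out, then read back.
const restored = CorrectionStore.deserialize(store.serialize());
console.log(restored.get('report.pdf', 3).length); // 1
console.log(restored.get('report.pdf', 4).length); // 0
```

The extra complexity mostly lives outside this class: deciding how to identify a document across sessions, and what to do with a stored correction when a pipeline change means the region it refers to no longer exists.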
Direct manipulation also requires a canvas-based UI. That is more implementation work than a slider. If your tool runs in a non-visual context, this approach is not available.
Where the Common Pattern Is Correct
Global thresholds are correct when the error is systematic and the threshold is meaningful across documents. Row band tolerance, paragraph gap threshold, column gap minimum: these are legitimate global values because they describe properties of the document layout model, not of specific regions. They belong in the pipeline. They are just not sufficient for per-region correction.
Use thresholds for model calibration. Use direct manipulation for regional correction. They solve different problems.