Pdf Processor

Confidence Scores Are the Wrong Default for Pipeline Observability

2026-05-31

TLDR

Confidence scores are the right UI for probabilistic models. For deterministic geometric pipelines, they're theater. Expose the thresholds that drove the decision instead. Users can reason about physical measurements. They cannot reason about composite scores.

The confidence score default

When building the Analysis panel, the first instinct was to expose STREAM_CONFIDENCE as the stream table slider. It's a number between 0 and 1. It has a threshold (0.60 default). Drag the slider and more or fewer stream tables appear. Clean, familiar, makes sense at first glance.

Except STREAM_CONFIDENCE is computed as:

const confidence = (colAlignScore + rowSpacingScore) / 2;

Where colAlignScore is 1 - stdDev(column X positions) / colTolPx, and rowSpacingScore is 1 - stdDev(row gaps) / mean(row gaps). A user looking at this number on a canvas overlay has no path from "0.63" to "what do I change to make this work."
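To make that opacity concrete, here is a minimal sketch of the score as described above. The helper names (mean, stdDev, streamConfidence), the StreamScore shape, and the choice of population standard deviation are assumptions for illustration, not the pipeline's actual code.

```typescript
// Hypothetical helpers: the source gives the formula, not these functions.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (an assumption; the variant is not specified).
function stdDev(values: number[]): number {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) * (v - m))));
}

interface StreamScore {
  colAlignScore: number;
  rowSpacingScore: number;
  confidence: number;
}

function streamConfidence(colXs: number[], rowGaps: number[], colTolPx: number): StreamScore {
  const colAlignScore = 1 - stdDev(colXs) / colTolPx;
  const rowSpacingScore = 1 - stdDev(rowGaps) / mean(rowGaps);
  return { colAlignScore, rowSpacingScore, confidence: (colAlignScore + rowSpacingScore) / 2 };
}

// Ragged columns, perfectly even rows.
const raggedColumns = streamConfidence([96, 104, 96, 104], [12, 12, 12], 8);
// Perfectly aligned columns, uneven rows.
const unevenRows = streamConfidence([100, 100, 100, 100], [6, 18, 6, 18], 8);

console.log(raggedColumns.confidence, unevenRows.confidence); // 0.75 0.75
```

Two tables with opposite problems land on the same composite number. The score alone cannot say which geometry to fix.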

Compare to R_COL_TOL, the column anchor cluster radius. Raise it from 0.8 to 1.5 and items that are 1.5 font-sizes apart in X can land in the same column anchor. Visible consequence: loose columns merge. Drag the slider on the canvas and you can see the ghost overlay change as column anchor widths shift. The threshold has a geometric meaning you can observe.
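A rough sketch of why that threshold is easy to reason about. The greedy clustering below is an assumed stand-in for whatever the pipeline actually does, and clusterColumnAnchors and its parameters are hypothetical names; the only thing taken from the text is that the radius is measured in font sizes.

```typescript
// Greedy 1-D clustering sketch (assumed algorithm, not the pipeline's):
// an X position joins the current cluster when it lies within
// colTol * fontSizePx of that cluster's running mean.
function clusterColumnAnchors(xPositions: number[], colTol: number, fontSizePx: number): number[][] {
  const radiusPx = colTol * fontSizePx;
  const sorted = [...xPositions].sort((a, b) => a - b);
  const clusters: number[][] = [];
  for (const x of sorted) {
    const last = clusters[clusters.length - 1];
    if (last !== undefined) {
      const center = last.reduce((sum, v) => sum + v, 0) / last.length;
      if (Math.abs(x - center) <= radiusPx) {
        last.push(x);
        continue;
      }
    }
    clusters.push([x]);
  }
  return clusters;
}

const sampleXs = [50, 52, 64, 120, 122];
const tightClusters = clusterColumnAnchors(sampleXs, 0.8, 10); // radius 8 px
const looseClusters = clusterColumnAnchors(sampleXs, 1.5, 10); // radius 15 px

console.log(tightClusters.length, looseClusters.length); // 3 2
```

At 0.8 the item at x = 64 forms its own column. At 1.5 it merges into the column at 50 to 52, which is exactly the change a user would see on the page.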

The industry gets this backwards

Object-detection papers publish precision-recall curves. YOLO-style detectors ship with a confidence threshold to tune. So when you build a detection system, you default to exposing confidence. It feels rigorous.

But the stream table detector is not a neural network. It's a five-step geometric algorithm:

1. Group text items into Y-bands
2. Cluster Y-band groups by adaptive inter-band gap
3. Find X clusters present in 2+ bands (column anchors)
4. Gate on fill rate, average text length, items per band
5. Score on column alignment variance + row spacing variance
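Steps 1 and 3 of the list above can be sketched roughly as follows. Everything here is illustrative: the TextItem and Band shapes, the function names, and the fixed-width bucketing used in place of real clustering are all assumptions, and step 2 (the adaptive gap) is skipped entirely.

```typescript
interface TextItem { x: number; y: number; }
interface Band { y: number; items: TextItem[]; }

// Step 1 (sketch): an item joins the current band when its Y is within
// yTolPx of the first item in that band.
function groupIntoYBands(items: TextItem[], yTolPx: number): Band[] {
  const bands: Band[] = [];
  const sorted = [...items].sort((a, b) => a.y - b.y);
  for (const item of sorted) {
    const last = bands[bands.length - 1];
    if (last !== undefined && item.y - last.y <= yTolPx) {
      last.items.push(item);
    } else {
      bands.push({ y: item.y, items: [item] });
    }
  }
  return bands;
}

// Step 3 (sketch): bucket X positions by colTolPx and keep the buckets that
// appear in at least two bands. Bucketing is a simplification of clustering.
function findColumnAnchors(bands: Band[], colTolPx: number): number[] {
  const bandsPerBucket = new Map<number, Set<number>>();
  bands.forEach((band, bandIndex) => {
    for (const item of band.items) {
      const bucket = Math.round(item.x / colTolPx);
      let seen = bandsPerBucket.get(bucket);
      if (seen === undefined) {
        seen = new Set<number>();
        bandsPerBucket.set(bucket, seen);
      }
      seen.add(bandIndex);
    }
  });
  const anchors: number[] = [];
  bandsPerBucket.forEach((seen, bucket) => {
    if (seen.size >= 2) anchors.push(bucket * colTolPx);
  });
  return anchors.sort((a, b) => a - b);
}

const sampleItems: TextItem[] = [
  { x: 50, y: 10 }, { x: 120, y: 10 }, { x: 200, y: 10 },
  { x: 51, y: 22 }, { x: 121, y: 22 },
  { x: 49, y: 34 }, { x: 119, y: 34 }, { x: 300, y: 34 },
];
const sampleBands = groupIntoYBands(sampleItems, 3);
const sampleAnchors = findColumnAnchors(sampleBands, 8);

console.log(sampleBands.length, sampleAnchors); // 3 [ 48, 120 ]
```

The stray items at x = 200 and x = 300 each appear in only one band, so they never become anchors. Every decision here is a comparison against a threshold, with no probability involved.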

Steps 1-4 are deterministic gates. Step 5 produces a score. The score is downstream of all the geometry. Exposing the score as the primary control is exposing the output of the algorithm as a knob on the algorithm: a feedback loop with no intuition.

The right controls are the inputs:

Step 1: R_Y_BAND (already shared with page-level grouping)
Step 3: R_COL_TOL (cluster radius)
Step 3: minimum anchor count (not worth exposing; 3 is correct for almost every table)
Steps 1-2: R_STREAM_GAP (section break gap; adaptive and self-calibrating, so it needs no control)

The score gate (STREAM_CONFIDENCE at 0.60) should be a hard floor. It's a guard against degenerate candidates. It's not a tuning knob.
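One way to encode that split in code, as a sketch only. TunableThresholds, FIXED_GUARDS, and passesScoreFloor are hypothetical names, and the R_Y_BAND value is a placeholder rather than a real default; the 0.60 floor, the 0.8 column tolerance, and the anchor count of 3 come from the text above.

```typescript
// Inputs worth exposing as sliders.
interface TunableThresholds {
  R_Y_BAND: number;  // Y-band tolerance (step 1)
  R_COL_TOL: number; // column anchor cluster radius (step 3)
}

// Fixed guards, frozen so that no UI code path can turn them into knobs.
const FIXED_GUARDS = Object.freeze({
  STREAM_CONFIDENCE_FLOOR: 0.6, // hard floor against degenerate candidates
  MIN_ANCHOR_COUNT: 3,          // correct for almost every table
});

// The score only ever answers one question: is this candidate degenerate?
function passesScoreFloor(confidence: number): boolean {
  return confidence >= FIXED_GUARDS.STREAM_CONFIDENCE_FLOOR;
}

// R_Y_BAND here is a placeholder value, not the pipeline's default.
const userThresholds: TunableThresholds = { R_Y_BAND: 0.5, R_COL_TOL: 0.8 };

console.log(passesScoreFloor(0.63), passesScoreFloor(0.41)); // true false
```

Keeping the floor out of the tunable type means a slider for it cannot be added by accident.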

When confidence scores are the right UI

Confidence matters when the model is making a probabilistic prediction and you want to set the precision-recall operating point. If a region could be a table or prose and the detector genuinely doesn't know, the confidence score captures that uncertainty. The user setting a higher threshold says "I'd rather miss some tables than accept false positives."

In the current pipeline, the only detector whose score reflects anything like real uncertainty is the stream table detector. Even then, the uncertainty is better expressed through its input parameters than its output score. A low-confidence stream table usually means the column anchors are misaligned or the row spacing is irregular. Both of those have direct threshold controls.

The lattice table detector has no confidence score because there's nothing uncertain about it: either the segments form a closed grid or they don't. The paragraph detector has no confidence score because a text band either has a large enough Y-gap to be a paragraph break or it doesn't.
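The paragraph case is simple enough to sketch in a few lines. This assumes bands are represented by their Y positions in pixels, sorted top to bottom; splitParagraphs and paraGapPx are hypothetical names, not the pipeline's.

```typescript
// A gap between consecutive bands either exceeds the threshold or it does not.
// There is no score to report, only a yes/no per gap.
function splitParagraphs(bandYs: number[], paraGapPx: number): number[][] {
  const paragraphs: number[][] = [];
  let current: number[] = [];
  let prevY: number | undefined;
  for (const y of bandYs) {
    if (prevY !== undefined && y - prevY > paraGapPx) {
      paragraphs.push(current);
      current = [];
    }
    current.push(y);
    prevY = y;
  }
  if (current.length > 0) paragraphs.push(current);
  return paragraphs;
}

const sampleBandYs = [10, 22, 34, 60, 72];
const defaultSplit = splitParagraphs(sampleBandYs, 18);
const aggressiveSplit = splitParagraphs(sampleBandYs, 10);

console.log(defaultSplit.length, aggressiveSplit.length); // 2 5
```

Lowering the threshold from 18 to 10 turns every line into its own paragraph, and the reason is visible in the gaps themselves.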

Confidence scores belong in discriminative classifiers. In deterministic geometric pipelines, expose the thresholds and let users see the geometry.

The ghost overlay as the alternative

The ghost overlay is the actual answer to what confidence scores are trying to solve. Instead of telling users "this region scored 0.63," it shows them on the canvas what the threshold looks like geometrically as they drag. Y-band tolerance shows bracket lines. Para-gap shows break threshold lines. Col-gap shows the minimum gutter width. Stream col tolerance shows column anchor widths.

The user sees the geometry change as they drag. They understand what the number means because they can see its physical consequence on the page they're looking at. They can make an informed decision before firing a re-extraction.
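The geometry behind one of those ghosts can be sketched as a pure function from threshold to rectangles. The Rect shape, the function names, and the idea of drawing each column anchor as a full-height band are assumptions for illustration; the actual overlay may render differently.

```typescript
interface Rect { x: number; y: number; width: number; height: number; }

// Each column anchor becomes a vertical band whose half-width is the
// cluster radius, colTol * fontSizePx.
function columnAnchorGhosts(
  anchorXs: number[],
  colTol: number,
  fontSizePx: number,
  pageTopPx: number,
  pageBottomPx: number,
): Rect[] {
  const radiusPx = colTol * fontSizePx;
  return anchorXs.map((x) => ({
    x: x - radiusPx,
    y: pageTopPx,
    width: 2 * radiusPx,
    height: pageBottomPx - pageTopPx,
  }));
}

// True when any two ghost bands overlap horizontally, a visual hint that
// the corresponding columns are about to merge.
function anyGhostsOverlap(rects: Rect[]): boolean {
  return rects.some((a, i) =>
    rects.some((b, j) => i < j && a.x < b.x + b.width && b.x < a.x + a.width),
  );
}

const tightGhosts = columnAnchorGhosts([50, 70], 0.8, 10, 0, 800);
const looseGhosts = columnAnchorGhosts([50, 70], 1.5, 10, 0, 800);

console.log(anyGhostsOverlap(tightGhosts), anyGhostsOverlap(looseGhosts)); // false true
```

Because the function is pure, it can be recomputed on every slider movement without touching the extraction itself.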

That's harder to build than a confidence score slider. It's also the only UI that actually helps.
