Pdf Processor

Making a Processing Pipeline Observable: Region Manifests, Skip Gates, and Ghost Overlays

2026-05-31

TLDR

A pipeline that produces output without exposing intermediate state is a black box to the user. Adding observability means: emitting a typed manifest of what each stage produced, exposing threshold controls with live previews, and enabling single-item re-runs without reprocessing the whole input. None of these require changing the core algorithm.

Repo: tools/pdf-processor

The Problem Class

Any developer who has built a multi-stage document processing pipeline has seen this: the user gets output, something is wrong, and they have no way to know which stage produced it. Was it the detector? The classifier? The assembler? The user adjusts a parameter and re-runs, hoping it moved in the right direction.

The problem is not the algorithm. The problem is missing intermediate state. Observability is not a feature you add to the UI. It is a contract change in the pipeline's output.

The Region Manifest

The first contract change: every pipeline run emits a typed manifest of what each classification stage produced, alongside the final output.

Before this change, the worker posted only { html, text, tables } per page. The regions array was computed internally and discarded. After:

// Worker postMessage payload
{
  html: '...',
  regions: regions.map((r, i) => ({
    id: r.id || `p${page}-r${i}`,
    type: r.type,           // LATTICE_TABLE, PARAGRAPH, HEADING, etc.
    bbox: r.bbox,           // viewport-space bounding box
    algorithm: r.algorithm ?? 'geometric',
    confidence: r.confidence ?? 1.0,
    columnIndex: r.columnIndex ?? -1,
  })),
  pageScale: scale.toJSON(), // S (body font size in px), derived tolerances
}

pageScale.toJSON() serializes the calibrated thresholds for this specific page: body font size in viewport pixels and all the absolute tolerances derived from it. This is what makes the readout useful: it shows the numbers the pipeline actually used, not the ratio defaults from config.
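Once the manifest is part of the payload, a debug panel can summarize it without touching the pipeline. The following is a minimal sketch of such a consumer. The field names (type, confidence, id) follow the payload above, but summarizeManifest, the 0.5 cutoff, and the sample data are hypothetical and only for illustration.

```javascript
// Hypothetical helper: summarize a region manifest for a debug panel.
// Field names follow the worker payload; the function name and the
// default cutoff are illustrative assumptions, not part of the tool.
function summarizeManifest(regions, lowConfidenceCutoff = 0.5) {
  const countsByType = {};
  const lowConfidenceIds = [];
  for (const r of regions) {
    // Tally how many regions each classifier type produced.
    countsByType[r.type] = (countsByType[r.type] ?? 0) + 1;
    // Collect ids of regions below the cutoff for closer inspection.
    if (r.confidence < lowConfidenceCutoff) lowConfidenceIds.push(r.id);
  }
  return { total: regions.length, countsByType, lowConfidenceIds };
}

// Made-up sample manifest for one page.
const sampleRegions = [
  { id: 'p1-r0', type: 'HEADING', confidence: 1.0 },
  { id: 'p1-r1', type: 'PARAGRAPH', confidence: 0.9 },
  { id: 'p1-r2', type: 'LATTICE_TABLE', confidence: 0.4 },
  { id: 'p1-r3', type: 'PARAGRAPH', confidence: 1.0 },
];
console.log(summarizeManifest(sampleRegions));
```

A summary like this answers the "which stage produced it?" question at a glance: an unexpected count for one type points at that stage's classifier.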

Pipeline Skip Gates

The second contract change: every classification stage can be individually bypassed via a skip set passed into the orchestrator.

const skip = opts.pipeline?.skip ?? new Set();
if (!skip.has('LATTICE_TABLE')) {
  const latticeRegions = detectLatticeTables(segments, pageGraph);
  regions.push(...latticeRegions);
}
if (!skip.has('STREAM_TABLE')) {
  const streamTables = detectStreamTableRegions(items, columns, pageGraph);
  regions.push(...streamTables);
}

When a type is in the skip set, the classifier does not run. Its content stays unclaimed and falls through to the next stage. This is different from suppressing output: it changes how the pipeline claims content, which affects everything downstream.

The effect differs by stage. Skipping PARAGRAPH omits prose from the output entirely. Skipping LATTICE_TABLE leaves bordered table content as unclaimed segments, which then get reclassified as paragraphs. Both are valid diagnostic modes.
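The fall-through behavior can be modeled with a toy pipeline. This is a simplified sketch, not the real orchestrator: the two-stage list, the bordered flag, and the assumption that PARAGRAPH is a catch-all that runs last are all hypothetical stand-ins for the actual detectors.

```javascript
// Toy model of claim-based classification with skip gates.
// Stage names mirror the post; the claiming rules are simplified assumptions.
const TOY_STAGES = [
  { type: 'LATTICE_TABLE', claims: (item) => item.bordered },
  { type: 'PARAGRAPH', claims: () => true }, // assumed catch-all, runs last
];

function classifyWithSkips(items, skip = new Set()) {
  const regions = [];
  let unclaimed = [...items];
  for (const stage of TOY_STAGES) {
    // A skipped stage never runs, so its content stays unclaimed.
    if (skip.has(stage.type)) continue;
    const claimed = unclaimed.filter((item) => stage.claims(item));
    unclaimed = unclaimed.filter((item) => !stage.claims(item));
    regions.push(...claimed.map((item) => ({ type: stage.type, text: item.text })));
  }
  return { regions, unclaimed };
}

const toyItems = [
  { text: 'Q1 | Q2', bordered: true },
  { text: 'Some prose.', bordered: false },
];

console.log(classifyWithSkips(toyItems));                             // table + paragraph
console.log(classifyWithSkips(toyItems, new Set(['LATTICE_TABLE']))); // table text becomes a paragraph
console.log(classifyWithSkips(toyItems, new Set(['PARAGRAPH'])));     // prose is left unclaimed
```

In this model, skipping the table stage changes what the paragraph stage sees, which is the sense in which a skip gate changes how content is claimed and does more than hide output.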

Scale Override: Per-Page Threshold Tuning

The exposed slider controls map directly to the pipeline's ratio properties. They are applied after the body font calibration step, so the override is in ratio space, not absolute pixels:

const so = opts.pipeline?.scaleOverrides;
if (so) {
  if (so.R_Y_BAND     !== undefined) scale.R_Y_BAND     = so.R_Y_BAND;
  if (so.R_PARA_GAP   !== undefined) scale.R_PARA_GAP   = so.R_PARA_GAP;
  if (so.R_COL_GAP_MIN !== undefined) scale.R_COL_GAP_MIN = so.R_COL_GAP_MIN;
}

The pageScale readout in the UI shows the resulting absolute values. A slider at 0.45 on a document with a 14px body font produces different pixel tolerances than the same slider on a document with a 10px body font. The ratio is the user-facing control. The absolute value is what the pipeline actually uses.
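The ratio-to-pixel relationship is simple arithmetic. The sketch below assumes an absolute tolerance is the ratio multiplied by the calibrated body font size; toAbsoluteTolerance is a hypothetical name, and the real scale object may derive its tolerances differently.

```javascript
// Assumption: absolute tolerance = ratio * calibrated body font size (S).
// The function name is hypothetical.
function toAbsoluteTolerance(ratio, bodyFontSizePx) {
  return ratio * bodyFontSizePx;
}

// The same slider value on two documents with different body fonts.
for (const bodyFontSizePx of [14, 10]) {
  const tolPx = toAbsoluteTolerance(0.45, bodyFontSizePx);
  console.log(`S=${bodyFontSizePx}px -> tolerance ${tolPx.toFixed(2)}px`);
}
```

With a ratio of 0.45, a 14px body font gives a tolerance of about 6.3px and a 10px body font gives 4.5px, so the same slider position adapts to each document's scale.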

Ghost Overlays: Visualizing Thresholds Without Re-Running

The most useful diagnostic feature is not re-extraction. It is the ghost overlay that shows what a threshold change would affect before re-running. As the user drags a slider, a second pass draws geometric previews on the canvas using the current slider value.

For the line-grouping tolerance:

if (ghostType === 'yband') {
  const tolPx = bodyFontSize * scaleOverrides.R_Y_BAND * renderScale;
  ctx.strokeStyle = 'rgba(232,121,249,0.7)';
  ctx.setLineDash([3, 3]);
  for (const r of pageData.regions) {
    if (!r.bbox) continue;
    const ry = r.bbox.y * renderScale;
    ctx.beginPath();
    ctx.moveTo(0, ry - tolPx);
    ctx.lineTo(canvas.width, ry - tolPx);
    ctx.stroke();
    ctx.beginPath();
    ctx.moveTo(0, ry + tolPx);
    ctx.lineTo(canvas.width, ry + tolPx);
    ctx.stroke();
  }
}

Bracket lines show the tolerance band around each region's Y baseline. Drag the slider up and the brackets widen, so more items will group into the same text band. Drag down and they narrow. The effect is visible on the current page without firing re-extraction.

For confidence-gated regions, the overlay colors each candidate pass/fail based on the current slider value. The user can see which regions would survive or be rejected at the current threshold before committing to a re-run.
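The pass/fail decision behind that overlay can be kept as a pure function, separate from the canvas drawing. A minimal sketch, assuming a region passes when its confidence is at or above the slider value; confidenceOverlayStyles and the two colors are hypothetical choices.

```javascript
// Hypothetical helper: pick an overlay style per region for the current
// slider value. The >= comparison and the colors are assumptions.
function confidenceOverlayStyles(regions, threshold) {
  return regions.map((r) => {
    // Regions without a confidence are treated as fully confident,
    // matching the `?? 1.0` default in the manifest.
    const passes = (r.confidence ?? 1.0) >= threshold;
    return {
      id: r.id,
      passes,
      strokeStyle: passes ? 'rgba(34,197,94,0.8)' : 'rgba(239,68,68,0.8)',
    };
  });
}

// Made-up candidates for illustration.
const candidateRegions = [
  { id: 'p3-r0', confidence: 0.92 },
  { id: 'p3-r1', confidence: 0.55 },
  { id: 'p3-r2' }, // no confidence recorded
];
console.log(confidenceOverlayStyles(candidateRegions, 0.6));
```

Because the function only reads the manifest and the slider value, it can run on every slider input event without involving the worker.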

Single-Page Re-Extraction

Re-running the full pipeline on a 50-page document to test a one-slider change on page 3 is not viable. The single-page re-extract path reruns the full per-page pipeline on the requested page only, using the override parameters, and patches just that page in the live output:

// Main thread: only the target page section is replaced
function patchPageInOutput(pageNum, newHtml) {
  const existing = outputContainer.querySelector(`[data-page="${pageNum}"]`);
  if (!existing) return;
  const replacement = parsePageSection(newHtml);
  existing.replaceWith(replacement);
  // Re-sync Monaco editor and visual diff surface from updated DOM
  editorModel.setValue(outputContainer.innerHTML);
}

Other pages are untouched. The editor stays in sync. The user sees the corrected page in place without a full reload.

What to Watch For

The ghost overlay converts between coordinate spaces at draw time: region bboxes are stored in worker viewport space, while canvas pixels depend on the current display size. If the canvas is resized between fetching the page data and drawing the ghost, the overlay stays correct, because the conversion uses the canvas width at draw time rather than at fetch time.
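The draw-time conversion can be sketched as a small pure function. This assumes renderScale is the ratio of the canvas's current width to the width of the worker viewport, and that a bbox carries x, y, width, and height; the post only shows bbox.y, so the other fields and the name bboxToCanvasRect are hypothetical.

```javascript
// Assumption: renderScale = current canvas width / worker viewport width.
// The bbox shape { x, y, width, height } is assumed for illustration.
function bboxToCanvasRect(bbox, viewportWidth, canvasWidth) {
  // Computed on every draw, so a resized canvas yields a fresh scale.
  const renderScale = canvasWidth / viewportWidth;
  return {
    x: bbox.x * renderScale,
    y: bbox.y * renderScale,
    width: bbox.width * renderScale,
    height: bbox.height * renderScale,
  };
}

const storedBbox = { x: 100, y: 50, width: 200, height: 20 };
console.log(bboxToCanvasRect(storedBbox, 800, 800)); // canvas matches viewport
console.log(bboxToCanvasRect(storedBbox, 800, 400)); // canvas shrunk to half width
```

Storing only viewport-space coordinates and scaling at draw time avoids caching canvas-space rectangles that would go stale after a resize.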
