How We Built the PDF Analysis Panel: From Region Manifest to Live Canvas Overlay
TLDR
The Analysis panel gives users live visibility into what the extraction pipeline actually did — which regions were detected, how confident the detector was, and which thresholds drove each decision. Re-extract page re-runs the pipeline on a single page with user-configured overrides, patches only that page in the live output, and leaves the rest of the document untouched. The whole system runs in the existing geometry worker with no new processes.
The problem it solves
PDF extraction is a black box from the user's perspective. A table gets missed. A paragraph becomes a heading. Two columns collapse into one. The user has no way to know whether the threshold was wrong, the detector didn't run, or the region simply wasn't emitted. The Analysis panel makes the pipeline observable.
Region manifest: the intermediate representation that was missing
Before this session, the geometry worker sent only { html, text, tables } per page. The regions array was computed inside the worker, passed to assemblePage, and immediately discarded. There was no way to inspect what the pipeline had classified.
The fix was a region manifest added to every 'page' postMessage:
regions: regions.map((r, i) => ({
  id: r.id || `p${p}-r${i}`,
  type: r.type, // LATTICE_TABLE, PARAGRAPH, HEADING, etc.
  bbox: r.bbox, // viewport-space bounding box
  algorithm: r.algorithm ?? 'geometric',
  confidence: r.confidence ?? 1.0,
  columnIndex: r.columnIndex ?? -1,
  imageId: r.imageId ?? null,
})),
pageScale: scale.toJSON(), // S, yBandTolPx, colGapMinPx, paraGapPx, etc.
scale.toJSON() serializes the calibrated thresholds for the current page: S (the body font size in viewport pixels) plus all the absolute values derived from it. This is what the PageScale readout in the stats panel shows: the actual numbers the pipeline used, not the ratio defaults.
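On the receiving side, the manifest can be summarized without touching the HTML. The sketch below is a minimal, hypothetical consumer: summarizeRegions and the 0.6 low-confidence cutoff are illustrative names and values, not part of the pipeline described here. Only the field names (id, type, confidence) come from the manifest above.

```javascript
// Hypothetical helper: count regions per type and collect low-confidence ids.
// Assumes each entry carries the manifest fields { id, type, confidence }.
function summarizeRegions(regions, lowConfidence = 0.6) {
  const countsByType = {};
  const lowConfidenceIds = [];
  for (const r of regions) {
    countsByType[r.type] = (countsByType[r.type] ?? 0) + 1;
    // Missing confidence defaults to 1.0, as in the manifest itself.
    if ((r.confidence ?? 1.0) < lowConfidence) lowConfidenceIds.push(r.id);
  }
  return { countsByType, lowConfidenceIds };
}

// Example manifest for one page (made-up values).
const sampleRegions = [
  { id: 'p1-r0', type: 'HEADING', confidence: 1.0 },
  { id: 'p1-r1', type: 'PARAGRAPH', confidence: 1.0 },
  { id: 'p1-r2', type: 'STREAM_TABLE', confidence: 0.42 },
  { id: 'p1-r3', type: 'PARAGRAPH', confidence: 1.0 },
];

const summary = summarizeRegions(sampleRegions);
console.log(summary.countsByType);
console.log(summary.lowConfidenceIds);
```

A summary like this is enough to drive a per-page stats readout; the bbox field only matters once regions are drawn on the canvas.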
Pipeline skip gates
Each classifier step in the orchestrator is guarded by the skip set:
const skip = opts.pipeline?.skip ?? new Set();
if (!skip.has('LATTICE_TABLE')) {
  const latticeRegions = detectLatticeTables(...);
  for (const r of latticeRegions) regions.push(r);
}
if (!skip.has('STREAM_TABLE')) {
  const streamTables = detectStreamTableRegions(...);
  // ...
}
When a type is skipped, the classifier doesn't run at all — its segments and text items stay unclaimed and fall through to the next tier. Skipping LATTICE_TABLE means bordered tables become unclaimed segments; the text items inside them become paragraphs. This is different from suppressing output — it changes how the pipeline claims content, which affects everything downstream.
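The fall-through behaviour can be illustrated with a toy model. The sketch below is not the real orchestrator: classifyToy, the sample items, and their bordered flag are hypothetical stand-ins, and the real detectors work on segments and text items rather than pre-labelled objects. It only demonstrates that a skipped tier leaves its items unclaimed for a later tier.

```javascript
// Toy model of tiered claiming (hypothetical; not the real classifier).
// Each tier claims the items it recognizes; unclaimed items fall through.
function classifyToy(items, skip = new Set()) {
  const regions = [];
  const claimed = new Set();

  // Tier 1: bordered content is claimed as a lattice table, unless skipped.
  if (!skip.has('LATTICE_TABLE')) {
    for (const item of items) {
      if (item.bordered) {
        regions.push({ type: 'LATTICE_TABLE', text: item.text });
        claimed.add(item);
      }
    }
  }

  // Last tier: everything still unclaimed becomes a paragraph.
  for (const item of items) {
    if (claimed.has(item)) continue;
    regions.push({ type: 'PARAGRAPH', text: item.text });
  }
  return regions;
}

const items = [
  { text: 'Q1 | Q2 | Q3', bordered: true },
  { text: 'Revenue grew in every quarter.', bordered: false },
];

const normal = classifyToy(items).map((r) => r.type);
const noLattice = classifyToy(items, new Set(['LATTICE_TABLE'])).map((r) => r.type);
console.log(normal.join(', '));    // LATTICE_TABLE, PARAGRAPH
console.log(noLattice.join(', ')); // PARAGRAPH, PARAGRAPH
```

With the lattice tier skipped, the table text is still emitted, just as a paragraph. Nothing is dropped, which is the difference from filtering the output after classification.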
HEADING and LIST skips work differently because they're inside _classifyBucket:
if (headingType && !skip.has('HEADING')) {
  lineType = headingType;
} else if (listResult && !skip.has('LIST')) {
  lineType = listResult.type;
} else {
  if (skip.has('PARAGRAPH')) continue; // omit from output entirely
  lineType = RegionType.PARAGRAPH;
}
Skipping PARAGRAPH is the escape hatch: if you want to see the page without any prose content — only tables, headings, images — paragraph items are simply not emitted.
Scale overrides: per-page threshold tuning
The four exposed sliders map directly to PageScale ratio properties:
const so = opts.pipeline?.scaleOverrides;
if (so) {
  if (so.R_Y_BAND !== undefined) scale.R_Y_BAND = so.R_Y_BAND;
  if (so.R_PARA_GAP !== undefined) scale.R_PARA_GAP = so.R_PARA_GAP;
  if (so.R_COL_GAP_MIN !== undefined) scale.R_COL_GAP_MIN = so.R_COL_GAP_MIN;
  if (so.STREAM_CONFIDENCE !== undefined) scale.STREAM_CONFIDENCE = so.STREAM_CONFIDENCE;
}
These overrides apply after PageScale's constructor runs, so S (the body font calibration) is preserved and only the ratios change. Because the override is applied in ratio space rather than in absolute pixels, the same slider value resolves to different pixel tolerances on different PDFs, depending on their body font size. Moving R_Y_BAND from its 0.45 default to 0.80 widens the tolerance in proportion to S. The PageScale readout shows what the absolute values resolve to for this specific page.
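As a worked example of the ratio-to-pixel relationship, the sketch below assumes the tolerance is simply S multiplied by the ratio, which is how the Y-band label in the overlay code computes it. resolveTolPx and the two body font sizes are hypothetical, chosen only for illustration.

```javascript
// Resolve a PageScale ratio to an absolute tolerance in viewport pixels.
// Assumption: tolerance = S * ratio (as in the overlay's Y-band label).
function resolveTolPx(S, ratio) {
  return S * ratio;
}

// Two hypothetical PDFs with different body font sizes (viewport px).
for (const S of [12, 20]) {
  const atDefault = resolveTolPx(S, 0.45).toFixed(1);
  const atOverride = resolveTolPx(S, 0.80).toFixed(1);
  console.log(`S=${S}px: R_Y_BAND 0.45 -> ±${atDefault}px, 0.80 -> ±${atOverride}px`);
}
```

For S = 12 px the tolerance moves from ±5.4 px to ±9.6 px; for S = 20 px it moves from ±9.0 px to ±16.0 px. The slider positions are identical in both cases, but the resolved thresholds are not.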
Ghost overlays: showing what the slider does before re-extraction
The ghost overlay draws a second pass on the canvas while a slider is being dragged, using the current slider value to show what threshold that corresponds to geometrically.
For R_Y_BAND (line grouping tolerance):
if (_ghostType === 'yband') {
  // Tolerance in canvas pixels: body font size * ratio * overlay scale
  const tolPx = ps.S * _scaleOverrides.R_Y_BAND * rScale;
  ctx.strokeStyle = 'rgba(232,121,249,0.7)';
  ctx.setLineDash([3, 3]);
  for (const r of pageData.regions) {
    if (!r.bbox) continue;
    const ry = r.bbox.y * rScale;
    ctx.beginPath(); ctx.moveTo(0, ry - tolPx);
    ctx.lineTo(canvas.width, ry - tolPx); ctx.stroke();
    ctx.beginPath(); ctx.moveTo(0, ry + tolPx);
    ctx.lineTo(canvas.width, ry + tolPx); ctx.stroke();
  }
  _ghostLabel(ctx, canvas, `Y-band tol: ±${(ps.S * _scaleOverrides.R_Y_BAND).toFixed(1)}px`);
}
Bracket lines show the ±tolerance around each region's Y baseline. Drag the slider up and the brackets widen — more items will group into the same visual line. Drag down and they tighten. The effect is visible on the current page data before any re-extraction fires.
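One way to check the bracket geometry without a canvas is to separate the arithmetic from the drawing. The helper below is a hypothetical sketch: yBandBrackets is not part of the panel code, and it assumes bbox.y is in worker viewport space and that rScale maps it to canvas pixels, as in the draw loop above.

```javascript
// Hypothetical pure helper: compute the bracket line positions the overlay
// would draw, so the geometry can be tested without a canvas context.
function yBandBrackets(regions, S, rYBand, rScale) {
  const tolPx = S * rYBand * rScale; // tolerance in canvas pixels
  return regions
    .filter((r) => r.bbox) // regions without a bbox are skipped
    .map((r) => {
      const ry = r.bbox.y * rScale;
      return { id: r.id, top: ry - tolPx, bottom: ry + tolPx };
    });
}

// Made-up regions; only bbox.y matters for this overlay.
const ghostRegions = [
  { id: 'p1-r0', bbox: { y: 100 } },
  { id: 'p1-r1', bbox: null },
  { id: 'p1-r2', bbox: { y: 300 } },
];

const tight = yBandBrackets(ghostRegions, 20, 0.45, 0.5);
const wide = yBandBrackets(ghostRegions, 20, 0.80, 0.5);
console.log(tight[0]); // band of ±4.5px around y = 50
console.log(wide[0]);  // band of ±8px around y = 50
```

Raising the ratio from 0.45 to 0.80 widens each band from 9 px to 16 px at this scale, which is the widening the user sees while dragging.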
For STREAM_CONFIDENCE (pass/fail per stream region):
for (const r of pageData.regions) {
  if (r.type !== 'STREAM_TABLE' || !r.bbox) continue;
  const pass = (r.confidence ?? 0) >= _scaleOverrides.STREAM_CONFIDENCE;
  // rx, ry, rw, rh: r.bbox scaled into canvas pixels (computation omitted here)
  ctx.fillStyle = pass ? 'rgba(20,184,166,0.25)' : 'rgba(239,68,68,0.25)';
  ctx.fillRect(rx, ry, rw, rh);
  ctx.fillText(`${(r.confidence ?? 0).toFixed(2)} ${pass ? '✓' : '✗'}`, rx + 4, ry + 14);
}
Stream tables turn green or red as the slider moves, showing which would survive or be rejected at the current threshold. No re-extraction needed to understand the impact.
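The same pass/fail rule can be expressed as a pure function, which is handy for showing a count next to the slider. This is a sketch under assumptions: partitionStreamTables is a hypothetical name, and it ignores the bbox check because nothing is drawn.

```javascript
// Hypothetical helper mirroring the overlay's pass/fail rule:
// a STREAM_TABLE region passes when confidence >= threshold
// (a missing confidence is treated as 0, as in the overlay code).
function partitionStreamTables(regions, threshold) {
  const pass = [];
  const fail = [];
  for (const r of regions) {
    if (r.type !== 'STREAM_TABLE') continue;
    const target = (r.confidence ?? 0) >= threshold ? pass : fail;
    target.push(r.id);
  }
  return { pass, fail };
}

// Made-up regions for one page.
const pageRegions = [
  { id: 'p3-r0', type: 'STREAM_TABLE', confidence: 0.82 },
  { id: 'p3-r1', type: 'STREAM_TABLE', confidence: 0.55 },
  { id: 'p3-r2', type: 'PARAGRAPH', confidence: 1.0 },
];

const atHalf = partitionStreamTables(pageRegions, 0.5);
const atSeventy = partitionStreamTables(pageRegions, 0.7);
console.log(atHalf);    // both stream tables pass
console.log(atSeventy); // p3-r1 would be rejected
```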
Single-page re-extraction without reprocessing the document
The worker caches the raw PDF bytes after the initial 'process' run. The re-extract message triggers _handleReprocess which:
- Opens the cached PDF
- Gets the single requested page
- Runs the full per-page pipeline (ctmAdapter, reconcile, image extraction, fontStyleMap, classifyPage with overrides, assemblePage)
- Posts back {type: 'page', reprocess: true, page, html, regions, pageScale}
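The steps above can be sketched as a small message handler. Everything here beyond the message shape is an assumption: the 'reprocess' message type, createHandler, and the injected deps (openPdf, getPage, runPagePipeline, post) are hypothetical names, and the calls are synchronous only to keep the sketch short. Real PDF loading would normally be asynchronous.

```javascript
// Sketch of a worker-side handler that reuses cached PDF bytes.
// Hypothetical: the 'reprocess' message type and the deps object.
function createHandler(deps) {
  let cachedBytes = null; // kept after the initial 'process' run

  return function handle(msg) {
    if (msg.type === 'process') {
      cachedBytes = msg.bytes;
      return; // full-document processing omitted
    }
    if (msg.type === 'reprocess') {
      if (!cachedBytes) throw new Error('no cached PDF; run "process" first');
      const doc = deps.openPdf(cachedBytes);                   // open the cached PDF
      const page = deps.getPage(doc, msg.page);                // single requested page
      const result = deps.runPagePipeline(page, msg.pipeline); // full per-page pipeline
      deps.post({ type: 'page', reprocess: true, page: msg.page, ...result });
    }
  };
}

// Stub dependencies so the control flow can be exercised outside a worker.
const posted = [];
const handle = createHandler({
  openPdf: (bytes) => ({ bytes }),
  getPage: (doc, n) => ({ doc, n }),
  runPagePipeline: (page, pipeline) => ({
    html: `<section data-page="${page.n}"></section>`,
    regions: [],
    pageScale: { S: 12, ...(pipeline?.scaleOverrides ?? {}) },
  }),
  post: (m) => posted.push(m),
});

handle({ type: 'process', bytes: new Uint8Array([37, 80, 68, 70]) });
handle({ type: 'reprocess', page: 3, pipeline: { scaleOverrides: { R_Y_BAND: 0.8 } } });
console.log(posted[0].type, posted[0].reprocess, posted[0].page);
```

Injecting the pipeline functions keeps the caching and routing logic testable without a PDF library.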
The reprocess: true flag routes the message away from the normal HTML accumulation and into onReprocessResult, which calls patchPageHtml(pageNum, html):
export function patchPageHtml(pageNum, newHtml) {
  for (const id of SURFACE_IDS) {
    const container = document.getElementById(id);
    const existing = container.querySelector(`[data-page="${pageNum}"]`);
    if (!existing) continue;
    const newSection = /* parse newHtml, find [data-page] */;
    existing.replaceWith(newSection);
    initTableFeatures(container);
    hydrateImages(container);
  }
  // Rebuild Monaco model from updated html-preview
  state.pdf1.extractedHTML = preview.innerHTML;
  editor.getModel()?.setValue(state.pdf1.extractedHTML);
}
Only the <section data-page="N"> element is replaced. Other pages are untouched. Monaco is updated from the live DOM so the Editor tab stays in sync.
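The routing on the reprocess flag might look like the sketch below. onReprocessResult is named in the text; routeWorkerMessage and accumulatePage are hypothetical names for the dispatcher and the normal accumulation path, and the handlers here only record what they would do.

```javascript
// Hypothetical main-thread router for worker 'page' messages.
function routeWorkerMessage(msg, handlers) {
  if (msg.type !== 'page') return 'ignored';
  if (msg.reprocess === true) {
    handlers.onReprocessResult(msg); // patch one page in place
    return 'patched';
  }
  handlers.accumulatePage(msg); // normal first-pass HTML accumulation
  return 'accumulated';
}

// Stub handlers that log instead of touching the DOM.
const routeLog = [];
const handlers = {
  accumulatePage: (m) => routeLog.push(`append page ${m.page}`),
  onReprocessResult: (m) => routeLog.push(`patch page ${m.page}`),
};

const html = '<section data-page="1"></section>';
routeWorkerMessage({ type: 'page', page: 1, html }, handlers);
routeWorkerMessage({ type: 'page', reprocess: true, page: 1, html }, handlers);
console.log(routeLog);
```

Keeping the two paths separate means a re-extracted page never gets appended a second time to the accumulated document.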
Canvas coordinate alignment
The geometry analyzer (pdfAnalyzer) runs at scale 1.5. The geometry worker runs at scale 2.0. Region bboxes from the worker are in 2.0-scale viewport space. The canvas is sized to fit 540px maximum width based on the 1.5-scale page dimensions. The correct scaling factor for overlaying regions:
const workerVpW = pg.widthPx * (2.0 / 1.5);
const regionScale = canvas.width / workerVpW;
This converts from the worker's 2.0-scale viewport to the canvas's scaled width in one step. The same factor applies to ghost overlay geometry.
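A worked example makes the factor concrete. The sketch below assumes a US Letter page (612 pt wide, so 918 px at scale 1.5) and a bbox shaped as { x, y, width, height }; only bbox.y appears in the code above, so the other fields, like the helper names regionScaleFor and toCanvasRect, are assumptions.

```javascript
const ANALYZER_SCALE = 1.5; // pdfAnalyzer render scale (from the text)
const WORKER_SCALE = 2.0;   // geometry worker render scale (from the text)

// Factor that maps worker-viewport coordinates onto the canvas.
function regionScaleFor(canvasWidth, analyzerWidthPx) {
  const workerVpW = analyzerWidthPx * (WORKER_SCALE / ANALYZER_SCALE);
  return canvasWidth / workerVpW;
}

// Assumed bbox shape: { x, y, width, height } in worker viewport pixels.
function toCanvasRect(bbox, regionScale) {
  return {
    x: bbox.x * regionScale,
    y: bbox.y * regionScale,
    width: bbox.width * regionScale,
    height: bbox.height * regionScale,
  };
}

// 918px wide at scale 1.5 corresponds to 1224px at scale 2.0.
const regionScale = regionScaleFor(540, 918);
const rect = toCanvasRect({ x: 612, y: 200, width: 306, height: 100 }, regionScale);
console.log(regionScale.toFixed(4)); // 0.4412
console.log(rect.x.toFixed(1));      // 270.0, the middle of the 540px canvas
```

A region starting at the horizontal midpoint of the worker viewport (x = 612 of 1224) lands at the midpoint of the canvas, which is a quick sanity check for the overlay alignment.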
What comes next
Phase 2 of the Analysis panel adds reclassify-by-click (click a region bbox on the canvas to override its type), a per-page re-extraction status indicator (showing which pages have been modified from the original extraction), and image thumbnails on hover via an on-demand 'image-data' request/response channel separate from the main 'page' stream.