Pdf Processor

Coordinate Spaces Are Not Optional: A Silent Bug in PDF Extraction

2026-07-11

TLDR: PDF.js gives you text positions in PDF user space and text widths in PDF points, but viewport-space coordinates are in screen pixels. Add a viewport-space position and a PDF-point width in the same expression and, at scale 1.5x, the width term comes out a third too small. Four downstream heuristics (underline detection, column coverage, Y-band tolerance, paragraph gap) silently produced wrong output. Fix: derive scale from the viewport matrix and keep viewport-space values alongside the originals on every record. This is part 1 of a four-part series on silent math bugs in our PDF editor.

Repo: tools/pdf-processor

The setup

The PDF processor uses pdfjs-dist to render and parse PDFs. PDF.js gives you two things on every page: a list of vector path operators (opList) and a list of text items (textContent.items).

Both come with positional information. Both use PDF user space. The origin is at the bottom-left. Y increases upward. Coordinates are in points (1/72 inch).

The viewport is what you actually paint. We render at scale 1.5x, so the viewport's transform looks like:

viewport.transform = [1.5, 0, 0, -1.5, 0, height]

The -1.5 flips Y so the screen origin is top-left. The matrix takes a point in PDF user space and produces a point in viewport (screen) space.
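To make the matrix concrete, here is a minimal sketch of applying a six-element transform to a point. The helper name applyTransform and the 612 x 792 pt page are assumptions for illustration, and the sketch assumes height in the matrix above means the viewport height in pixels (page height times scale). In real code, viewport.convertToViewportPoint does this for you.

```javascript
// Apply a 6-element affine matrix [a, b, c, d, e, f] to a point:
//   x' = a*x + c*y + e
//   y' = b*x + d*y + f
function applyTransform(m, x, y) {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

// Assumed fixture: a 612 x 792 pt page rendered at scale 1.5.
const scale = 1.5;
const pageHeightPt = 792;
const viewportHeightPx = pageHeightPt * scale; // 1188
const vpT = [scale, 0, 0, -scale, 0, viewportHeightPx];

// One inch from the left edge and one inch up from the bottom, in PDF points.
const [vx, vy] = applyTransform(vpT, 72, 72);
console.log(vx, vy); // 108 1080
```

The point sits 72 pt above the bottom of the page, so after the Y flip it lands 108 px above the bottom of a 1188 px viewport, at y = 1080.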

Our extraction pipeline has two coordinate-space discontinuities:

Vector segments are baked through the viewport transform inside ctmAdapter.js. They emerge in viewport space.
Text items stay in PDF user space. We transform their positions on demand with viewport.convertToViewportPoint.

That second part is the trap.

The trap

PDF.js gives you item.transform[4] and item.transform[5] as the text's PDF-space position. It also gives you item.width and a font size you can derive from item.transform[3]. Those are not transformed by the viewport. The position is in PDF points. The width is in PDF points. The font size is in PDF points.

If you transform the position to viewport space and then add the untransformed width to it, you have just added apples to oranges: pixels plus points.

In our case, this looked like:

// inside contextClassifier (the buggy version)
const [vx, vy] = toViewport(vpT, item.transform[4], item.transform[5]);
const width = item.width || (fontSize * 0.5 * (item.str.length || 1));
return { idx, vx, vy, fontSize, width };

Then later, in the underline-detection heuristic:

const textXEnd = tm.vx + tm.width;   // viewport-X + PDF-points-width
if (yDist >= -1 && yDist <= 5 &&
    tm.vx <= hXMax + 2 && textXEnd >= hXMin - 2) {
  // ...
}

tm.vx is in viewport pixels. tm.width is in PDF points. At scale 1.5, the width should have been multiplied by 1.5 before the addition, so the computed extent is 33 percent narrower than the actual visual extent of the text (1 / 1.5 is about 0.67). Most underlines escaped detection because the math thought the text didn't reach far enough to overlap them.
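The arithmetic behind that percentage can be sketched in a few lines. The numbers (a 200 pt wide item starting at 100 pt) and the helper widthShortfall are made up for illustration; they are not values from our pipeline.

```javascript
// Fraction by which an unscaled width undershoots the painted width.
function widthShortfall(scale) {
  return 1 - 1 / scale;
}

const scale = 1.5;
const vx = 100 * scale; // left edge in viewport pixels: 150
const widthPt = 200;    // item.width, still in PDF points

const buggyXEnd = vx + widthPt;         // 350: pixels + points
const fixedXEnd = vx + widthPt * scale; // 450: pixels + pixels

console.log(buggyXEnd, fixedXEnd);           // 350 450
console.log(widthShortfall(1.5).toFixed(3)); // 0.333
console.log(widthShortfall(2).toFixed(3));   // 0.500
```

The shortfall grows with the render scale, and at scale 1 it is zero, which is one reason a bug like this can stay invisible in a test that renders at 1x.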

Four broken heuristics, one root cause

The same mismatch broke:

The column-coverage map (which fills a screen-pixel-indexed array with text widths in PDF points)
The Y-band tolerance (bodyFontSize * 0.45 applied to viewport-Y values)
The paragraph-gap detector (bodyFontSize * 1.8 applied to viewport-Y differences)
The underline detector above

Four heuristics, one root cause: a unit mismatch hidden inside an addition.
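As a rough sketch of what unit-consistent versions of the Y thresholds and the coverage map could look like: the 0.45 and 1.8 factors are the ones quoted above, while the function names, the 20-column array, and the sample values are hypothetical.

```javascript
// Convert the body font size to viewport pixels once, then derive the
// pixel thresholds from it, so they can be compared against viewport-Y values.
function makeYThresholds(bodyFontSizePt, scaleY) {
  const bodyFontPx = bodyFontSizePt * scaleY;
  return {
    yBandTolPx: bodyFontPx * 0.45,
    paragraphGapPx: bodyFontPx * 1.8,
  };
}

// Mark the pixel columns covered by one text item. The width arrives in
// PDF points and is scaled before it touches the pixel-indexed array.
function markCoverage(coverage, vx, widthPt, scaleX) {
  const start = Math.max(0, Math.floor(vx));
  const end = Math.min(coverage.length, Math.ceil(vx + widthPt * scaleX));
  for (let x = start; x < end; x++) coverage[x] = 1;
  return coverage;
}

const thresholds = makeYThresholds(10, 1.5);
const coverage = markCoverage(new Uint8Array(20), 3, 4, 1.5);
const coveredColumns = coverage.reduce((sum, v) => sum + v, 0);
console.log(thresholds, coveredColumns);
```

With a 4 pt wide item at scale 1.5, six pixel columns get marked. Passing the raw point width would have marked only four.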

The fix

// derive the effective scale from the viewport's column vectors
const scaleX = Math.hypot(vpT[0], vpT[1]) || 1;
const scaleY = Math.hypot(vpT[2], vpT[3]) || 1;

const textMeta = textItems.map((item, idx) => {
  const [vx, vy] = toViewport(vpT, item.transform[4], item.transform[5]);
  const fontSizePt = Math.abs(item.transform[3] || 12);
  const widthPt = item.width || (fontSizePt * 0.5 * (item.str.length || 1));
  return {
    idx,
    vx, vy,
    vWidth: widthPt * scaleX,    // viewport pixels for vx-relative checks
    vFont:  fontSizePt * scaleY, // viewport pixels for vy-relative checks
    fontSize: fontSizePt,        // PDF points for ratio comparisons
  };
});

We keep both. vWidth and vFont are viewport pixels and get used wherever a comparison hits a viewport-space coordinate. fontSize stays in PDF points and gets used wherever the comparison is a ratio between two font sizes (heading detection compares lineFontSize / bodyFontSize, where both are in the same unit and the unit cancels out).
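A short sketch of how a record carrying both units might be consumed. It covers only the horizontal half of the underline check, and the 1.2 heading ratio is a hypothetical threshold, not the one our classifier uses.

```javascript
// Horizontal overlap between a text record and a horizontal segment.
// Every operand is in viewport pixels.
function overlapsHorizontally(tm, hXMin, hXMax, slackPx = 2) {
  const textXEnd = tm.vx + tm.vWidth;
  return tm.vx <= hXMax + slackPx && textXEnd >= hXMin - slackPx;
}

// Ratio test: the unit cancels, so PDF points are fine here.
// minRatio = 1.2 is an assumed value for illustration.
function isHeading(lineFontSize, bodyFontSize, minRatio = 1.2) {
  return lineFontSize / bodyFontSize >= minRatio;
}

const scale = 1.5;
// Assumed record: text from 100 pt to 300 pt, 18 pt font, rendered at 1.5x.
const tm = { vx: 100 * scale, vWidth: 200 * scale, fontSize: 18, vFont: 18 * scale };

// An underline spanning 240..300 pt, which is 360..450 px in the viewport.
console.log(overlapsHorizontally(tm, 360, 450));                     // true
console.log(overlapsHorizontally({ ...tm, vWidth: 200 }, 360, 450)); // false (unscaled width)
console.log(isHeading(tm.fontSize, 12));                             // true
console.log(isHeading(tm.vFont, 12 * scale));                        // true, same ratio
```

The second call reproduces the original bug: with the width left in points, the text appears to end at 350 px and never reaches the underline.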

The lesson

Small but with teeth: when an SDK gives you positions in one space and dimensions in another, never mix them in a single arithmetic expression. Keep the converted values right next to the originals on every record.

This bug survived smoke tests because nothing crashed. The underlines were missing but the page still rendered. The columns were detected but with subtle off-by-fractions. Quiet bugs are the expensive ones.
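One way a check could have caught this, sketched under assumptions: feed the same PDF-space geometry through the heuristic at several scales and require the same answer each time. The fixture, the function names, and the choice of scales are all hypothetical, and the fixed 2 px slack means the property only holds for geometry that is not right at the edge.

```javascript
// Horizontal underline check on PDF-space inputs, with the width
// conversion injected so the fixed and buggy variants can be compared.
function underlineHit(textPt, linePt, scale, toWidthPx) {
  const vx = textPt.x * scale;
  const textXEnd = vx + toWidthPx(textPt.width, scale);
  const hXMin = linePt.xMin * scale;
  const hXMax = linePt.xMax * scale;
  return vx <= hXMax + 2 && textXEnd >= hXMin - 2;
}

const fixedWidth = (widthPt, scale) => widthPt * scale; // points -> pixels
const buggyWidth = (widthPt) => widthPt;                // points left as-is

// True when the heuristic gives the same answer at every scale.
function sameAtEveryScale(toWidthPx, scales) {
  const text = { x: 100, width: 200 };    // text spans 100..300 pt
  const line = { xMin: 250, xMax: 300 };  // underline under its right part
  const results = scales.map((s) => underlineHit(text, line, s, toWidthPx));
  return results.every((r) => r === results[0]);
}

console.log(sameAtEveryScale(fixedWidth, [1, 1.5, 2])); // true
console.log(sameAtEveryScale(buggyWidth, [1, 1.5, 2])); // false
```

The buggy variant passes at scale 1 and fails at 1.5 and 2, so any test that only renders at 1x would have stayed green.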

Next in the series: Why semantic and spatial layouts can't share a ruler. Same problem in different clothes.
