Pdf Processor

Postmortem: Five Wrong Assumptions About PDF.js Image Extraction

2026-05-31

TLDR

The high-level strategy (render a page to canvas, crop image regions, encode as base64) was always correct. What failed five times were assumptions about PDF.js internals: how its canvas factory works, what an image XObject is, how scale affects quality, and what a bounding box represents. None of these are documented in obvious places.

Repo: tools/pdf-processor

The Assumption That Seemed Reasonable (Five Times)

Extracting images from a PDF in a browser worker is not conceptually hard. Find image regions in the operator list, render the page to a canvas, crop the regions, encode to base64. It still took four sessions to make it work.

Bug 1: The Canvas Factory Option Is a Class, Not an Instance

Assumption: passing a canvas factory object to getDocument() makes PDF.js use it.

PDF.js needs a canvas implementation to render pages. In a Web Worker, document.createElement('canvas') does not exist. The fix was to pass a custom factory that uses OffscreenCanvas. This is the right approach. The wrong part was in the option name and value type.

PDF.js checks for a constructor, not an instance. The option key is CanvasFactory with a capital C:

// wrong: silently ignored
const canvasFactory = { create(w, h) { return { canvas: new OffscreenCanvas(w, h) }; } };
pdfjsLib.getDocument({ data: bytes, canvasFactory });

// correct: PDF.js will instantiate this class
class OffscreenCanvasFactory {
  create(width, height) {
    const canvas = new OffscreenCanvas(width, height);
    return { canvas, context: canvas.getContext('2d') };
  }
  reset(canvasAndCtx, w, h) {
    canvasAndCtx.canvas.width = w;
    canvasAndCtx.canvas.height = h;
  }
  destroy(canvasAndCtx) {
    canvasAndCtx.canvas.width = 0;
    canvasAndCtx.canvas.height = 0;
  }
}
pdfjsLib.getDocument({ data: bytes, CanvasFactory: OffscreenCanvasFactory });

Passing a lowercase canvasFactory instance does nothing. PDF.js falls through to DefaultCanvasFactory, which calls document.createElement, which crashes in a worker.
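
Because the wrong key fails silently, a small pre-flight check on the options object can surface the mistake before PDF.js ever sees it. The following is a minimal sketch, assuming a PDF.js version that behaves as described above (a class under the CanvasFactory key). The helper checkCanvasFactoryOption is hypothetical and not part of PDF.js; other PDF.js versions may expect a different option shape.

```javascript
// Hypothetical guard, not part of PDF.js: inspects the options object
// before it is handed to getDocument() and returns a list of problems.
function checkCanvasFactoryOption(options) {
  const problems = [];
  if ('canvasFactory' in options) {
    problems.push('lowercase "canvasFactory" key found; expected "CanvasFactory"');
  }
  const Factory = options.CanvasFactory;
  if (Factory === undefined) {
    problems.push('"CanvasFactory" is missing; the default factory will be used');
  } else if (typeof Factory !== 'function') {
    problems.push('"CanvasFactory" must be a class (constructor), not an instance');
  } else if (typeof Factory.prototype?.create !== 'function') {
    problems.push('"CanvasFactory" class has no create() method');
  }
  return problems;
}
```

Calling it with the "wrong" options from above returns two problems (the lowercase key and the missing CanvasFactory); calling it with a class under CanvasFactory returns an empty array.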

Bug 2: One XObject Is Not One Visual Image

Assumption: each paintImageXObject operator in the PDF operator list corresponds to one visual image.

After fixing the canvas factory, images appeared: 731 of them on a single chart page. A single engineering figure with tick marks, icons, and legend elements is drawn as hundreds of separate XObject calls in the PDF operator stream.

The operator list is a paint API. It does not model visual objects. The same visual image can appear as dozens of operators.

Fix: filter out elements smaller than a minimum size threshold, then merge adjacent bboxes within a gap tolerance into composite clusters. 731 operators collapsed to 3 meaningful image blocks.
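
The filter-then-merge step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the repo: bboxes are assumed to be plain { x, y, w, h } objects, and the names clusterBboxes, MIN_SIDE, and GAP, along with their values, are hypothetical.

```javascript
// Illustrative thresholds; the values used in the repo are not known.
const MIN_SIDE = 8; // drop boxes with either side smaller than this
const GAP = 4;      // merge boxes that sit within this distance of each other

// True when the two boxes overlap or lie within `gap` of each other on both axes.
function near(a, b, gap) {
  return (
    a.x - gap <= b.x + b.w &&
    b.x - gap <= a.x + a.w &&
    a.y - gap <= b.y + b.h &&
    b.y - gap <= a.y + a.h
  );
}

// Smallest box that contains both inputs.
function union(a, b) {
  const x = Math.min(a.x, b.x);
  const y = Math.min(a.y, b.y);
  return {
    x,
    y,
    w: Math.max(a.x + a.w, b.x + b.w) - x,
    h: Math.max(a.y + a.h, b.y + b.h) - y,
  };
}

function clusterBboxes(bboxes, minSide = MIN_SIDE, gap = GAP) {
  // Step 1: discard tiny fragments.
  const clusters = bboxes.filter((b) => b.w >= minSide && b.h >= minSide);
  // Step 2: keep merging until no two clusters are within `gap`.
  let merged = true;
  while (merged) {
    merged = false;
    outer: for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        if (near(clusters[i], clusters[j], gap)) {
          clusters[i] = union(clusters[i], clusters[j]);
          clusters.splice(j, 1);
          merged = true;
          break outer; // restart, since the grown box may now reach others
        }
      }
    }
  }
  return clusters;
}
```

The restart-after-every-merge loop is simple rather than fast. For a few hundred boxes per page that is usually acceptable; a spatial index would be the next step if it is not.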

Bug 3: One Global Scale Does Not Handle Both Text and Image Quality

Assumption: rendering at a higher global scale produces better image quality.

The geometry pipeline ran at scale 2.0 (192 DPI). Images at 2.0 looked blurry when CSS-stretched to fill their displayed size. Raising the global scale helped text and segment detection but not image quality, because the crop dimensions were still smaller than what a high-DPI display needs.

The fix: a separate 4x render exclusively for image crops. The geometry pipeline stays at 2.0 for text. Image extraction uses 4.0, with a ratio constant converting 2.0-scale bbox coordinates to 4.0-scale canvas pixels:

const IMG_SCALE = 4.0;
const upRatio = IMG_SCALE / 2.0; // bbox coords are at geometry scale
const imgViewport = page.getViewport({ scale: IMG_SCALE });
// crop: multiply bbox coords by upRatio before using as canvas pixel offsets

Two renders are intentional. The quality requirement for images (high-DPI crops) is different from the quality requirement for geometry (accurate positions).
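
The coordinate conversion between the two renders might look like the sketch below. It assumes bboxes are { x, y, w, h } objects measured at the geometry scale; the helper name toCropRect and the clamping and rounding choices are assumptions for illustration, not the repo's code.

```javascript
const GEOMETRY_SCALE = 2.0; // scale the bboxes were measured at
const IMG_SCALE = 4.0;      // scale of the image-only render

// Hypothetical helper: maps a geometry-scale bbox to an integer pixel
// rectangle on the image-scale canvas, clamped to the canvas bounds.
function toCropRect(bbox, canvasWidth, canvasHeight) {
  const upRatio = IMG_SCALE / GEOMETRY_SCALE;
  // Round outward so the crop never cuts into the image edge.
  const x = Math.max(0, Math.floor(bbox.x * upRatio));
  const y = Math.max(0, Math.floor(bbox.y * upRatio));
  const right = Math.min(canvasWidth, Math.ceil((bbox.x + bbox.w) * upRatio));
  const bottom = Math.min(canvasHeight, Math.ceil((bbox.y + bbox.h) * upRatio));
  return { x, y, w: Math.max(0, right - x), h: Math.max(0, bottom - y) };
}
```

The resulting rectangle can then be used as the source region when copying from the 4x canvas into a crop-sized OffscreenCanvas, for example with drawImage.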

Bug 4: Bbox Width Is Not Image Width

Assumption: the bbox from the operator list is a reliable proxy for image content dimensions, suitable for sizing the displayed element.

All images rendered at 100% container width regardless of actual image size. A small logo occupied the full page width. Using max-width: 100% instead of width: 100% was the right direction but produced letterboxing: blank space around the image inside its container.

The bbox is the CTM-transformed destination rectangle, which is where the image is painted on the page. It is not the pixel dimensions of the image content. They are often different, especially for images that are scaled or placed at non-1:1 ratios in the PDF.
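
To make the distinction concrete: in PDF, an image is painted into the unit square and the current transformation matrix (CTM) maps that square onto the page. The sketch below derives a bbox from a CTM in PDF user space, before any viewport transform. The function names and the sample logo values are hypothetical, chosen only to illustrate the point.

```javascript
// Apply a PDF-style matrix [a, b, c, d, e, f] to a point [x, y].
function applyMatrix([a, b, c, d, e, f], [x, y]) {
  return [a * x + c * y + e, b * x + d * y + f];
}

// Transform the four corners of the unit square and take their extent.
function bboxFromCtm(ctm) {
  const corners = [[0, 0], [1, 0], [0, 1], [1, 1]].map((p) => applyMatrix(ctm, p));
  const xs = corners.map((p) => p[0]);
  const ys = corners.map((p) => p[1]);
  const x = Math.min(...xs);
  const y = Math.min(...ys);
  return { x, y, w: Math.max(...xs) - x, h: Math.max(...ys) - y };
}

// A hypothetical 64x64-pixel logo stretched into a 300x100 rectangle.
const logo = { pixelWidth: 64, pixelHeight: 64, ctm: [300, 0, 0, 100, 50, 700] };
const logoBbox = bboxFromCtm(logo.ctm);
// logoBbox is { x: 50, y: 700, w: 300, h: 100 }
```

The 64x64 pixel size of the image never enters the calculation. The bbox depends only on the CTM, which is why it cannot stand in for the content dimensions.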

Bug 5: Store the Crop Dimensions, Not the Bbox Dimensions

Assumption: the crop pixel dimensions can be derived later from layout coordinates.

They cannot be derived reliably, because the bbox and the content size differ (see Bug 4).

Fix: store actual crop pixel dimensions alongside the encoded image at extraction time:

extractedImages[id] = {
  dataUrl: 'data:image/png;base64,' + encoded,
  pw: cropPixelWidth,   // actual crop width at 4x scale
  ph: cropPixelHeight,
};

// At render time, use natural dimensions scaled back down
const { pw, ph } = extractedImages[id];
const natW = Math.round(pw / 4); // 4 = image render scale
const natH = Math.round(ph / 4);
// <img width="natW" height="natH" style="max-width: 100%; height: auto">

The crop is the source of truth for size. No object-fit, no aspect-ratio wrapper. The image fills exactly the space its content occupies.

What Survived

The render-and-crop strategy was correct from the start. PDF.js does not expose native image pixel data without rendering. The canvas approach is what every PDF library does at this abstraction level. What changed across five iterations was the accuracy of the implementation details, not the architecture.

The Lesson

PDF.js is a rendering engine, not a structured object API. Option keys are case-sensitive and some require class constructors. The operator list is a paint instruction stream, not an object manifest. Bboxes are destination rectangles in page space, not content dimensions. These are all findable in the source code. They are not documented in the README.
