Post-mortem: Five Bugs in the PDF Image Extraction Pipeline
TLDR: The image extraction pipeline went through five distinct failure modes before it worked correctly. Each fix uncovered the next problem. The root causes were all wrong assumptions about how PDF.js works internally — and none of them were documented anywhere obvious.
What We Were Building
The PDF Processor renders extracted HTML from a PDF. Paragraphs, tables, headings — all correctly assembled from the vector operator list. Images were the last piece. The plan was straightforward: detect image XObjects in the operator list, crop them from a rendered canvas, embed as base64 data URLs.
The plan took four separate sessions to complete.
Bug 1: The Worker Crash
The first attempt called page.render() inside the geometry Web Worker to get pixel data from the canvas. It crashed immediately:
Cannot read properties of undefined (reading 'createElement')
The error was thrown deep inside PDF.js, from DOMCanvasFactory.create(). The default canvas factory does document.createElement('canvas') to allocate scratch canvases during rendering. In a Web Worker, document does not exist.
First wrong assumption: that you could just call page.render() inside a worker and it would figure out the environment.
The fix attempt was to pass a canvasFactory instance to getDocument():
const canvasFactory = { create(w,h) { return { canvas: new OffscreenCanvas(w,h), context: ... }; } };
pdfjsLib.getDocument({ data: bytes, canvasFactory });
This was silently ignored. PDF.js does not use an instance. It uses a class:
const CanvasFactory = src.CanvasFactory || DefaultCanvasFactory;
// ...
new CanvasFactory({ ownerDocument, enableHWA });
The option key is CanvasFactory with a capital C, and the value must be a constructor. Passing a plain object instance as a lowercase canvasFactory option does nothing — PDF.js falls through to DefaultCanvasFactory, which uses document.createElement.
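The fallthrough can be reproduced with a toy model of the lookup quoted above. This is not PDF.js code: `resolveCanvasFactory`, `DefaultCanvasFactory`, and `OffscreenLikeFactory` are stand-ins written for illustration, on the assumption that the option is read exactly as in the two quoted lines.

```javascript
// Toy stand-in for the library's default factory (illustrative only).
class DefaultCanvasFactory {
  create() {
    throw new Error("needs document.createElement");
  }
}

// Hypothetical model of the quoted lookup: only the capitalized key is read,
// and whatever is found there is called with `new`.
function resolveCanvasFactory(src) {
  const CanvasFactory = src.CanvasFactory || DefaultCanvasFactory;
  return new CanvasFactory({});
}

// Stand-in for a worker-safe factory; returns a plain object instead of a canvas.
class OffscreenLikeFactory {
  create(width, height) {
    return { width, height };
  }
}

// Lowercase key holding an instance: never read, so the default is constructed.
const ignored = resolveCanvasFactory({ canvasFactory: new OffscreenLikeFactory() });
// Capitalized key holding a class: the custom factory is constructed.
const used = resolveCanvasFactory({ CanvasFactory: OffscreenLikeFactory });

console.log(ignored instanceof DefaultCanvasFactory); // true
console.log(used instanceof OffscreenLikeFactory); // true
```

Nothing throws in the lowercase case, which is why the mistake only surfaces later, when the default factory reaches for `document`.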
The correct fix:
class OffscreenCanvasFactory {
  create(width, height) {
    const canvas = new OffscreenCanvas(width, height);
    return { canvas, context: canvas.getContext('2d') };
  }
  reset(canvasAndCtx, width, height) {
    canvasAndCtx.canvas.width = width;
    canvasAndCtx.canvas.height = height;
  }
  destroy(canvasAndCtx) {
    // Zero the dimensions to release the backing store, then drop the references.
    canvasAndCtx.canvas.width = 0;
    canvasAndCtx.canvas.height = 0;
    canvasAndCtx.canvas = null;
    canvasAndCtx.context = null;
  }
}
const pdf = await pdfjsLib.getDocument({ data: bytes, CanvasFactory: OffscreenCanvasFactory }).promise;
Bug 2: 731 Image Regions Per Page
Once rendering worked, images appeared in the output — but as hundreds of small divs scattered across the page. A single chart page reported 731 image regions.
The cause: every paintImageXObject operator in the operator list becomes one imageMeta entry, and each entry becomes one region. Engineering PDFs with icon-heavy figures have dozens of XObject calls per chart: axis tick marks, legend icons, small clipart elements.
Wrong assumption: that each image XObject corresponds to a distinct, meaningful visual element.
The fix was two-part. First, filter out images smaller than 20x20 viewport pixels — decorative elements, tick marks, rule lines. Second, merge adjacent image bboxes within an 8px gap into composite clusters:
const MERGE_GAP = 8; // px
const mergedImages = [];
for (const img of significantImages) {
  const { x, y, w, h } = img.bbox;
  // Far edges of this bbox, used by both the overlap test and the merge.
  const right = x + w;
  const bottom = y + h;
  const cluster = mergedImages.find(c =>
    x <= c.right + MERGE_GAP && right >= c.x - MERGE_GAP &&
    y <= c.bottom + MERGE_GAP && bottom >= c.y - MERGE_GAP
  );
  if (cluster) {
    // Grow the existing cluster to cover this bbox.
    cluster.x = Math.min(cluster.x, x); cluster.right = Math.max(cluster.right, right);
    cluster.y = Math.min(cluster.y, y); cluster.bottom = Math.max(cluster.bottom, bottom);
  } else {
    mergedImages.push({ id: img.id, x, y, right, bottom });
  }
}
731 regions collapsed to 3 meaningful image blocks.
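The loop above merges in a single pass, so a cluster that grows can end up within the gap of another cluster without the two being joined. The sketch below is one possible variation, not the pipeline's actual code: it combines the size filter with a merge that repeats until nothing changes. `filterAndMerge` and `near` are hypothetical names, the input shape `[{ id, bbox: { x, y, w, h } }]` is assumed, and "smaller than 20x20" is read here as "either side under 20 px" so that thin rule lines are dropped too.

```javascript
const MIN_SIZE = 20; // px, size threshold from the text
const MERGE_GAP = 8; // px, merge gap from the text

// True when two boxes overlap or sit within `gap` pixels of each other.
function near(a, b, gap) {
  return a.x <= b.right + gap && a.right >= b.x - gap &&
         a.y <= b.bottom + gap && a.bottom >= b.y - gap;
}

// Hypothetical helper: filter small fragments, then merge to a fixed point.
function filterAndMerge(images, minSize = MIN_SIZE, gap = MERGE_GAP) {
  // Keep only boxes that are at least minSize on both sides.
  let clusters = images
    .filter(({ bbox }) => bbox.w >= minSize && bbox.h >= minSize)
    .map(({ id, bbox }) => ({
      id, x: bbox.x, y: bbox.y, right: bbox.x + bbox.w, bottom: bbox.y + bbox.h,
    }));

  // Repeat until a full pass performs no merge.
  let merged = true;
  while (merged) {
    merged = false;
    const next = [];
    for (const box of clusters) {
      const hit = next.find(c => near(c, box, gap));
      if (hit) {
        hit.x = Math.min(hit.x, box.x);
        hit.y = Math.min(hit.y, box.y);
        hit.right = Math.max(hit.right, box.right);
        hit.bottom = Math.max(hit.bottom, box.bottom);
        merged = true;
      } else {
        next.push({ ...box });
      }
    }
    clusters = next;
  }
  return clusters;
}

// "b" bridges "a" and "c", but arrives last; "tick" is below the size threshold.
const clusters = filterAndMerge([
  { id: "a", bbox: { x: 0, y: 0, w: 30, h: 30 } },
  { id: "c", bbox: { x: 80, y: 0, w: 30, h: 30 } },
  { id: "b", bbox: { x: 36, y: 0, w: 38, h: 30 } },
  { id: "tick", bbox: { x: 200, y: 200, w: 5, h: 5 } },
]);
console.log(clusters.length); // 1
```

In this example a single pass would leave two clusters sitting 6 px apart; the second pass joins them. The repeated pass costs more on pages with many regions, so whether it is worth it depends on how fragmented the figures are.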
Bug 3: Blurriness
Images extracted at scale: 1.5 (108 DPI, since PDF.js maps one PDF point, 1/72 inch, to one pixel at scale 1.0) looked blurry when CSS-stretched to fill their displayed size. Raising the global geometry scale to scale: 2.0 helped for text, but images were still blurry because the crop dimensions were still small.
Wrong assumption: that raising the global scale was sufficient for images.
The fix was a separate 4x render used exclusively for image crops. The geometry pipeline stays at scale: 2.0 for text and segment coordinates. Image extraction uses its own scale: 4.0 render (288 DPI), and upRatio = 2.0 converts 2.0-scale bbox coordinates to 4.0-scale canvas pixels:
const IMG_SCALE = 4.0;
const upRatio = IMG_SCALE / 2.0; // bbox coords are at scale 2.0
const imgViewport = page.getViewport({ scale: IMG_SCALE });
The two renders are intentional: you want dense pixels for images and you do not want to pay that cost for every text segment and path in the geometry pipeline.
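A minimal sketch of how a bbox measured at the geometry scale could be mapped onto the 4.0-scale canvas. `toCropRect` is a hypothetical helper, not the pipeline's actual function; the outward rounding and the clamping to the canvas bounds are assumptions added so the crop never cuts into the image or reads outside the canvas.

```javascript
// Hypothetical helper: maps a bbox measured at the geometry scale onto the
// pixel grid of the image render, clamped to the canvas bounds.
function toCropRect(bbox, upRatio, canvasWidth, canvasHeight) {
  // Round outward so the crop never trims the image edge.
  const left = Math.max(0, Math.floor(bbox.x * upRatio));
  const top = Math.max(0, Math.floor(bbox.y * upRatio));
  const right = Math.min(canvasWidth, Math.ceil((bbox.x + bbox.w) * upRatio));
  const bottom = Math.min(canvasHeight, Math.ceil((bbox.y + bbox.h) * upRatio));
  return {
    sx: left,
    sy: top,
    sw: Math.max(0, right - left),
    sh: Math.max(0, bottom - top),
  };
}

const GEOMETRY_SCALE = 2.0;
const IMG_SCALE = 4.0;
const upRatio = IMG_SCALE / GEOMETRY_SCALE;

// Example canvas size: a US Letter page (612 x 792 pt) rendered at scale 4.0.
const crop = toCropRect({ x: 10.5, y: 20, w: 100, h: 50 }, upRatio, 2448, 3168);
console.log(crop); // { sx: 21, sy: 40, sw: 200, sh: 100 }
```

The resulting `sx`, `sy`, `sw`, `sh` values are in the form that `getImageData` or `drawImage` expects for a source rectangle.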
Bug 4: Always 100% Wide
All images rendered at 100% container width regardless of how small the actual image was. A small logo occupied the full page width.
The fix: compute a proportional width from the bbox relative to the page viewport width, and use max-width: 100% instead of width: 100%.
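The proportional width might be computed along these lines. `imageStyle` is a hypothetical helper; it assumes the bbox width and the page viewport width are measured at the same scale, and the one-decimal rounding is an arbitrary choice.

```javascript
// Hypothetical helper: both widths must be measured at the same viewport scale.
function imageStyle(bboxWidth, pageViewportWidth) {
  // Without a usable page width, fall back to the cap alone.
  if (!(pageViewportWidth > 0)) {
    return "max-width: 100%; height: auto;";
  }
  // Clamp to [0, 1] so a bbox wider than the page cannot overflow the container.
  const ratio = Math.min(1, Math.max(0, bboxWidth / pageViewportWidth));
  const percent = (ratio * 100).toFixed(1);
  return `width: ${percent}%; max-width: 100%; height: auto;`;
}

// A 153 px wide logo on a 1224 px wide page (US Letter at scale 2.0).
console.log(imageStyle(153, 1224)); // width: 12.5%; max-width: 100%; height: auto;
```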
Bug 5: Letterboxing
Images were correct but surrounded by blank space. The proportional wrapper created a letterboxed container from the bbox dimensions, then the image content didn't fill it.
Wrong assumption: that the bbox from ctmAdapter is a reliable proxy for the image content dimensions.
The bbox is the CTM-transformed 1x1 destination rectangle — where the image is painted on the page. It is not the native pixel dimensions of the image content. They are often different.
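To make the distinction concrete, here is a sketch of how such a bbox arises. `unitSquareBBox` is a hypothetical helper, not the real ctmAdapter; it assumes the CTM is given as a PDF-style six-element matrix `[a, b, c, d, e, f]` and leaves out the viewport transform (y-flip and scale) that a real pipeline would also apply.

```javascript
// Hypothetical helper: `ctm` is assumed to be a PDF-style matrix [a, b, c, d, e, f],
// mapping (x, y) to (a*x + c*y + e, b*x + d*y + f).
function unitSquareBBox(ctm) {
  const [a, b, c, d, e, f] = ctm;
  // An image is painted into the unit square; transform its four corners.
  const corners = [[0, 0], [1, 0], [0, 1], [1, 1]];
  const xs = corners.map(([x, y]) => a * x + c * y + e);
  const ys = corners.map(([x, y]) => b * x + d * y + f);
  const x = Math.min(...xs);
  const y = Math.min(...ys);
  return { x, y, w: Math.max(...xs) - x, h: Math.max(...ys) - y };
}

// An image painted into a 200 x 50 rectangle at (72, 100). The result is the
// same whether the source bitmap is 64 x 64 or 4000 x 1000 pixels.
console.log(unitSquareBBox([200, 0, 0, 50, 72, 100])); // { x: 72, y: 100, w: 200, h: 50 }
```

The matrix carries no information about the bitmap's own pixel grid, which is why the bbox alone cannot tell you the content dimensions.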
The fix: store the actual crop pixel dimensions alongside the data URL, then use them directly as the displayed size:
extractedImages[meta.id] = {
  dataUrl: 'data:image/png;base64,' + btoa(binary),
  pw: sw, // actual pixel width at 4x scale
  ph: sh, // actual pixel height at 4x scale
};
// In assembler:
const natW = Math.round(imgEntry.pw / 4);
const natH = Math.round(imgEntry.ph / 4);
<img width="${natW}" height="${natH}" style="max-width: 100%; height: auto; display: block;">
No object-fit, no wrapper aspect-ratio. The crop is the truth.
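The snippet above passes a `binary` string to `btoa` without showing how it is built. One way to produce it from encoded PNG bytes is sketched below. `bytesToBase64` and `toPngDataUrl` are hypothetical helpers, and the assumption is that the bytes come from something like `OffscreenCanvas.convertToBlob()` followed by `arrayBuffer()`; the original pipeline may do this differently.

```javascript
// Hypothetical helper: builds the binary string for btoa in chunks, because
// passing a very large array to String.fromCharCode in one call can exceed
// the engine's argument limit.
function bytesToBase64(bytes, chunkSize = 0x8000) {
  let binary = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}

// Wraps already-encoded PNG bytes (a Uint8Array) in a data URL.
function toPngDataUrl(bytes) {
  return "data:image/png;base64," + bytesToBase64(bytes);
}

// Two arbitrary bytes, just to show the encoding; real input would be PNG data.
console.log(toPngDataUrl(new Uint8Array([72, 105]))); // data:image/png;base64,SGk=
```

Base64 adds roughly a third to the byte size, so large crops at 4x scale make the HTML noticeably heavier.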
What Survived the Planned Approach
The high-level strategy (render the page, crop bboxes, encode as base64) was always correct. In our setup, PDF.js did not expose native image pixel data without a render: page.objs stayed empty until the rendering pipeline ran. Rendering to a canvas and cropping is a common way to get image pixels at this level.
What got thrown away: the idea that one global scale handles everything, the idea that imageMeta entries map to visual objects, and the idea that bbox dimensions are image dimensions.
Lessons
- PDF.js option keys are case-sensitive and some expect class constructors, not instances. Check the source.
- The operator list is a paint API, not an object API. The same visual image can appear as dozens of separate XObject calls.
- Two separate viewport scales are better than one compromise scale when quality requirements differ between content types.
- Store source-of-truth dimensions at extraction time. Do not derive them later from layout coordinates.