Pdf Processor

Image Extraction Deepdive

2026-07-11

Under the Hood: Extracting Images from a PDF in a Web Worker

TLDR: PDF.js can render images inside a Web Worker, but it requires replacing its canvas factory with an OffscreenCanvas implementation. Once that's done, the correct extraction model is: render the full page at high resolution, crop each image bbox from the rendered canvas, encode to base64. This is what every real PDF library does. The complexity is in getting the coordinate transforms and scale ratios right.

Why PDF.js Doesn't Just Give You Images

The natural question when building a PDF extractor is: why not call some getImages() API and get Blobs back?

PDF.js doesn't have that API, and the reason is architectural. Images in a PDF are paint commands. An image XObject appears in the operator list as paintImageXObject or paintJpegXObject, with the current CTM determining where and how large it gets drawn. The pixel data lives in the page's object dictionary, keyed by the XObject name string (e.g. img_p2_7).

This pixel data is not loaded until PDF.js processes the paint command during a render operation. getOperatorList() processes the operators and records references to image XObject names, but it does not push pixel data into page.objs. The page.objs map stays empty for image keys until you call page.render().

So the only way to get image pixels is to render the page.
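
Before rendering, the operator list is still useful for finding out which image XObjects a page references. The sketch below is a minimal illustration, not PDF.js code: `collectImageNames` is a hypothetical helper, and it assumes the operator list has the `{ fnArray, argsArray }` shape returned by `page.getOperatorList()` and that the first argument of an image paint operator is the object name. The opcode table (`pdfjsLib.OPS` in practice) is passed in as a parameter so the function can run without the library.

```javascript
// Hypothetical helper: list the image XObject names an operator list refers to.
// opList: { fnArray: number[], argsArray: any[][] }
// OPS:    opcode table, e.g. pdfjsLib.OPS (assumed to expose paintImageXObject)
function collectImageNames(opList, OPS) {
    // paintJpegXObject may be missing in some PDF.js versions, so filter undefined.
    const imageOps = new Set(
        [OPS.paintImageXObject, OPS.paintJpegXObject].filter((op) => op !== undefined)
    );
    const names = [];
    for (let i = 0; i < opList.fnArray.length; i++) {
        const args = opList.argsArray[i];
        if (imageOps.has(opList.fnArray[i]) && args) {
            names.push(args[0]); // assumed: first argument is the XObject name
        }
    }
    return names;
}
```

This only yields names and paint order. The pixels still have to come from a render.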

The Web Worker Problem

PDF.js's default rendering pipeline allocates canvas elements via DOMCanvasFactory.create():

// Inside PDF.js source
class DOMCanvasFactory {
    create(width, height) {
        const canvas = document.createElement('canvas');  // <-- Worker death
        canvas.width = width;
        canvas.height = height;
        return { canvas, context: canvas.getContext('2d') };
    }
}

Web Workers don't have document. The crash is immediate and unhelpful:

Cannot read properties of undefined (reading 'createElement')

The fix is to replace the factory with one that uses OffscreenCanvas. PDF.js accepts a custom factory via getDocument(), but the option must be a class constructor, not an instance. PDF.js does:

const CanvasFactory = src.CanvasFactory || DefaultCanvasFactory;
// later, internally:
this._canvasFactory = new CanvasFactory({ ownerDocument, enableHWA });

Passing a plain object instance as canvasFactory (lowercase) is silently ignored. The correct form:

class OffscreenCanvasFactory {
    create(width, height) {
        const canvas = new OffscreenCanvas(width, height);
        return { canvas, context: canvas.getContext('2d') };
    }
    reset(canvasAndCtx, width, height) {
        canvasAndCtx.canvas.width  = width;
        canvasAndCtx.canvas.height = height;
    }
    destroy(canvasAndCtx) {
        canvasAndCtx.canvas.width  = 0;
        canvasAndCtx.canvas.height = 0;
        canvasAndCtx.canvas  = null;
        canvasAndCtx.context = null;
    }
}

const pdf = await pdfjsLib.getDocument({
    data: bytes,
    CanvasFactory: OffscreenCanvasFactory,  // capital C, class constructor
}).promise;

OffscreenCanvas is available in all modern browser workers (Chromium, Firefox, Safari 16.4+). The gate in geometryWorker checks typeof OffscreenCanvas !== 'undefined' before entering the image extraction path.
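
A gate of this kind might look like the following. This is a sketch under assumptions: `canExtractImages` is a hypothetical name, and the extra check that a 2D context can actually be created goes beyond the `typeof` test described above.

```javascript
// Hypothetical feature gate for the image extraction path.
// `scope` defaults to the worker/global scope; it is a parameter for testing.
function canExtractImages(scope = globalThis) {
    if (typeof scope.OffscreenCanvas === 'undefined') return false;
    try {
        // Assumption: also require a working 2D context, not just the constructor.
        const ctx = new scope.OffscreenCanvas(1, 1).getContext('2d');
        return ctx !== null && ctx !== undefined;
    } catch (err) {
        return false;
    }
}
```

When the gate returns false, the caller can skip image extraction and still return geometry and text.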

The Two-Scale Architecture

The geometry pipeline runs at scale: 2.0 (192 DPI). This produces viewport coordinates for text items, segment positions, and image bboxes. All the coordinate math in contextClassifier.js, pathReconciler.js, and pageAssembler.js works in this scale.

Image crops need to be sharper. At 192 DPI, a 200x150 pixel image bbox produces a 200x150 crop that then gets CSS-stretched to fill its displayed container. The result is blurry.

The solution is a separate high-resolution render exclusively for image extraction:

const IMG_SCALE = 4.0;        // 384 DPI: dedicated image render
const upRatio   = IMG_SCALE / 2.0;  // = 2.0: converts 2x bbox coords to 4x pixels
const imgViewport = page.getViewport({ scale: IMG_SCALE });
const cw = Math.round(imgViewport.width);
const ch = Math.round(imgViewport.height);

const pageCanvas = new OffscreenCanvas(cw, ch);
await page.render({
    canvasContext: pageCanvas.getContext('2d'),
    viewport: imgViewport,
}).promise;

Geometry still runs at 2.0. The second render only runs if imageMeta.length > 0. For text-only pages this costs nothing.
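
The second render is not free on pages that do have images, so it can help to estimate its size up front. The sketch below is illustrative only: `imageRenderPlan` is a hypothetical helper, it assumes an unrotated page whose viewport size is the page size in points times the scale, and it assumes 4 bytes per pixel (RGBA) for the canvas backing store.

```javascript
// Hypothetical helper: decide whether the high-res render runs and how big it is.
function imageRenderPlan(pageWidthPt, pageHeightPt, imageCount, imgScale = 4.0) {
    if (imageCount === 0) return null; // text-only page: skip the second render
    const cw = Math.round(pageWidthPt * imgScale);
    const ch = Math.round(pageHeightPt * imgScale);
    // Assumption: RGBA backing store, 4 bytes per pixel.
    return { cw, ch, bytes: cw * ch * 4 };
}

// US Letter (612 x 792 pt) with three images:
// imageRenderPlan(612, 792, 3) -> { cw: 2448, ch: 3168, bytes: 31021056 }
```

Roughly 31 MB per Letter page is a reasonable argument for keeping the imageMeta.length > 0 check.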

The Coordinate Transform

Image bboxes come from ctmAdapter.js. When the operator list hits a paintImageXObject, the current CTM is applied to the unit square [0,0]→[1,1] to get the placement rectangle in viewport space:

const [x1, y1] = toViewport(0, 0);
const [x2, y2] = toViewport(1, 1);
imageMeta.push({ id: imgId, bbox: {
    x: Math.min(x1, x2),
    y: Math.min(y1, y2),
    w: Math.abs(x2 - x1),
    h: Math.abs(y2 - y1),
}});
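
The toViewport helper is not shown in this post. As a rough sketch of what such a transform involves, the code below composes a CTM with a viewport transform (both as 6-element [a, b, c, d, e, f] arrays) and maps the unit square. The names `multiply` and `bboxFromCtm` are hypothetical, and using all four corners instead of two is a variation on the snippet above: it keeps the bbox correct when the image is rotated or skewed.

```javascript
// Compose two affine matrices [a, b, c, d, e, f]; the result applies m2 first, then m1.
function multiply(m1, m2) {
    return [
        m1[0] * m2[0] + m1[2] * m2[1],
        m1[1] * m2[0] + m1[3] * m2[1],
        m1[0] * m2[2] + m1[2] * m2[3],
        m1[1] * m2[2] + m1[3] * m2[3],
        m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
        m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
    ];
}

// Hypothetical helper: bounding box of the unit square under viewportTransform * ctm.
function bboxFromCtm(ctm, viewportTransform) {
    const m = multiply(viewportTransform, ctm);
    const corners = [[0, 0], [1, 0], [0, 1], [1, 1]];
    const xs = corners.map(([u, v]) => m[0] * u + m[2] * v + m[4]);
    const ys = corners.map(([u, v]) => m[1] * u + m[3] * v + m[5]);
    const x = Math.min(...xs);
    const y = Math.min(...ys);
    return { x, y, w: Math.max(...xs) - x, h: Math.max(...ys) - y };
}

// A 100x50 pt image at (10, 20) on a 792 pt tall page, viewport scale 2.0 with y flipped:
// bboxFromCtm([100, 0, 0, 50, 10, 20], [2, 0, 0, -2, 0, 1584])
//   -> { x: 20, y: 1444, w: 200, h: 100 }
```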

These coordinates are in 2.0-scale viewport space. To crop from the 4.0-scale canvas:

const { x, y, w, h } = meta.bbox;
const sx = Math.max(0, Math.round(x * upRatio));   // x * 2.0
const sy = Math.max(0, Math.round(y * upRatio));   // y * 2.0
const sw = Math.min(Math.round(w * upRatio), cw - sx);
const sh = Math.min(Math.round(h * upRatio), ch - sy);

The crop is clamped to canvas bounds (cw - sx) to handle bboxes that extend slightly past the page edge due to floating-point CTM rounding.
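
The same clamping can be packaged as a small pure function. This is a sketch with one deliberate difference from the snippet above: it clamps the rectangle's edges instead of its width, so a bbox that starts left of or above the canvas shrinks instead of shifting, and a bbox that lies entirely off-canvas returns null instead of a zero or negative size. `cropRect` is a hypothetical name.

```javascript
// Hypothetical helper: convert a 2.0-scale bbox to a clamped crop rect on the 4.0-scale canvas.
// Returns null when nothing of the bbox lies inside the canvas.
function cropRect(bbox, upRatio, cw, ch) {
    const left   = Math.max(0, Math.round(bbox.x * upRatio));
    const top    = Math.max(0, Math.round(bbox.y * upRatio));
    const right  = Math.min(cw, Math.round((bbox.x + bbox.w) * upRatio));
    const bottom = Math.min(ch, Math.round((bbox.y + bbox.h) * upRatio));
    const sw = right - left;
    const sh = bottom - top;
    if (sw <= 0 || sh <= 0) return null; // empty crop: caller should skip this image
    return { sx: left, sy: top, sw, sh };
}

// cropRect({ x: 1200, y: 0, w: 50, h: 10 }, 2.0, 2448, 3168)
//   -> { sx: 2400, sy: 0, sw: 48, sh: 20 }   (width clamped at the right page edge)
```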

The Crop and Encode Path

const crop = new OffscreenCanvas(sw, sh);
crop.getContext('2d').drawImage(pageCanvas, sx, sy, sw, sh, 0, 0, sw, sh);
const blob = await crop.convertToBlob({ type: 'image/png' });
const arr = new Uint8Array(await blob.arrayBuffer());
let binary = '';
for (let b = 0; b < arr.length; b += 8192) {
    binary += String.fromCharCode(...arr.subarray(b, b + 8192));
}
extractedImages[meta.id] = {
    dataUrl: 'data:image/png;base64,' + btoa(binary),
    pw: sw,  // pixel width at IMG_SCALE: used for natural display size
    ph: sh,
};

The 8192-byte chunk loop avoids String.fromCharCode(...arr) blowing the call stack on large arrays.

pw and ph are stored alongside the data URL. These are the actual crop dimensions, which are the ground truth for display size.

Displaying at Natural Size

The bbox from ctmAdapter is the destination rectangle, which is where the image gets painted on the page. It is not the native pixel dimensions of the image content. For an image that was scaled up 3x by the PDF layout engine, the bbox will be 3x larger than the natural content.

Using the bbox aspect ratio to create a sized container produces letterboxing: the container is shaped like the destination rectangle, but the image content fills only part of it.

The correct model: divide the crop pixel dimensions by IMG_SCALE (4.0) to get the size in CSS pixels at scale 1.0, then let the browser handle scaling from there:

const natW = Math.round(imgEntry.pw / 4);
const natH = Math.round(imgEntry.ph / 4);
<img class="extracted-pdf-image"
     src="${dataUrl}"
     width="${natW}"
     height="${natH}"
     alt="PDF Image ${region.id}"
     style="max-width: 100%; height: auto; display: block;">

No object-fit. No wrapper with fixed aspect-ratio. The crop pixel dimensions are already the natural proportions of the image content. max-width: 100%; height: auto handles the case where the image is wider than its container.
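
Put together, the sizing and markup step could be sketched as below. The helper names (`naturalSize`, `escapeAttr`, `imageMarkup`) are hypothetical, and the attribute escaping is an addition that the original template does not show. It assumes an entry shaped like the extractedImages values above: { dataUrl, pw, ph }.

```javascript
// Convert crop pixels (at IMG_SCALE) to CSS pixels at scale 1.0.
function naturalSize(pw, ph, imgScale = 4.0) {
    return { natW: Math.round(pw / imgScale), natH: Math.round(ph / imgScale) };
}

// Minimal attribute escaping so ids or URLs cannot break out of the attribute.
function escapeAttr(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/"/g, '&quot;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

// Hypothetical helper: build the <img> markup for one extracted image.
function imageMarkup(imgEntry, regionId, imgScale = 4.0) {
    const { natW, natH } = naturalSize(imgEntry.pw, imgEntry.ph, imgScale);
    return `<img class="extracted-pdf-image" src="${escapeAttr(imgEntry.dataUrl)}" ` +
        `width="${natW}" height="${natH}" alt="PDF Image ${escapeAttr(regionId)}" ` +
        `style="max-width: 100%; height: auto; display: block;">`;
}
```

A 400x200 crop at IMG_SCALE 4.0 therefore displays at 100x50 CSS pixels unless its container is narrower.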

Cluster Merging for Fragmented Figures

PDFs with complex figures can emit hundreds of paintImageXObject calls for a single visual. Chart axes, legend icons, and embedded glyphs are each a separate XObject. Passing each through the region pipeline produces hundreds of regions per figure.

The fix in contextClassifier.js runs before region assignment:

1. Filter images smaller than 20x20 viewport pixels (decorative elements, tick marks).
2. Sort remaining images by Y then X position.
3. Merge images within 8px gap into composite bounding clusters.
4. Each cluster becomes one IMAGE region, using the first-encountered image's ID as the representative key for the extractedImages map.

const cluster = mergedImages.find(c =>
    x <= c.right  + MERGE_GAP &&
    right  >= c.x - MERGE_GAP &&
    y <= c.bottom + MERGE_GAP &&
    bottom >= c.y - MERGE_GAP
);
if (cluster) {
    cluster.x      = Math.min(cluster.x,      x);
    cluster.y      = Math.min(cluster.y,      y);
    cluster.right  = Math.max(cluster.right,  right);
    cluster.bottom = Math.max(cluster.bottom, bottom);
}

The composite cluster bbox spans the full figure. The crop from that bbox captures the entire chart or diagram as one image.
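
For reference, the filter, sort, and merge steps can be sketched end to end as follows. This is not the code from contextClassifier.js: `clusterImages` is a hypothetical function, the input is assumed to be the imageMeta array of { id, bbox: { x, y, w, h } } entries, and "smaller than 20x20" is interpreted here as both dimensions being under the threshold.

```javascript
// Hypothetical sketch of the filter / sort / merge pass over imageMeta entries.
function clusterImages(images, { minSize = 20, gap = 8 } = {}) {
    const kept = images
        // Assumption: drop an image only when BOTH dimensions are below minSize.
        .filter((img) => !(img.bbox.w < minSize && img.bbox.h < minSize))
        // Sort by Y, then X (filter returned a copy, so the input is not mutated).
        .sort((a, b) => a.bbox.y - b.bbox.y || a.bbox.x - b.bbox.x);

    const clusters = [];
    for (const { id, bbox } of kept) {
        const x = bbox.x;
        const y = bbox.y;
        const right = x + bbox.w;
        const bottom = y + bbox.h;
        // First cluster whose gap-expanded box overlaps this image.
        const cluster = clusters.find((c) =>
            x <= c.right + gap && right >= c.x - gap &&
            y <= c.bottom + gap && bottom >= c.y - gap
        );
        if (cluster) {
            cluster.x = Math.min(cluster.x, x);
            cluster.y = Math.min(cluster.y, y);
            cluster.right = Math.max(cluster.right, right);
            cluster.bottom = Math.max(cluster.bottom, bottom);
        } else {
            // The first image of a new cluster supplies the representative id.
            clusters.push({ id, x, y, right, bottom });
        }
    }
    return clusters;
}
```

One limitation of a single pass like this: two clusters that only become adjacent after later images expand them are not merged with each other. A second pass over the clusters would handle that case if it turns out to matter.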

Tradeoffs

The canvas-crop approach composites the full page before cropping. Theoretically, text or other graphics overlapping the image bbox could bleed into the crop. In practice, PDF layout engines rarely paint text over raster images: overlapping content is the exception, not the rule.

Direct access to XObject pixel data remains impossible without running the rendering pipeline. The render-and-crop approach is the correct model for browser-side PDF image extraction.
