Engineering Journal
Pdf Processor

Why Your PDF.js Image Extraction Silently Returns Nothing (and How to Fix It)

2026-05-11

I built a complete image extraction pipeline for my browser-native PDF processor. ctmAdapter caught every paintImageXObject operator. geometryWorker passed image data through postMessage. fileUpload wrote Blobs to IndexedDB. pageAssembler emitted semantic <img data-img-id> placeholders. hydrateImages read them back and injected blob: URLs.

The preview showed dashed gray boxes. IndexedDB had zero records. No console errors.

Here is what was wrong, why the obvious fixes do not work, and what actually solves it.


The Problem: page.objs.get() Is Synchronous and Throws

The original extraction loop looked like this:

for (const meta of imageMeta) {
    try {
        const obj = await page.objs.get(meta.id);
        if (obj && obj.data && obj.width && obj.height) {
            // ... encode to PNG ...
        }
    } catch (e) {
        // Safe to ignore
    }
}

In PDF.js v4, page.objs is a PDFObjects instance. The get(id) method without a callback checks whether the object's internal PromiseCapability has been settled. If it has not, it throws:

Requesting object that isn't resolved yet: img_37.

This is a synchronous throw. The await does nothing here. The function throws before returning anything. The catch block fires. The image is skipped. This happens for every image on every page. The loop completes in milliseconds. Nothing is written to IDB. No error appears in the console because the catch swallows it.
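The failure mode is easy to reproduce without PDF.js at all. Here is a minimal sketch with a stand-in for PDFObjects (FakePDFObjects and extractAll are illustrative names, not PDF.js API) showing why `await` cannot save you from a synchronous throw:

```javascript
// Minimal stand-in for PDF.js's PDFObjects: get() throws synchronously
// when the object's capability has not been resolved yet.
class FakePDFObjects {
    constructor() { this.resolved = new Map(); }
    get(id) {
        if (!this.resolved.has(id)) {
            throw new Error(`Requesting object that isn't resolved yet: ${id}.`);
        }
        return this.resolved.get(id);
    }
}

async function extractAll(objs, imageMeta) {
    const extracted = {};
    for (const meta of imageMeta) {
        try {
            // The throw happens before any promise is created, so the
            // await never matters; the catch fires immediately.
            const obj = await objs.get(meta.id);
            extracted[meta.id] = obj;
        } catch (e) {
            // Swallowed, exactly like the original loop.
        }
    }
    return extracted;
}
```

Run this against a list of unresolved ids and it returns an empty object in microseconds, with nothing logged: the same silent failure the pipeline showed.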


Naive Fix 1: Use the Callback Form

page.objs.get() accepts an optional callback as the second argument:

const obj = await new Promise(resolve => page.objs.get(meta.id, resolve));

With a callback, instead of throwing, the method registers the callback against the internal PromiseCapability. When the object resolves, the callback fires. No throw.

This is correct API usage. But it does not help here.

The Promise never resolves. The callback is never called.

The image capability never resolves because PDF.js never loads the image pixel data in the first place.
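You can confirm the hang empirically by racing the callback form against a timeout. This `getWithTimeout` helper is a hypothetical diagnostic, not part of the pipeline:

```javascript
// Hypothetical helper: wrap the callback form in a Promise.race so a
// capability that never resolves surfaces as a timeout instead of a hang.
function getWithTimeout(objs, id, ms = 2000) {
    // Callback form: registers a listener instead of throwing.
    const resolved = new Promise(resolve => objs.get(id, resolve));
    const timeout = new Promise(resolve =>
        setTimeout(() => resolve(Symbol.for('timeout')), ms)
    );
    return Promise.race([resolved, timeout]);
}
```

Against a page that was never rendered, every call resolves to the timeout sentinel, which is how you distinguish "slow" from "never".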


Why PDF.js Doesn't Load Image Pixels Without a Render

PDF.js processes pages lazily. page.getOperatorList() reads the page's content stream and produces a list of operations. When it encounters paintImageXObject img_37, it records the operation and notes the XObject dependency. But it does not load the actual pixel data.

The pixel data loading is triggered by page.render(). When you render, PDF.js fetches each referenced XObject from the PDF's stream, decodes it (JPEG, PNG, JBIG2, CCITT, whatever), and pushes the decoded pixel array into page.objs. Once render completes, page.objs.get(id) works synchronously.

Without a render, the pixel data never arrives. The callback form registers a listener but the listener fires when the capability resolves, and the capability only resolves when PDF.js actually loads the data, which only happens on render.

getOperatorList() resolves → operators are known → dependencies noted
                                                  ↓
                                    image pixels NOT yet in page.objs

page.render() runs → PDF.js fetches each XObject → decodes pixels
                                                  ↓
                              page.objs.resolve(id, data)
                                                  ↓
                              page.objs.get(id) now works


Naive Fix 2: Render and Crop

The straightforward response: render the page, then crop each image region using the bounding boxes from ctmAdapter.

const pageCanvas = new OffscreenCanvas(Math.round(viewport.width), Math.round(viewport.height));
await page.render({ canvasContext: pageCanvas.getContext('2d'), viewport }).promise;

for (const meta of imageMeta) {
    const { x, y, w: iw, h: ih } = meta.bbox;
    const imgCanvas = new OffscreenCanvas(Math.round(iw), Math.round(ih));
    imgCanvas.getContext('2d').drawImage(
        pageCanvas,
        Math.round(x), Math.round(y), Math.round(iw), Math.round(ih),
        0, 0, Math.round(iw), Math.round(ih)
    );
    extractedImages[meta.id] = await imgCanvas.convertToBlob({ type: 'image/png' });
}

This works. Images appear in IDB. The preview hydrates.

But the result is wrong in subtle ways:

  1. Content bleed. The crop is a screenshot of a region of the rendered page. Any text, border, or graphic that overlaps the image bounding box is baked into the result. A caption that sits 3 pixels below the figure's bbox gets included if the bbox calculation is slightly off. A table border that grazes the corner appears in the output.
  2. Composite rendering. The cropped pixels reflect the full PDF rendering stack: color space transforms, masking, compositing, transparency. You are not getting the source image. You are getting what the PDF renderer decided to paint at that coordinate.
  3. Scale dependency. The crop dimensions are viewport-space pixels. The native image may be 2480 x 3508 but the crop gives you Math.round(bbox.w) pixels at the viewport scale. You lose native resolution.

The Correct Fix: Render to Populate, Then Read from page.objs

The render is not the mistake. The crop is.

Use render solely to populate page.objs. After render, read each image's native pixel data directly from page.objs.get(id) synchronously. Encode that to a Blob at native resolution.

const extractedImages = {};
if (imageMeta.length > 0 && typeof OffscreenCanvas !== 'undefined') {
    try {
        // Render to a throwaway canvas — we only need the side effect of
        // populating page.objs with decoded pixel data for each image XObject.
        const pageCanvas = new OffscreenCanvas(
            Math.round(viewport.width),
            Math.round(viewport.height)
        );
        await page.render({ canvasContext: pageCanvas.getContext('2d'), viewport }).promise;

        for (const meta of imageMeta) {
            try {
                const obj = page.objs.get(meta.id); // safe now — render populated objs
                if (!obj || !obj.data || !obj.width || !obj.height) continue;

                const { width: w, height: h, data } = obj;
                const rgba = new Uint8ClampedArray(w * h * 4);

                if (data.length === w * h * 4) {
                    // Already RGBA
                    rgba.set(data);
                } else if (data.length === w * h * 3) {
                    // RGB — expand with full alpha
                    for (let i = 0, j = 0; i < data.length; i += 3, j += 4) {
                        rgba[j] = data[i];
                        rgba[j + 1] = data[i + 1];
                        rgba[j + 2] = data[i + 2];
                        rgba[j + 3] = 255;
                    }
                } else if (data.length === w * h) {
                    // Grayscale — replicate to RGB channels
                    for (let i = 0, j = 0; i < data.length; i++, j += 4) {
                        rgba[j] = rgba[j + 1] = rgba[j + 2] = data[i];
                        rgba[j + 3] = 255;
                    }
                } else {
                    continue;
                }

                const imgCanvas = new OffscreenCanvas(w, h);
                imgCanvas.getContext('2d').putImageData(new ImageData(rgba, w, h), 0, 0);
                extractedImages[meta.id] = await imgCanvas.convertToBlob({ type: 'image/png' });
            } catch (_) {
                // Skip images that fail to decode or encode
            }
        }
    } catch (e) {
        console.warn('[geometryWorker] image extraction failed:', e.message);
    }
}
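The channel-expansion branch is the part most likely to harbor an off-by-one. Pulled out as a pure function over typed arrays (a sketch, not the worker's actual code), it can be unit-tested without a canvas or a worker:

```javascript
// Sketch: the grayscale / RGB / RGBA normalization as a pure function,
// independent of OffscreenCanvas. Returns null for unknown layouts.
function toRGBA(data, w, h) {
    const rgba = new Uint8ClampedArray(w * h * 4);
    if (data.length === w * h * 4) {
        rgba.set(data);                      // already RGBA
    } else if (data.length === w * h * 3) {
        for (let i = 0, j = 0; i < data.length; i += 3, j += 4) {
            rgba[j] = data[i];               // R
            rgba[j + 1] = data[i + 1];       // G
            rgba[j + 2] = data[i + 2];       // B
            rgba[j + 3] = 255;               // opaque alpha
        }
    } else if (data.length === w * h) {
        for (let i = 0, j = 0; i < data.length; i++, j += 4) {
            rgba[j] = rgba[j + 1] = rgba[j + 2] = data[i]; // gray replicated
            rgba[j + 3] = 255;
        }
    } else {
        return null;                         // unknown layout, skip
    }
    return rgba;
}
```

Three input layouts, one output layout, and a null for anything else: the same contract the worker loop enforces with its `continue`.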

What you get:

  1. Native resolution. The Blob is encoded from the decoded XObject at its stored width and height, not at viewport scale.
  2. No content bleed. Only the image's own pixels are read; overlapping text, captions, and borders never enter the output.
  3. Source pixels. The data comes from the decoded XObject, not from whatever the renderer composited at that coordinate.

What About Vector Graphics?

A common question when extracting images from PDFs: does this rasterize vector diagrams?

No. And the reason is structural.

In a PDF, paintImageXObject refers to a raster bitmap: a JPEG (DCTDecode) stream, a flate-compressed bitmap, a JBIG2 or CCITT fax image. Pixel data with explicit width and height stored in the PDF's XObject dictionary.

Vector graphics — paths, curves, fills, strokes — are encoded as content stream operators: m (moveTo), l (lineTo), c (bezierCurveTo), f (fill), S (stroke). These are not XObjects. They do not appear in imageMeta. The ctmAdapter only catches paintImageXObject, paintImageMaskXObject, and paintJpegXObject.

If a PDF has a vector chart drawn with path operators, it has no presence in imageMeta. It flows through the separate segment extraction path. It is classified as a region type based on its bounding box and line density. It is not touched by the image extraction pipeline at all.

The image pipeline is strictly for raster XObjects. Vector content is strictly for the segment pipeline. They do not intersect.


The Complete Data Flow After the Fix

geometryWorker:
  getOperatorList() + getTextContent()   ← operator list (no pixel data yet)
  render() to throwaway OffscreenCanvas  ← forces page.objs population
  for each imageMeta:
    page.objs.get(id)                    ← synchronous, safe after render
    build RGBA array                     ← handle grayscale / RGB / RGBA
    OffscreenCanvas → convertToBlob()    ← native resolution PNG Blob
  postMessage({ type: 'page', images })  ← Blobs transferred to main thread

fileUpload:
  Object.assign(allImages, msg.images)   ← accumulate across pages
  on complete: saveImages(allImages)     ← write Blobs to IndexedDB

imageStore.saveImages:
  for [id, blob] of entries: store.put(blob, id)  ← Blobs stored directly, no conversion

fileUpload after applyHtmlEverywhere:
  hydrateImages(el) on html-preview      ← img[data-img-id] → blob: URL
  hydrateImages(el) on visual-diff-html  ← same for diff surface

The key invariant: images only leave the worker as Blobs. IDB stores Blobs. hydrateImages reads Blobs and creates blob: object URLs. At no point are data URIs constructed on the hot path. The FileReader round-trip that was in the original code is gone.


Tradeoffs

Render overhead. Every page with at least one image now renders twice: once for the actual geometry worker pipeline, and once (the throwaway render) to populate page.objs. For image-heavy PDFs this adds meaningful time. The alternative is a dedicated rendering pass that serves both purposes, but that requires merging the render result with the geometry extraction, which complicates the worker architecture. For now the overhead is acceptable.

OffscreenCanvas availability. The entire image path is guarded by typeof OffscreenCanvas !== 'undefined'. If the browser does not support OffscreenCanvas in a worker context, image extraction is silently skipped. All text, tables, headings, and lists still work. Images produce empty placeholders. This is the correct graceful degradation for a privacy-first, no-backend tool.

Format homogeneity. All images are exported as PNG regardless of their source encoding. A JPEG embedded in the PDF becomes a PNG in the output. This increases file size for photographic content. Future improvement: check for paintJpegXObject and pass the raw JPEG bytes through without re-encoding. PDF.js provides access to the raw stream for JPEG XObjects.
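The pass-through itself is simple once the raw bytes are in hand; the open question is only how to pull them out of PDF.js, which this sketch deliberately leaves out. `wrapJpegBytes` is a hypothetical helper, and the SOI check is just a sanity guard:

```javascript
// Sketch of the JPEG pass-through, assuming the raw bytes of a
// DCTDecode stream have already been obtained. A JPEG stream starts
// with the SOI marker 0xFF 0xD8; anything else is rejected.
function wrapJpegBytes(bytes) {
    const isJpeg = bytes.length > 2 && bytes[0] === 0xff && bytes[1] === 0xd8;
    if (!isJpeg) return null;
    // Store the bytes as-is: no decode, no canvas, no PNG re-encode.
    return new Blob([bytes], { type: 'image/jpeg' });
}
```

Because the Blob carries its MIME type, hydrateImages would not need to change: the blob: URL serves JPEG or PNG transparently.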
