2026-07-11

Hot Take: PDF Image Extraction Is a Render Problem, Not a Decode Problem

TLDR: Every blog post and Stack Overflow answer about extracting images from PDFs in the browser treats it as a decoding problem: "how do I get the JPEG/PNG bytes out of the PDF?" That is the wrong question. The right question is "where in the rendering pipeline can I intercept pixel-complete image data?" The answer is always: after the page renders to canvas.

The Framing Everyone Gets Wrong

Search for "extract images from PDF JavaScript" and you will find:

"Use pdf-lib to iterate over XObjects and extract the raw image streams"
"Use pdfjs-dist and read from page.objs after getOperatorList"
"There's no built-in way, you need a server-side library"

All three answers share the same framing: images are resources embedded in the PDF file that you extract like files from a ZIP archive.

This framing is wrong for the browser.

PDF images are paint commands. An image XObject is a stream, listed in the page's resources, that says "here are some compressed pixels." The page's content stream says "paint this XObject under the current transformation matrix (CTM)." The rendered result is those pixels composited onto the page canvas at the position and scale the CTM specifies.
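To make "paint this XObject at this CTM" concrete, here is a minimal sketch of recovering image footprints by replaying the matrix operators. It assumes the { fnArray, argsArray } shape that PDF.js's page.getOperatorList() resolves to, takes the opcodes as a parameter (in real code you would pass pdfjsLib.OPS), and starts from the viewport transform. The names collectImagePlacements, Placement, and OpCodes are mine, not part of PDF.js, and the sketch ignores inline images, image masks, and form XObjects.

```typescript
type Matrix = [number, number, number, number, number, number];
interface Rect { x: number; y: number; width: number; height: number }
interface Placement { id: string; rect: Rect }

// Assumed shape of the operator list (what page.getOperatorList() resolves to).
interface OperatorList { fnArray: number[]; argsArray: (unknown[] | null)[] }

// Opcodes are injected rather than hard-coded; pdfjsLib.OPS has these keys.
interface OpCodes { save: number; restore: number; transform: number; paintImageXObject: number }

// Concatenate two matrices: the result applies `m` first, then `ctm`.
function concat(ctm: Matrix, m: Matrix): Matrix {
  return [
    ctm[0] * m[0] + ctm[2] * m[1],
    ctm[1] * m[0] + ctm[3] * m[1],
    ctm[0] * m[2] + ctm[2] * m[3],
    ctm[1] * m[2] + ctm[3] * m[3],
    ctm[0] * m[4] + ctm[2] * m[5] + ctm[4],
    ctm[1] * m[4] + ctm[3] * m[5] + ctm[5],
  ];
}

// An image is painted into the unit square, so its footprint is the
// bounding box of that square's four corners after transformation.
function unitSquareBounds(m: Matrix): Rect {
  const corners = [[0, 0], [1, 0], [0, 1], [1, 1]];
  const xs = corners.map(([x, y]) => m[0] * x + m[2] * y + m[4]);
  const ys = corners.map(([x, y]) => m[1] * x + m[3] * y + m[5]);
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  return { x: minX, y: minY, width: Math.max(...xs) - minX, height: Math.max(...ys) - minY };
}

// Replay save/restore/transform to track the CTM, and record one rectangle
// per paintImageXObject call.
function collectImagePlacements(list: OperatorList, ops: OpCodes, initial: Matrix): Placement[] {
  let ctm: Matrix = initial;
  const stack: Matrix[] = [];
  const placements: Placement[] = [];
  list.fnArray.forEach((fn, i) => {
    const args = list.argsArray[i];
    if (fn === ops.save) {
      stack.push(ctm);
    } else if (fn === ops.restore) {
      ctm = stack.pop() ?? ctm;
    } else if (fn === ops.transform && args) {
      ctm = concat(ctm, args.slice(0, 6) as Matrix);
    } else if (fn === ops.paintImageXObject && args) {
      placements.push({ id: String(args[0]), rect: unitSquareBounds(ctm) });
    }
  });
  return placements;
}

// Demo with placeholder opcodes (not the real PDF.js values) on a 792 pt
// tall page at scale 2, whose viewport transform is [2, 0, 0, -2, 0, 1584].
const demoOps: OpCodes = { save: 1, restore: 2, transform: 3, paintImageXObject: 4 };
const demoList: OperatorList = {
  fnArray: [1, 3, 4, 2, 4],
  argsArray: [null, [200, 0, 0, 100, 50, 300], ["img_a", 640, 320], null, ["img_b", 1, 1]],
};
const placements = collectImagePlacements(demoList, demoOps, [2, 0, 0, -2, 0, 1584]);
// placements[0].rect is { x: 100, y: 784, width: 400, height: 200 }
```

The rectangles come out in canvas pixels with the Y axis already flipped, because the starting matrix is the viewport transform rather than the identity.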

In a browser, you are not operating at the file format level. You are operating at the rendering level. You have page.render(). Use it.

Why the "Get the Bytes" Approach Fails

page.objs sounds like an accessible cache of image data. The reality: PDF.js only populates page.objs during the rendering pipeline. If you haven't called page.render(), the image objects are not there yet. You will get "Requesting object that isn't resolved yet" if you call .get(id) synchronously.

The callback form page.objs.get(id, callback) works in theory. PDF.js calls the callback when the object resolves. But "when the object resolves" means during a render context. Without a render, the callback never fires.

The underlying issue: PDF.js is not a PDF parser that exposes raw resources. It is a PDF renderer whose internal state you can observe. You don't get image data by asking for it before rendering. You get it as a consequence of rendering.

The Render-and-Crop Model Is What Real Libraries Do

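pdf2htmlEX rasterizes the non-text content of each page into a background image and layers the text over it as HTML, rather than pulling out individual image streams. Adobe Acrobat's "Extract Images" tool appears to work from rendered page content too, though its internals are not public. Even Python's pdfplumber (which is built on pdfminer.six) points the same way: crop() only restricts a page to a bounding box, and getting actual pixels out of that crop means rendering it with to_image().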

The consensus engineering approach for browser-side image extraction is:

Render the page to a canvas at sufficient resolution.
Use the image placement bboxes (from the operator list or CTM) to identify crop rectangles.
Draw each crop region to a new canvas and encode as PNG/JPEG.

This is what the geometry worker now does. The only non-obvious piece was that you need to replace PDF.js's canvas factory with an OffscreenCanvas implementation to make this work inside a Web Worker.
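The crop step can be sketched as follows. This is an illustration under assumptions, not the geometry worker's actual code: CanvasLike and Context2DLike are structural stand-ins I define so the example type-checks without DOM typings, and cropRegion and clampToCanvas are hypothetical helper names.

```typescript
interface Rect { x: number; y: number; width: number; height: number }

// Minimal stand-ins for the canvas members used below. OffscreenCanvas and
// HTMLCanvasElement both provide width, height, and a 2D context with drawImage.
interface Context2DLike {
  drawImage(
    source: unknown,
    sx: number, sy: number, sw: number, sh: number,
    dx: number, dy: number, dw: number, dh: number,
  ): void;
}
interface CanvasLike {
  width: number;
  height: number;
  getContext(kind: "2d"): Context2DLike | null;
}

// Snap a rectangle outward to whole pixels and clip it to the canvas bounds.
function clampToCanvas(rect: Rect, canvas: { width: number; height: number }): Rect {
  const left = Math.max(0, Math.floor(rect.x));
  const top = Math.max(0, Math.floor(rect.y));
  const right = Math.min(canvas.width, Math.ceil(rect.x + rect.width));
  const bottom = Math.min(canvas.height, Math.ceil(rect.y + rect.height));
  return { x: left, y: top, width: Math.max(0, right - left), height: Math.max(0, bottom - top) };
}

// Copy one region of the rendered page onto a fresh canvas of exactly that
// size. Returns null when the region lies entirely off the page.
function cropRegion(
  page: CanvasLike,
  rect: Rect,
  makeCanvas: (width: number, height: number) => CanvasLike,
): CanvasLike | null {
  const r = clampToCanvas(rect, page);
  if (r.width === 0 || r.height === 0) return null;
  const out = makeCanvas(r.width, r.height);
  const ctx = out.getContext("2d");
  if (!ctx) return null;
  ctx.drawImage(page, r.x, r.y, r.width, r.height, 0, 0, r.width, r.height);
  return out;
}
```

Inside a worker, makeCanvas would be (w, h) => new OffscreenCanvas(w, h), and the returned canvas can be encoded with its convertToBlob({ type: "image/png" }) method. Injecting the factory also keeps the function testable without a real canvas.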

Two Scales Are Not Overengineering

The objection I'd expect: "rendering the page twice per image-containing page is wasteful."

It isn't. The geometry pipeline runs at 2x scale and produces coordinates for text items, table lines, and image bboxes. All of that math is calibrated to 2x. Raising the global scale to 4x to get sharper images would require recalibrating every threshold in the column detection, table detection, and paragraph gap detection, all of which are expressed in viewport pixels.

Instead, the image extraction step renders a separate 4x canvas. For pages with no images, this code path never runs. For pages with images, you pay one extra render per page. Because a PDF.js viewport at scale 1 maps one PDF point (1/72 inch) to one pixel, the result is 288 DPI crops instead of 144 DPI crops, producing images that are actually usable at their displayed size.

The alternative (one render, one scale for everything) produces either blurry images or miscalibrated text thresholds. Two scales with different purposes is the correct model.
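Keeping two scales means every bbox measured on the 2x geometry canvas has to be re-expressed in 4x pixels before cropping. A minimal sketch of that conversion, assuming both renders cover the same page with the same origin; rescaleRect, GEOMETRY_SCALE, and IMAGE_SCALE are names I made up for the example.

```typescript
interface Rect { x: number; y: number; width: number; height: number }

// Re-express a rectangle measured in pixels at one render scale in the
// pixel space of another render of the same page.
function rescaleRect(rect: Rect, fromScale: number, toScale: number): Rect {
  const k = toScale / fromScale;
  return { x: rect.x * k, y: rect.y * k, width: rect.width * k, height: rect.height * k };
}

const GEOMETRY_SCALE = 2; // scale the layout thresholds are calibrated to
const IMAGE_SCALE = 4;    // scale used only for the image crops

const bboxAt2x: Rect = { x: 100, y: 784, width: 400, height: 200 };
const bboxAt4x = rescaleRect(bboxAt2x, GEOMETRY_SCALE, IMAGE_SCALE);
// bboxAt4x is { x: 200, y: 1568, width: 800, height: 400 }
```

The layout thresholds never see the 4x numbers; only the crop rectangles cross from one scale to the other.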

The Fragmentation Problem Is Underestimated

731 image regions per page. That's what happens when you naively map paintImageXObject operators to regions. A single data visualization can contain dozens of XObject calls: axis markers, grid lines rendered as tiny image segments, legend icons, embedded raster photographs.

The right approach is to treat image regions the same way geographic mapping treats point clusters: aggregate nearby points into meaningful groupings. Filter noise (< 20px), merge adjacent items within a gap threshold, and emit one region per visual cluster.

This is not a workaround. This is the correct semantic interpretation. When a PDF designer created a chart, they created one chart. The fact that the authoring tool emitted 60 paintImageXObject calls to paint it is an implementation detail of the PDF generator, not a fact about the content.
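A minimal sketch of the filter-and-merge step, under two assumptions of mine: "noise" means a region smaller than 20px in both dimensions, and the merge gap defaults to 10px (an illustrative default, not a calibrated value). clusterRegions, isNear, and union are hypothetical names.

```typescript
interface Rect { x: number; y: number; width: number; height: number }

// True when the rectangles overlap or lie within `gap` pixels of each other.
function isNear(a: Rect, b: Rect, gap: number): boolean {
  return (
    a.x - gap <= b.x + b.width && b.x - gap <= a.x + a.width &&
    a.y - gap <= b.y + b.height && b.y - gap <= a.y + a.height
  );
}

// Smallest rectangle containing both inputs.
function union(a: Rect, b: Rect): Rect {
  const x = Math.min(a.x, b.x);
  const y = Math.min(a.y, b.y);
  return {
    x,
    y,
    width: Math.max(a.x + a.width, b.x + b.width) - x,
    height: Math.max(a.y + a.height, b.y + b.height) - y,
  };
}

function clusterRegions(rects: Rect[], minSize = 20, gap = 10): Rect[] {
  // Drop specks: keep a region only if at least one side reaches minSize.
  const regions = rects
    .filter((r) => r.width >= minSize || r.height >= minSize)
    .map((r) => ({ ...r }));
  // Fold any two nearby regions into their union, restarting after each
  // merge, until no pair is close enough to merge.
  let changed = true;
  while (changed) {
    changed = false;
    for (let i = 0; i < regions.length && !changed; i++) {
      for (let j = i + 1; j < regions.length; j++) {
        if (isNear(regions[i], regions[j], gap)) {
          regions[i] = union(regions[i], regions[j]);
          regions.splice(j, 1);
          changed = true;
          break;
        }
      }
    }
  }
  return regions;
}

const clustered = clusterRegions([
  { x: 0, y: 0, width: 100, height: 100 },    // chart, left half
  { x: 105, y: 0, width: 100, height: 100 },  // chart, right half (5px away)
  { x: 500, y: 500, width: 50, height: 50 },  // unrelated photo
  { x: 300, y: 300, width: 5, height: 5 },    // speck, filtered out
]);
// clustered has two regions; the first is { x: 0, y: 0, width: 205, height: 100 }
```

The restart-after-merge loop is cubic in the worst case, which is tolerable for a few hundred regions per page; a spatial index would be the next step if it ever showed up in a profile.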

What This Tells You About PDF Extraction Generally

Every PDF extraction problem has the same shape: the file format represents content as paint commands, and your extractor needs to invert those paint commands into semantic content. That inversion is always a rendering problem, not a parsing problem.

Images: render, then crop. Tables: reconstruct the grid from the set of line segments that were drawn. Columns: find gaps in the X-coverage of text items. Headings: identify text items whose font size is larger than the most common (modal) font size of the document's body text.

None of these are solved by reading raw PDF bytes. All of them are solved by understanding what the rendering pipeline produces and working backward from there.
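Two of those inversions are small enough to sketch. This assumes text items have already been reduced to a position, a width, and a font size in viewport pixels; TextItem, findColumnGaps, bodyFontSize, and findHeadings are hypothetical names, and taking the mode by item count (rather than by character count) is a simplification of mine.

```typescript
interface TextItem { text: string; x: number; width: number; fontSize: number }
interface Gap { start: number; end: number }

// Columns: sweep the items' horizontal extents left to right and report
// every uncovered stretch at least `minGap` wide.
function findColumnGaps(items: TextItem[], minGap: number): Gap[] {
  const spans = items
    .map((item) => [item.x, item.x + item.width])
    .sort((a, b) => a[0] - b[0]);
  const gaps: Gap[] = [];
  let coveredTo = -Infinity;
  for (const [start, end] of spans) {
    if (coveredTo !== -Infinity && start - coveredTo >= minGap) {
      gaps.push({ start: coveredTo, end: start });
    }
    coveredTo = Math.max(coveredTo, end);
  }
  return gaps;
}

// Body size: the most frequent font size, rounded to one decimal place.
function bodyFontSize(items: TextItem[]): number {
  const counts: Record<string, number> = {};
  for (const item of items) {
    const key = String(Math.round(item.fontSize * 10) / 10);
    counts[key] = (counts[key] ?? 0) + 1;
  }
  let best = 0;
  let bestCount = 0;
  for (const key of Object.keys(counts)) {
    const count = counts[key] ?? 0;
    if (count > bestCount) {
      bestCount = count;
      best = Number(key);
    }
  }
  return best;
}

// Headings: anything set larger than the body size.
function findHeadings(items: TextItem[]): TextItem[] {
  const body = bodyFontSize(items);
  return items.filter((item) => item.fontSize > body);
}

const sampleItems: TextItem[] = [
  { text: "Title", x: 50, width: 150, fontSize: 18 },
  { text: "left 1", x: 50, width: 200, fontSize: 10 },
  { text: "left 2", x: 50, width: 200, fontSize: 10 },
  { text: "left 3", x: 50, width: 200, fontSize: 10 },
  { text: "right 1", x: 300, width: 200, fontSize: 10 },
  { text: "right 2", x: 300, width: 200, fontSize: 10 },
];
const columnGaps = findColumnGaps(sampleItems, 20); // one gap, from x = 250 to x = 300
const headings = findHeadings(sampleItems);         // just the "Title" item
```

Neither function looks at PDF bytes. Both consume the positions and sizes that the rendering pipeline reports, which is the point of the section above.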

The browser gives you page.render(). That is the starting point, not the last resort.
