Pdf Processor

Postmortem: Three Bugs That Were All the Same Mistake

2026-05-31

TLDR

Three bugs in the same session, three different symptoms, one structural cause: new code that ran alongside existing code but made different assumptions about the environment. The structured-clone assumption: serializing data is cheap. The step-ordering assumption: a detector can check for containing regions before those regions exist. The bytes assumption: a reference to input data remains valid after the data is transferred to a sub-worker.

Repo: tools/pdf-processor

The Assumption That Seemed Reasonable (Three Times)

Adding an observability layer to an existing pipeline means adding new output fields and new message types alongside existing ones. The new code runs in the same worker as the existing code. It has access to the same data. The assumption: since it is running alongside working code, the environment is already correct for it.

This assumption is wrong in at least three ways. All three surfaced in the same session.

Bug 1: Adding Large Data to a postMessage Blocks the Worker

The assumption: adding more fields to a postMessage payload is cost-free.

What happened: extraction appeared to hang indefinitely on any PDF with images. No error, no timeout, no response from the worker. The progress bar appeared and stayed there.

The actual cause: the new region manifest added an extractedImages field to every 'page' message. For a PDF page with a full-page illustration, this field contained 4-8 MB of base64-encoded PNG data per page. The structured clone algorithm does not throw on large objects; it completes. But serializing that much data for every page kept the worker's event loop busy for so long that the main thread's response handler never fired within any reasonable timeout.

The fix was two deletions that removed extractedImages from the postMessage payload. The HTML output already embedded images as inline data: URLs through the assembly step. The manifest only needed an image ID to link a region to its image, not the pixel data itself.
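A minimal sketch of what the slimmed-down message could look like. Only the extractedImages field and the 'page' message type come from the session; the toPageMessage helper, the page object shape, and the region fields (kind, bbox, imageId) are hypothetical names used for illustration.

```javascript
// Hypothetical shape: page.regions[] link to images by ID, and
// page.extractedImages[] holds the heavy base64 payloads.
function toPageMessage(page) {
  return {
    type: "page",
    pageNumber: page.pageNumber,
    html: page.html, // images are already embedded here as data: URLs
    regions: page.regions.map((region) => ({
      kind: region.kind,
      bbox: region.bbox,
      imageId: region.imageId ?? null, // an ID is enough to link region to image
    })),
    // extractedImages is deliberately not copied into the message
  };
}

// Sample data standing in for a real extracted page.
const samplePage = {
  pageNumber: 1,
  html: "<p>body text</p>",
  regions: [{ kind: "image", bbox: [0, 0, 100, 100], imageId: "img-1" }],
  extractedImages: [{ id: "img-1", base64: "A".repeat(1024) }],
};

const message = toPageMessage(samplePage);
console.log(Object.keys(message));
```

In a worker, the call site would be something like self.postMessage(toPageMessage(page)), so the heavy field never reaches the structured clone step.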

The lesson: postMessage with large payloads does not fail. It stalls. The symptom (infinite hang, no error) is indistinguishable from a worker crash until you measure the payload size.
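One way to make the payload size visible before it becomes a hang is a size check at the call site. The sketch below is an assumption-laden heuristic, not something from the repo: estimatePayloadBytes, checkPayload, and the 256 KB budget are all hypothetical, and the JSON length only approximates the cost for string-heavy payloads such as base64 (it is not meaningful for typed arrays).

```javascript
// Rough estimate: for mostly-ASCII payloads such as base64 strings, the JSON
// string length is close to the byte count. This is a heuristic, not the exact
// cost of the structured clone.
function estimatePayloadBytes(payload) {
  return JSON.stringify(payload).length;
}

const MAX_MESSAGE_BYTES = 256 * 1024; // assumed budget; tune per application

function checkPayload(payload, limit = MAX_MESSAGE_BYTES) {
  const bytes = estimatePayloadBytes(payload);
  return { bytes, ok: bytes <= limit };
}

const small = checkPayload({ type: "page", imageId: "img-1" });
const large = checkPayload({
  type: "page",
  extractedImages: ["A".repeat(4 * 1024 * 1024)], // about 4 MB of fake base64
});

console.log(small.ok, large.ok, large.bytes);
```

Logging or rejecting when ok is false turns a silent stall into a message that names the oversized payload.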

Bug 2: A Classification Stage That Ran Before Its Dependencies Existed

The assumption: a detector that checks "is this segment inside a known region?" can run at any point in the pipeline.

What happened: extracted HTML contained spurious horizontal rule elements scattered through paragraph text. On documents with any horizontal lines near body copy, dividers were appearing mid-paragraph.

The actual cause: the divider detection stage ran before text classification. At that point, the regions list only contained image, table, and box entries. Paragraph regions did not exist. The containment check, "is this segment already inside a known region?", always returned false for segments inside future-paragraph areas, so every horizontal line became a divider.

The fix was moving the divider detection call one position later in the orchestrator sequence, to after text classification had populated the regions list.
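The ordering problem can be reproduced in a few lines. This is a simplified sketch, assuming axis-aligned boxes and a single shared regions array; the function names (contains, detectDividers, classifyText) and the coordinate fields are hypothetical stand-ins for the real stages.

```javascript
// Is the horizontal segment fully inside the region's bounding box?
function contains(region, seg) {
  return (
    seg.x0 >= region.x0 && seg.x1 <= region.x1 &&
    seg.y >= region.y0 && seg.y <= region.y1
  );
}

// A horizontal segment becomes a divider only if no known region contains it.
function detectDividers(segments, regions) {
  return segments.filter((seg) => !regions.some((r) => contains(r, seg)));
}

// Stand-in for text classification: appends paragraph regions to the shared list.
function classifyText(regions, paragraphs) {
  regions.push(...paragraphs.map((p) => ({ kind: "paragraph", ...p })));
}

const segments = [{ x0: 10, x1: 90, y: 50 }]; // e.g. a rule inside body copy
const paragraphs = [{ x0: 0, x1: 100, y0: 40, y1: 60 }];

// Buggy order: dividers first, so paragraph regions do not exist yet.
const regionsBefore = [];
const buggy = detectDividers(segments, regionsBefore);

// Fixed order: classify text first, then detect dividers.
const regionsAfter = [];
classifyText(regionsAfter, paragraphs);
const fixed = detectDividers(segments, regionsAfter);

console.log(buggy.length, fixed.length);
```

The detector itself is identical in both runs; only the contents of the shared list at call time differ.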

The lesson: a correctness check on shared state is only correct if the state is fully populated. When a pipeline uses a mutable shared collection, every stage that reads it must document which stages it depends on having run first.
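One possible way to turn that documentation into something the orchestrator enforces is to have each stage declare its prerequisites. This is a sketch of the idea rather than the project's orchestrator; the stage names, the requires field, and runPipeline are assumptions.

```javascript
// Each stage names the stages that must already have populated shared state.
const stages = [
  { name: "detectImages", requires: [], run: (ctx) => ctx.regions.push({ kind: "image" }) },
  { name: "classifyText", requires: [], run: (ctx) => ctx.regions.push({ kind: "paragraph" }) },
  {
    name: "detectDividers",
    requires: ["classifyText"], // reads paragraph regions from ctx.regions
    run: (ctx) => ctx.regions.push({ kind: "divider" }),
  },
];

// Runs stages in the given order and fails loudly on a violated dependency.
function runPipeline(stageList, ctx) {
  const done = new Set();
  for (const stage of stageList) {
    const missing = stage.requires.filter((dep) => !done.has(dep));
    if (missing.length > 0) {
      throw new Error(`${stage.name} ran before: ${missing.join(", ")}`);
    }
    stage.run(ctx);
    done.add(stage.name);
  }
  return ctx;
}

const ok = runPipeline(stages, { regions: [] });

let orderError = null;
try {
  // Deliberately wrong order: dividers before text classification.
  runPipeline([stages[2], stages[1]], { regions: [] });
} catch (err) {
  orderError = err.message;
}

console.log(ok.regions.length, orderError);
```

With a check like this, reordering the orchestrator sequence produces an immediate error instead of spurious output.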

Bug 3: A Reference to Input Data That the Runtime Had Voided

The assumption: storing a reference to input bytes keeps them available for later use.

What happened: single-page re-extraction worked once. After loading a second document, clicking Re-extract produced no response. Silent failure.

The actual cause: pdfjsLib.getDocument({ data: bytes }) transfers the ArrayBuffer underlying the Uint8Array to a PDF.js sub-worker. JavaScript's postMessage with a transfer list detaches the buffer from the source thread: bytes.byteLength becomes 0 after the call returns. The worker had stored cachedBytes = bytes after the initial extraction. After the transfer, cachedBytes was a reference to a detached buffer. The second re-extract call received a zero-byte array, PDF.js threw "empty PDF" internally, and the error was swallowed by a try/catch that had no handler for the reprocess path.
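The detachment is easy to observe outside PDF.js. The sketch below uses structuredClone with a transfer option as a stand-in, since it applies the same transfer semantics as postMessage with a transfer list; it assumes a runtime that provides structuredClone (current browsers and recent Node.js versions). The variable names are illustrative.

```javascript
const bytes = new Uint8Array([0x25, 0x50, 0x44, 0x46]); // the bytes of "%PDF"
const cachedRef = bytes; // same view over the same underlying ArrayBuffer

const lengthBefore = cachedRef.byteLength;

// Transferring the buffer moves ownership to the clone and detaches the source.
const received = structuredClone(bytes, { transfer: [bytes.buffer] });

const lengthAfter = cachedRef.byteLength;

console.log(lengthBefore, lengthAfter, received.byteLength);
```

The cached reference does not throw when read after the transfer; it simply reports a length of zero, which is why the failure surfaced later and elsewhere.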

The fix was cachedBytes = bytes.slice() before the first getDocument call. Calling .slice() on a typed array creates an independent copy backed by its own buffer, unlike .subarray(), which returns a view over the original buffer.
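A sketch of the copy-before-transfer pattern, with an explicit failure when the cache is empty so the error cannot be swallowed silently. consumeByTransfer is a hypothetical stand-in for an API that transfers its input, as described above for getDocument; extract, reExtract, and cached are illustrative names, and the sketch assumes structuredClone is available.

```javascript
// Hypothetical stand-in for an API that transfers the buffer it is given.
function consumeByTransfer(data) {
  return structuredClone(data, { transfer: [data.buffer] });
}

let cached = null;

function extract(inputBytes) {
  cached = inputBytes.slice(); // independent copy, taken before the transfer
  return consumeByTransfer(inputBytes);
}

function reExtract() {
  if (cached === null || cached.byteLength === 0) {
    // Fail loudly instead of letting an empty buffer reach the parser.
    throw new Error("no cached bytes available for re-extraction");
  }
  // Hand over a fresh copy each time so the cache survives repeated calls.
  return consumeByTransfer(cached.slice());
}

const input = new Uint8Array([1, 2, 3]);
const first = extract(input);
const second = reExtract();
const third = reExtract();

console.log(input.byteLength, cached.byteLength, second.byteLength, third.byteLength);
```

The cost is one extra copy of the document in memory, which is the price of being able to re-run extraction without asking the user for the file again.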

The lesson: in the browser's concurrency model, handing a typed array to an API that transfers its underlying ArrayBuffer is destructive. That includes postMessage with a transfer list and libraries such as PDF.js that transfer on the caller's behalf; a plain postMessage without a transfer list copies instead. A reference to the source array after a transfer is a reference to an empty view. Copy before you transfer if you need the data again.

What Survived

All three bugs were in the new code, not in the existing pipeline. The classification algorithms, the assembly logic, the coordinate transforms, and the worker isolation model were all correct. None of them needed to change.

The new observability layer ran alongside all of that, and each of its three components made one wrong assumption about the environment it was in.

The Lesson

New code running in an existing environment inherits the environment's characteristics, not just its output. Before writing observability code for a pipeline, audit the existing pipeline's environmental constraints: what are the serialization limits? What are the data structure lifetime rules? What are the stage-ordering invariants? The existing code that works has already solved those constraints. New code that runs alongside it needs to solve them too.
