Engineering Journal
Pdf Processor

Postmortem: Three Bugs That Broke the Analysis Panel Before Anyone Could Use It

2026-05-31

TLDR

Three bugs, three root causes, three fixes. The extraction hang came from including large image data in every postMessage. The phantom divider flood came from running detection before the regions it was supposed to check against existed. The ArrayBuffer detach came from caching a reference to a Uint8Array whose underlying buffer is detached once it is transferred to the PDF.js sub-worker. None of them were obvious from static analysis. All three appeared only at runtime with real PDF files.

Bug 1: Extraction hangs forever with no error

Symptom: Loading a PDF caused the progress bar to appear and stay there indefinitely. No error in the console, no response from the worker, no timeout error.

Wrong hypothesis: The worker was crashing during region classification. Looked at the new classifyPage orchestrator, the PageGraph.build() call, the skip-set logic.

Actual root cause: extractedImages was included in the 'page' postMessage payload. The object contained base64 PNG crops at 4x scale, one entry per image region on the page. For a PDF with a full-page image, that meant 4–8 MB per page going through the structured clone algorithm on every single page message. The clone was not throwing. It completed, but it blocked the worker's event loop long enough to look like an infinite hang from the main thread's perspective.

The HTML output already embedded these images as inline data: URLs through assemblePage. There was no reason to re-transmit them. The region manifest only needed imageId to link a region to its image, not the pixel data.

Fix: Remove extractedImages from the postMessage. Two lines deleted.
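
A minimal sketch of what a slimmed-down 'page' message could look like. The helper names (buildPageMessage, approxPayloadChars) and the field names other than imageId are assumptions for illustration; the real payload shape is not shown in this post.

```javascript
// Hypothetical helper: builds the 'page' message without any pixel data.
// Only imageId comes from the post; the other field names are assumed.
function buildPageMessage(pageNumber, html, regions) {
  return {
    type: 'page',
    pageNumber,
    html, // the HTML already embeds the images as inline data: URLs
    // Copy only the lightweight fields; anything else on a region is dropped.
    regions: regions.map(({ kind, bbox, imageId }) => ({ kind, bbox, imageId })),
  };
}

// Rough size check (character count of the JSON form), usable as a debug
// guard before calling postMessage.
function approxPayloadChars(message) {
  return JSON.stringify(message).length;
}
```

Logging approxPayloadChars for each message during development would have surfaced the multi-megabyte payloads on the first image-heavy PDF.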

Why it wasn't caught earlier: The bug only manifests on PDFs with actual images. Text-only test PDFs worked fine because extractedImages was an empty object. The first image-heavy PDF loaded triggered it.

Bug 2: hr.pdf-divider appearing everywhere on every page

Symptom: The rendered HTML had <hr class="pdf-divider"> tags throughout the document — inside paragraphs, between sentences, in places where no visible rule existed in the original PDF.

Wrong hypothesis: The detectDividers function was matching segments it shouldn't — maybe the minimum length threshold (viewport.width * 0.15) was too low, or the nearText guard was failing.

Actual root cause: detectDividers ran at step 9 in the orchestrator, before _classifyBucket ran at step 11. The function's containment guard checked whether the segment's midpoint was inside any existing region:

if (regions.some(r => r.bbox && insideBBox(midX, midY, r.bbox, 5))) continue;

At step 9, regions contained only image, lattice table, stream table, and box regions. Paragraph, heading, and list regions didn't exist yet — they were created in step 11. So any H-segment sitting inside what would become a paragraph area passed the containment check and became a DIVIDER. On a text-heavy page, most of the inter-line spacing was inside future-paragraph bboxes. Every baseline-level H-segment that survived the underline check became a spurious divider.

The nearText guard should have caught these:

const nearText = textMeta.some(tm =>
    Math.abs(tm.vy - midY) < scale.S * 0.8 && ...
);
if (nearText) continue;

But S * 0.8 is a fairly tight tolerance. If the H-segment was more than 80% of a font size away from any text baseline — which is common in the inter-line space between two paragraphs — it passed this check too.

Fix: Move divider detection to step 11.5, after _classifyBucket completes. One block relocated. The detector now sees the full region list.
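
The ordering problem can be reproduced in a few lines. This is a simplified stand-in, not the real detector: the bbox shape ({ x0, y0, x1, y1 }), the segment shape, and the detectDividersSketch name are assumptions, and insideBBox is reimplemented here from its call signature in the guard quoted earlier.

```javascript
// Assumed bbox shape: { x0, y0, x1, y1 }. pad widens the box on all sides.
function insideBBox(x, y, bbox, pad) {
  return x >= bbox.x0 - pad && x <= bbox.x1 + pad &&
         y >= bbox.y0 - pad && y <= bbox.y1 + pad;
}

// Simplified detector: keeps only horizontal segments whose midpoint lies
// outside every region it is given. All other guards are omitted.
function detectDividersSketch(segments, regions) {
  return segments.filter(seg => {
    const midX = (seg.x0 + seg.x1) / 2;
    const midY = seg.y;
    return !regions.some(r => r.bbox && insideBBox(midX, midY, r.bbox, 5));
  });
}

const segment = { x0: 100, x1: 400, y: 250 };
const imageRegion = { type: 'image', bbox: { x0: 50, y0: 500, x1: 300, y1: 700 } };
const paragraphRegion = { type: 'paragraph', bbox: { x0: 80, y0: 200, x1: 520, y1: 300 } };

// Step 9 view: paragraph regions do not exist yet, so the segment survives.
const before = detectDividersSketch([segment], [imageRegion]);
// Step 11.5 view: the paragraph region now covers the segment's midpoint.
const after = detectDividersSketch([segment], [imageRegion, paragraphRegion]);

console.log(before.length, after.length); // 1 0
```

The guard itself is identical in both calls. Only the contents of the region list change the result.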

Why it wasn't caught earlier: The previous version of classifyPage ran divider detection at the same relative position in a single monolithic function. Because all the detection happened sequentially in one function, step ordering was implicit in the code flow. The refactor made the ordering explicit — and exposed that the assumed ordering was wrong.

Bug 3: ArrayBuffer detach kills reprocess on the second PDF

Symptom: Re-extract page worked once. After loading a second PDF, clicking Re-extract produced no response. The button stayed disabled. No error appeared.

Root cause (found and fixed by user): pdfjsLib.getDocument({ data: bytes }) internally transfers the ArrayBuffer underlying the Uint8Array to the PDF.js sub-worker. In JavaScript, transferring a buffer via postMessage detaches it in the source thread — bytes.byteLength becomes 0, and any attempt to read from it throws or returns an empty view.
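
The detach behavior is easy to see in isolation. In this sketch, structuredClone with a transfer list stands in for the transfer described above; it runs in any runtime that provides a global structuredClone (modern browsers, Node 17+) and does not involve PDF.js at all.

```javascript
const bytes = new Uint8Array([0x25, 0x50, 0x44, 0x46]); // "%PDF"
const cached = bytes.slice(); // independent copy backed by its own ArrayBuffer

// Transfer the original buffer, the same way a postMessage with a transfer
// list would. The source buffer is detached as a side effect.
const moved = structuredClone(bytes.buffer, { transfer: [bytes.buffer] });

console.log(bytes.byteLength);  // 0  (view over a detached buffer)
console.log(cached.byteLength); // 4  (the copy is unaffected)
console.log(moved.byteLength);  // 4  (the data now lives in the new buffer)
```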

The worker stored _cachedBytes = bytes after the initial extraction. After the transfer, _cachedBytes held a reference to the detached buffer. When _handleReprocess called pdfjsLib.getDocument({ data: _cachedBytes }), it received a zero-byte Uint8Array. PDF.js threw "empty PDF" internally. The error was caught by the try/catch in _handleReprocess, which posted { type: 'error' }. But fileUpload.js only checked for msg.reprocess on 'page' type messages — error messages fell through to the already-resolved main extraction promise's reject, which had no effect. Silent failure.

Three problems compounded:

  1. _cachedBytes was a direct reference, not a copy
  2. Error messages from reprocess weren't routed separately
  3. Error routing hit a dead-end promise

Fix: _cachedBytes = bytes.slice() before the pdfjsLib.getDocument call. .slice() creates an independent copy that doesn't share the underlying buffer. The user also added a cache-bytes message type for the case where the backend path was used for extraction. In that case, the local worker is never initialized, so _cachedBytes is never set. The new message type lets the main thread explicitly cache the bytes in the worker before re-extract is needed.
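
A rough sketch of how the worker side of this could be arranged. Apart from _cachedBytes and the 'cache-bytes' type, every name here is an assumption: the handler function, the 'reprocess', 'cached', and 'ready' message types, and the reprocess flag on errors are illustrative, and the flag is only one possible way to address the routing gap listed above, which the post does not describe a fix for.

```javascript
let _cachedBytes = null;

// Hypothetical worker-side handler. It returns the reply rather than calling
// postMessage so the logic can be exercised outside a worker.
function handleWorkerMessage(msg) {
  switch (msg.type) {
    case 'cache-bytes':
      // Copy on the way in, so a later transfer of msg.bytes cannot detach the cache.
      _cachedBytes = msg.bytes.slice();
      return { type: 'cached', byteLength: _cachedBytes.byteLength };
    case 'reprocess':
      if (!_cachedBytes || _cachedBytes.byteLength === 0) {
        // Tag the error so the main thread can route it to the reprocess UI
        // instead of an already-settled extraction promise.
        return { type: 'error', reprocess: true, message: 'No cached PDF bytes' };
      }
      // Hand out a fresh copy each time; the cache survives any transfer of it.
      return { type: 'ready', bytes: _cachedBytes.slice() };
    default:
      return { type: 'error', message: `Unknown message type: ${msg.type}` };
  }
}
```

Copying on both the way in and the way out costs one extra allocation per call, which seems a reasonable trade for never holding a view that another thread can detach.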

The pattern behind all three

All three bugs shared a structural cause: the new code path (Analysis panel) ran alongside existing code that made different assumptions about its environment.

The postMessage assumption: adding more data to a message is free. Not true when the data is megabytes.

The step-ordering assumption: detecting dividers before text regions is fine because dividers are geometric, not textual. Not true when the detector's guard relies on text regions.

The bytes assumption: storing a reference to the input bytes keeps them available. Not true when the JS runtime transfers the underlying buffer.

Each assumption was correct in isolation and wrong in the new context. The fix in each case was small — a .slice(), a moved block, two deleted lines. The cost was in discovering which assumption had broken.
