The ArrayBuffer Detach Bug: Why Your Cached Worker Bytes Go Empty After the First PDF
TL;DR
pdfjsLib.getDocument({ data: bytes }) transfers the underlying ArrayBuffer of the Uint8Array to the PDF.js sub-worker. After this call, the original bytes reference in your worker has byteLength === 0. If you cache that reference for later use (single-page re-extraction, for example), every subsequent read sees an empty buffer. Fix: cache bytes.slice() before calling getDocument(), not after.
Symptom
Clicking "Re-extract page" produced no visible response. No error in the console. Nothing from the worker that the main thread acted on. The button stayed in its loading state indefinitely. Loading a second PDF worked for the initial extraction, but re-extract still failed silently.
Root cause
JavaScript's postMessage with transfer semantics, used by PDF.js internally when it spawns its parsing sub-worker, detaches the source ArrayBuffer. The specification says:
Each ArrayBuffer in the transfer list is neutered. Its [[ArrayBufferData]] internal slot is set to null and its [[ArrayBufferByteLength]] is set to 0.
This is not a bug. It is intentional: transfer exists so that large buffers can move between threads without a copy. But it means any Uint8Array view that was pointing at that buffer now has byteLength === 0. Indexed reads on it return undefined, and some typed-array methods throw a TypeError, so the failure mode depends on what you do with it.
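The detach is easy to reproduce without PDF.js. The sketch below uses structuredClone with a transfer list as a stand-in for the postMessage transfer that PDF.js performs internally; the helper name transferAway is made up for illustration.

```javascript
// Hypothetical helper: moves the buffer behind `view` elsewhere, the way a
// postMessage(..., [buffer]) transfer would. structuredClone with a
// `transfer` list detaches the source buffer in the same way.
function transferAway(view) {
  return structuredClone(view.buffer, { transfer: [view.buffer] });
}

const bytes = new Uint8Array([0x25, 0x50, 0x44, 0x46]); // "%PDF"
const lengthBefore = bytes.byteLength; // 4

const moved = transferAway(bytes);

const lengthAfter = bytes.byteLength; // 0: the view now sits on a detached buffer
const firstByteAfter = bytes[0];      // undefined: indexed reads do not throw
const movedLength = moved.byteLength; // 4: the data lives on in the transferred buffer

console.log({ lengthBefore, lengthAfter, firstByteAfter, movedLength });
```

Note that nothing here throws. The view simply becomes empty, which is why the problem stays invisible until something downstream tries to parse it.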
The worker stored:
let _cachedBytes = null;
self.onmessage = async (e) => {
  const { bytes } = e.data;
  _cachedBytes = bytes; // direct reference, not a copy
  const pdf = await pdfjsLib.getDocument({ data: bytes }).promise;
  // bytes is now detached, so _cachedBytes points to an empty buffer
  // ...
};
When _handleReprocess later called:
const pdf = await pdfjsLib.getDocument({ data: _cachedBytes }).promise;
PDF.js received a zero-length Uint8Array and rejected the load as an empty file (the exact error message depends on the PDF.js version). The worker's try/catch caught the rejection and posted { type: 'error', error: ... }, but by then the main thread's handler for that message was dead: the original extraction had already resolved its promise, so the error had nowhere to go. Silent swallow.
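The swallow follows from how promises settle: once resolve has run, a later reject on the same promise does nothing. A minimal sketch, using a hypothetical makeExtraction wrapper rather than the real fileUpload.js code:

```javascript
// Hypothetical stand-in for the extraction promise: it is resolved when the
// first PDF finishes and is later handed an error from a reprocess run.
function makeExtraction() {
  let resolve, reject;
  const promise = new Promise((res, rej) => { resolve = res; reject = rej; });
  return { promise, resolve, reject };
}

async function demo() {
  const extraction = makeExtraction();
  extraction.resolve('extraction done');            // first PDF finished
  extraction.reject(new Error('reprocess failed')); // late error: ignored
  return extraction.promise;                        // still fulfils with the first value
}

demo().then((value) => console.log(value));
```

No unhandled rejection is reported either, because the promise never becomes rejected. The late error is simply dropped.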
Fix
// Cache bytes BEFORE pdfjsLib.getDocument() transfers the buffer
_cachedBytes = bytes.slice(); // independent copy, not a view of the same buffer
const pdf = await pdfjsLib.getDocument({ data: bytes, ...canvasFactoryOpt }).promise;
.slice() on a Uint8Array creates a new ArrayBuffer and copies the data. The cached copy is independent of the transfer. After getDocument() detaches bytes, _cachedBytes is unaffected.
Note the order: cache first, then pass to getDocument. Not the other way around.
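The same reasoning suggests that a later reprocess call should also hand PDF.js a fresh copy, since passing _cachedBytes itself would let the cached buffer be transferred on first reuse. A sketch of both steps, with a hypothetical loadAndTransfer function standing in for getDocument (it takes ownership of its input in the same way, but is not PDF.js API):

```javascript
// Hypothetical stand-in for a library call that takes ownership of its input.
function loadAndTransfer(view) {
  const moved = structuredClone(view.buffer, { transfer: [view.buffer] });
  return moved.byteLength; // pretend this is the parsed document
}

let cachedBytes = null;

function firstLoad(bytes) {
  cachedBytes = bytes.slice();   // copy first...
  return loadAndTransfer(bytes); // ...then give the original away
}

function reload() {
  // Hand over a fresh copy each time so the cache itself is never detached.
  return loadAndTransfer(cachedBytes.slice());
}

// Every load sees the full 8 bytes, and the cache still holds 8 afterwards.
const sizes = [firstLoad(new Uint8Array(8)), reload(), reload()];
console.log(sizes, cachedBytes.byteLength);
```

The trade-off is one extra copy per reload, in exchange for a cache that stays valid for the life of the worker.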
Guard
Three guards to add whenever you cache bytes in a worker that uses pdfjs-dist (or any library that transfers buffers internally):
1. Always slice before caching:
_cachedBytes = bytes.slice(); // not _cachedBytes = bytes
2. Validate before use:
async function _handleReprocess({ page }) {
  if (!_cachedBytes || _cachedBytes.byteLength === 0) {
    self.postMessage({ type: 'error', error: 'No cached bytes. Run full extraction first.' });
    return;
  }
  // ...
}
3. Route worker errors separately from the main extraction promise:
In fileUpload.js, the handler for the main extraction promise was assigned to worker.onmessage and overwritten on each call. Error messages from a later reprocess operation arrived on the same handler and called the reject of a promise that had already resolved, which is a no-op. The fix is to check msg.reprocess before routing:
worker.onmessage = (e) => {
  const msg = e.data;
  if (msg.type === 'page' && msg.reprocess) {
    onReprocessResult(msg.page, msg.html, msg.regions, msg.pageScale);
    return;
  }
  // Assumption: the worker also sets reprocess: true on errors it posts
  // from the reprocess path, so they can be told apart here.
  if (msg.type === 'error' && msg.reprocess) {
    // show toast to user, reset re-extract button
    return;
  }
  // normal extraction routing below
};
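One way to keep that routing testable is to pull it into a pure function. The sketch below assumes the worker tags every reprocess message, including errors, with reprocess: true; the routeWorkerMessage name and the handler names are hypothetical.

```javascript
// Hypothetical router: decides which handler gets a worker message.
// Assumes reprocess replies (pages and errors) carry `reprocess: true`.
function routeWorkerMessage(msg, handlers) {
  if (msg.reprocess && msg.type === 'page') {
    handlers.onReprocessResult(msg);
    return 'reprocess-result';
  }
  if (msg.reprocess && msg.type === 'error') {
    handlers.onReprocessError(msg.error); // e.g. show a toast, reset the button
    return 'reprocess-error';
  }
  handlers.onExtraction(msg); // everything else belongs to the main extraction
  return 'extraction';
}

// Usage with stub handlers that just record what they received.
const seen = [];
const handlers = {
  onReprocessResult: (m) => seen.push(['result', m.page]),
  onReprocessError: (err) => seen.push(['error', err]),
  onExtraction: (m) => seen.push(['extraction', m.type]),
};

const routes = [
  routeWorkerMessage({ type: 'page', reprocess: true, page: 3 }, handlers),
  routeWorkerMessage({ type: 'error', reprocess: true, error: 'No cached bytes.' }, handlers),
  routeWorkerMessage({ type: 'page', page: 1 }, handlers),
];
console.log(routes);
```

In the real handler this could be wired up as worker.onmessage = (e) => routeWorkerMessage(e.data, handlers), so a reprocess error never reaches the extraction promise.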
Lesson
Any JavaScript library that uses Web Workers internally and accepts a Uint8Array may transfer its buffer. The signature getDocument({ data: bytes }) gives no indication that bytes will be detached by the call. This is a category of bug that only appears at runtime with real data, is silent by default (no exception is thrown at the call site), and is hard to catch with static analysis, because nothing in the types marks the buffer as consumed.
When caching input data for re-use in a worker: always .slice(). The cost is a memory copy. The cost of not doing it is a category of bug that produces no errors and wrong behavior hours after the session that introduced it.