How to Detect WARNING and CAUTION Boxes in PDFs Using Only Vector Geometry
TLDR: ctmAdapter.js tracks fillColor through q/Q save/restore stacks. Any re (rectangle) followed by a fill paint op becomes a filledRect. contextClassifier pairs rects with enclosed text to classify BOX regions with roles (warning/caution/note). pageAssembler emits <aside class="pdf-box pdf-box--warning">.
Repo: tools/pdf-processor
The Problem
The PDF text stream gives you text items and their X/Y positions. It does not tell you that "NOTICE" is rendered inside a bordered rectangle with a blue background.
The structural signal that makes a warning box a warning box is invisible to the text extractor.
The Naive Attempt
Keyword matching. If a text region contains "WARNING", "CAUTION", or "NOTICE", treat it as a warning box.
This finds text that mentions warnings, not text that is a warning. A sentence reading "See WARNING section before proceeding" would fire the heuristic. An uncaptioned warning box without a keyword header would be missed entirely.
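Both failure modes are easy to reproduce. The snippet below is only a sketch of the naive heuristic; the function name and the sample strings are made up for illustration.

```javascript
// Naive heuristic: a region is a "warning box" if its text contains a keyword.
const KEYWORDS = /WARNING|CAUTION|NOTICE/;

function naiveIsWarningBox(regionText) {
  return KEYWORDS.test(regionText);
}

// Body text that merely mentions a warning: flagged anyway.
const falsePositive = naiveIsWarningBox('See WARNING section before proceeding');

// A real warning box with no keyword header: missed.
const falseNegative = naiveIsWarningBox('Do not operate without proper ventilation.');

console.log({ falsePositive, falseNegative });
```

The first call returns true for ordinary body text and the second returns false for genuine box content.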
The Real Structure Is in the Operator List
PDFs describe drawing commands in a stream of operators. getOperatorList() from PDF.js exposes these. Among them:
re: defines a rectangle path by x, y, width, height
f / F / eoFill / fillStroke: paint the current path as filled
rg / RG / g / G / k / K: set the color (lowercase operators set the fill color, uppercase the stroke color)
Tracking Color Through Save/Restore
Fill color is part of the graphics state, which lives on a stack. q saves the current state; Q restores it. Without tracking the stack, a color set inside a q...Q block leaks out of it, and every rectangle painted after the Q is recorded with the wrong color.
ctmAdapter.js maintains a colorStateStack:
const colorStateStack = [{ fill: [0,0,0], stroke: [0,0,0] }];
case OPS.save:
  colorStateStack.push({ fill: fillColor.slice(), stroke: strokeColor.slice() });
  break;
case OPS.restore: {
  // Braces keep `prev` scoped to this case.
  const prev = colorStateStack.pop();
  if (prev) {
    fillColor = prev.fill.slice();
    strokeColor = prev.stroke.slice();
  }
  break;
}
When re fires, we record the rectangle with the current fillColor. When a paint op fires, we commit it to filledRects. The color is always the one active at the moment of painting.
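To see the whole loop in one place, here is a minimal, self-contained sketch. It is not the code in ctmAdapter.js. The OPS table is a stand-in with arbitrary values so the example runs without PDF.js, the operator stream is a hand-written mock, and the shape of real PDF.js operator arguments may differ between releases. This version stamps the color onto each rectangle when the paint op arrives.

```javascript
// Stand-in opcode table (assumption): real code would use the OPS export of PDF.js.
const OPS = { save: 1, restore: 2, setFillRGBColor: 3, rectangle: 4, fill: 5, eoFill: 6, fillStroke: 7 };
const PAINT_OPS = new Set([OPS.fill, OPS.eoFill, OPS.fillStroke]);

function collectFilledRects({ fnArray, argsArray }) {
  let fillColor = [0, 0, 0];   // PDF default fill color: black
  const stack = [];            // one saved fill color per open q
  let pending = [];            // rectangles in the current, not yet painted path
  const filledRects = [];

  fnArray.forEach((fn, i) => {
    const args = argsArray[i] || [];
    if (fn === OPS.save) {
      stack.push(fillColor.slice());
    } else if (fn === OPS.restore) {
      const prev = stack.pop();
      if (prev) fillColor = prev;
    } else if (fn === OPS.setFillRGBColor) {
      fillColor = args.slice(0, 3);
    } else if (fn === OPS.rectangle) {
      const [x, y, w, h] = args;
      pending.push({ x, y, w, h });
    } else if (PAINT_OPS.has(fn)) {
      // Commit every pending rectangle with the color active at paint time.
      for (const r of pending) filledRects.push({ ...r, fill: fillColor.slice() });
      pending = [];
    }
  });
  return filledRects;
}

// Mock stream equivalent to: q  1 0.9 0.9 rg  re f  Q  re f
const ops = {
  fnArray: [OPS.save, OPS.setFillRGBColor, OPS.rectangle, OPS.fill, OPS.restore, OPS.rectangle, OPS.fill],
  argsArray: [[], [1, 0.9, 0.9], [50, 600, 400, 80], [], [], [50, 400, 400, 80], []],
};
const rects = collectFilledRects(ops);
console.log(rects.map(r => r.fill));
```

The first rectangle gets the pink fill set inside the q...Q block. The second is painted after Q, so it falls back to the default black instead of inheriting the pink.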
The Black Fill Problem
PDF's default fill color is [0,0,0] — black. Any rectangle drawn before a fill color is explicitly set will have a black background. This is not a semantic box; it is a PDF default state artifact.
We filter both extremes:
const isNeutral = !fc
|| fc.every(c => c > 0.92) // near-white (page color)
|| fc.every(c => c < 0.08); // near-black (PDF default)
Only chromatic or tinted fills carry semantic meaning.
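Wrapping the check in a function makes the thresholds easy to probe. This is a sketch: the function name and the sample colors are illustrative, and the components are assumed to be in the 0 to 1 range.

```javascript
// Same predicate as above, as a standalone function.
function isNeutralFill(fc) {
  return !fc
    || fc.every(c => c > 0.92)   // near-white (page color)
    || fc.every(c => c < 0.08);  // near-black (PDF default)
}

const samples = {
  white: [1, 1, 1],
  defaultBlack: [0, 0, 0],
  paleRed: [1, 0.96, 0.96],
  tintedRed: [1, 0.9, 0.9],
  midGray: [0.5, 0.5, 0.5],
};

// Names of the fills that survive the filter.
const kept = Object.keys(samples).filter(name => !isNeutralFill(samples[name]));
console.log(kept);
```

Two edge cases are worth knowing about. A very pale tint such as [1, 0.96, 0.96] has every component above 0.92, so it is rejected as near-white. A mid-gray fill is kept even though it is not chromatic.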
Classification: BOX Role Detection
contextClassifier pairs filledRects with text items by containment. A rectangle enclosing 2+ non-space text items that passes the page-frame guard becomes a BOX region. Role is inferred from text content:
if (/warning|danger/i.test(text)) role = 'warning';
else if (/caution/i.test(text)) role = 'caution';
else if (/note|notice|important/i.test(text)) role = 'note';
else if (/tip|hint/i.test(text)) role = 'tip';
else role = 'generic';
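The containment step itself can be sketched as follows. This is an assumption-laden illustration, not the contextClassifier source: the text item shape ({ str, x, y }), the rectangle shape ({ x, y, w, h }), and the choice to test only each item's origin point are all hypothetical. The page-frame guard is left out here.

```javascript
// Role inference, same patterns and order as above: first match wins.
function inferRole(text) {
  if (/warning|danger/i.test(text)) return 'warning';
  if (/caution/i.test(text)) return 'caution';
  if (/note|notice|important/i.test(text)) return 'note';
  if (/tip|hint/i.test(text)) return 'tip';
  return 'generic';
}

// Returns a BOX region if the rect encloses at least two non-space text items, else null.
function classifyBox(rect, textItems) {
  const inside = textItems.filter(t =>
    t.str.trim() !== '' &&
    t.x >= rect.x && t.x <= rect.x + rect.w &&
    t.y >= rect.y && t.y <= rect.y + rect.h);
  if (inside.length < 2) return null;
  const text = inside.map(t => t.str).join(' ');
  return { type: 'BOX', role: inferRole(text), text };
}

const rect = { x: 50, y: 600, w: 400, h: 80 };
const items = [
  { str: 'CAUTION', x: 60, y: 660 },
  { str: ' ', x: 120, y: 660 },
  { str: 'Hot surface.', x: 60, y: 640 },
  { str: 'See WARNING section before proceeding', x: 60, y: 500 }, // outside the rect
];
const box = classifyBox(rect, items);
console.log(box);
```

The sentence mentioning WARNING sits outside the rectangle, so it does not influence the role. Note that the patterns are unanchored substrings: /tip/ also matches "multiple", so adding word boundaries (\b) would be one way to tighten them.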
The Page-Frame Guard
Many PDF pages have an outer border rectangle spanning nearly the full width. Without a guard, it would match as a BOX and consume all text on the page.
const _isPageFrame = (bx, bw) =>
  (bx < vpW * 0.04 && bw > vpW * 0.65) || bw > vpW * 0.88;
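Any rectangle that starts within 4% of the left edge and spans more than 65% of the page width, or that spans more than 88% of the page width wherever it starts, is treated as the page border rather than a content box.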
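A few worked values show where the thresholds fall. In this sketch the viewport width is passed as a parameter instead of being read from the enclosing scope, and 612 points (US Letter) is an assumed page width.

```javascript
// Page-frame guard with the viewport width passed explicitly.
function isPageFrame(bx, bw, vpW) {
  return (bx < vpW * 0.04 && bw > vpW * 0.65) || bw > vpW * 0.88;
}

const vpW = 612; // US Letter width in points: 4% = 24.48, 65% = 397.8, 88% = 538.56

const cases = {
  outerBorder: isPageFrame(18, 576, vpW),    // starts near the edge and is wide
  marginBox: isPageFrame(72, 468, vpW),      // 1-inch margins: an ordinary content box
  wideFromEdge: isPageFrame(20, 420, vpW),   // starts near the edge, over 65% wide
  nearFullWidth: isPageFrame(40, 550, vpW),  // not near the edge, but over 88% wide
};
console.log(cases);
```

Only marginBox passes through as a candidate content box. The other three are treated as page frames.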
Output
pageAssembler renders BOX regions as:
<aside class="pdf-box pdf-box--warning f2 ta-l" style="background:rgb(255,245,245)">
<p><strong>WARNING</strong></p>
<p>Do not operate without proper ventilation.</p>
</aside>
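A renderer producing markup of this shape might look like the sketch below. It is not the pageAssembler source: the box object shape (role, fill, title, paragraphs) is hypothetical, and the f2 and ta-l utility classes are omitted because how they are derived is not described here.

```javascript
// Escape text so box content cannot break the generated markup.
function escapeHtml(s) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };
  return s.replace(/[&<>"]/g, ch => map[ch]);
}

// box: { role, fill: [r, g, b] in 0..1, title, paragraphs: string[] } (hypothetical shape)
function renderBox(box) {
  const [r, g, b] = box.fill.map(c => Math.round(c * 255));
  const body = box.paragraphs.map(p => `<p>${escapeHtml(p)}</p>`).join('\n');
  return `<aside class="pdf-box pdf-box--${box.role}" style="background:rgb(${r},${g},${b})">\n` +
    `<p><strong>${escapeHtml(box.title)}</strong></p>\n` +
    `${body}\n</aside>`;
}

const html = renderBox({
  role: 'warning',
  fill: [1, 0.8, 0.8],
  title: 'WARNING',
  paragraphs: ['Do not operate without proper ventilation.', 'Keep inlet temperature < 40 C.'],
});
console.log(html);
```

The role is interpolated into the class name unescaped, which is only safe because it comes from the fixed set produced by the classifier.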
No OCR. No ML. Pure vector geometry, available in any PDF that draws its boxes as vector paths.