Engineering Journal
Pdf Processor

How to Detect WARNING and CAUTION Boxes in PDFs Using Only Vector Geometry

2026-05-15

TLDR: ctmAdapter.js tracks fillColor through q/Q save/restore stacks. Any re (rectangle) followed by a fill paint op becomes a filledRect. contextClassifier pairs rects with enclosed text to classify BOX regions with roles (warning/caution/note). pageAssembler emits <aside class="pdf-box pdf-box--warning">.

Repo: tools/pdf-processor

The Problem

The PDF text stream gives you text items and their X/Y positions. It does not tell you that "NOTICE" is rendered inside a bordered rectangle with a blue background.

The structural signal that makes a warning box a warning box is invisible to the text extractor.


The Naive Attempt

Keyword matching. If a text region contains "WARNING", "CAUTION", or "NOTICE", treat it as a warning box.

This finds text that mentions warnings, not text that is a warning. A sentence reading "See WARNING section before proceeding" would fire the heuristic. An uncaptioned warning box without a keyword header would be missed entirely.
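Both failure modes are easy to reproduce. The snippet below is a minimal sketch of the keyword heuristic; the function name looksLikeWarning and the second sample string are made up for illustration.

```javascript
// Naive heuristic: flag any text region that mentions a warning keyword.
const looksLikeWarning = (text) => /WARNING|CAUTION|NOTICE/.test(text);

// False positive: ordinary body text that merely refers to a warning.
const bodySentence = "See WARNING section before proceeding";
// False negative: a boxed warning that has no keyword header.
const boxedText = "Do not operate without proper ventilation.";

console.log(looksLikeWarning(bodySentence)); // true
console.log(looksLikeWarning(boxedText));    // false
```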


The Real Structure Is in the Operator List

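PDFs describe drawing commands in a stream of operators, and getOperatorList() from PDF.js exposes that stream. The operators that matter here are q and Q (save and restore graphics state), the fill-color setters, re (rectangle), and the fill paint operators.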

A warning box in a PDF is: set fill color → define rectangle → fill. That sequence is deterministic.
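To make that sequence concrete, here is a minimal sketch over a simplified operator stream. The op names (setFill, rect, fill) and the { op, args } record shape are hypothetical stand-ins chosen for readability; they are not the numeric OPS codes or the data layout that PDF.js actually returns, and findFilledRects is not the code in ctmAdapter.js.

```javascript
// Hypothetical simplified operator stream (not the real PDF.js format).
const demoOps = [
  { op: "setFill", args: [1.0, 0.8, 0.8] },   // set fill color (RGB, 0..1)
  { op: "rect", args: [72, 500, 468, 80] },   // define rectangle: x, y, w, h
  { op: "fill", args: [] },                   // paint the current path
  { op: "rect", args: [72, 300, 468, 40] },   // defined but never filled
];

function findFilledRects(ops) {
  const filledRects = [];
  let fillColor = [0, 0, 0]; // PDF default fill is black
  let pending = [];          // rectangles defined since the last paint op

  for (const { op, args } of ops) {
    if (op === "setFill") {
      fillColor = args.slice();
    } else if (op === "rect") {
      const [x, y, w, h] = args;
      pending.push({ x, y, w, h });
    } else if (op === "fill") {
      // Commit every pending rectangle with the color active at paint time.
      for (const r of pending) {
        filledRects.push({ ...r, fill: fillColor.slice() });
      }
      pending = [];
    }
  }
  // Rectangles still pending here were never painted, so they are dropped.
  return filledRects;
}

console.log(findFilledRects(demoOps).length); // 1
```

Only the first rectangle is reported, because the second one is never followed by a fill.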

Tracking Color Through Save/Restore

Fill color state lives inside a graphics stack. q saves the current state; Q restores it. Without tracking the stack, we get the wrong color for any rectangle drawn inside a q...Q block.

ctmAdapter.js maintains a colorStateStack:

const colorStateStack = [{ fill: [0,0,0], stroke: [0,0,0] }];

case OPS.save:
    // q: remember the colors in effect before the block
    colorStateStack.push({ fill: fillColor.slice(), stroke: strokeColor.slice() });
    break;
case OPS.restore: {
    // Q: return to the colors saved by the matching q
    const prev = colorStateStack.pop();
    if (prev) { fillColor = prev.fill.slice(); strokeColor = prev.stroke.slice(); }
    break;
}

When re fires, we record the rectangle with the current fillColor. When a paint op fires, we commit it to filledRects. The color is always the one active at the moment of painting.
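The effect of the stack is easiest to see on a nested example. This is a self-contained sketch, again using hypothetical string op names rather than the PDF.js OPS constants; trackFillColors is an illustrative helper that only reports which fill color is active at each rectangle.

```javascript
// Reports the fill color in effect at each "rect" op, honoring save/restore.
function trackFillColors(ops) {
  const stack = [];
  let fillColor = [0, 0, 0];
  let strokeColor = [0, 0, 0];
  const seen = [];

  for (const { op, args } of ops) {
    switch (op) {
      case "save":
        stack.push({ fill: fillColor.slice(), stroke: strokeColor.slice() });
        break;
      case "restore": {
        const prev = stack.pop();
        if (prev) { fillColor = prev.fill; strokeColor = prev.stroke; }
        break;
      }
      case "setFill":
        fillColor = args.slice();
        break;
      case "rect":
        seen.push(fillColor.slice());
        break;
    }
  }
  return seen;
}

const nestedOps = [
  { op: "setFill", args: [1, 0.9, 0.6] },   // amber, set outside any q...Q block
  { op: "save" },
  { op: "setFill", args: [0.8, 0.9, 1] },   // blue, only valid inside the block
  { op: "rect", args: [72, 600, 200, 50] },
  { op: "restore" },
  { op: "rect", args: [72, 500, 200, 50] }, // back to amber after Q
];

// First rect sees blue, second rect sees amber again.
console.log(trackFillColors(nestedOps));
```

Without the restore handling, the second rectangle would be recorded as blue.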


The Black Fill Problem

PDF's default fill color is [0,0,0] — black. Any rectangle drawn before a fill color is explicitly set will have a black background. This is not a semantic box; it is a PDF default state artifact.

We filter both extremes:

const isNeutral = !fc
    || fc.every(c => c > 0.92)   // near-white (page color)
    || fc.every(c => c < 0.08);  // near-black (PDF default)

Only chromatic or tinted fills carry semantic meaning.
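A quick way to sanity-check the thresholds is to wrap the expression in a function and run a few colors through it. The sample colors below are invented for illustration and assume RGB components in the 0..1 range.

```javascript
// fc is an RGB triple with components in 0..1, or null/undefined if unknown.
function isNeutral(fc) {
  return !fc
    || fc.every(c => c > 0.92)   // near-white (page color)
    || fc.every(c => c < 0.08);  // near-black (PDF default)
}

const sampleFills = {
  defaultBlack: [0, 0, 0],
  pageWhite: [1, 1, 1],
  lightBlue: [0.85, 0.92, 1],
  veryPaleRed: [1, 0.96, 0.96],
  midGray: [0.5, 0.5, 0.5],
};

for (const [name, fc] of Object.entries(sampleFills)) {
  console.log(name, isNeutral(fc) ? "filtered" : "kept");
}
```

Two edge cases are worth knowing about. A very pale tint such as [1, 0.96, 0.96] exceeds 0.92 on every channel and is filtered as near-white, while a mid-gray is kept even though it is not chromatic. Depending on the documents being processed, the cutoffs may need tuning.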


Classification: BOX Role Detection

contextClassifier pairs filledRects with text items by containment. A rectangle enclosing 2+ non-space text items that passes the page-frame guard becomes a BOX region. Role is inferred from text content:

if (/warning|danger/i.test(text))                  role = 'warning';
else if (/caution/i.test(text))                    role = 'caution';
else if (/note|notice|important/i.test(text))      role = 'note';
else if (/tip|hint/i.test(text))                   role = 'tip';
else                                               role = 'generic';
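Putting containment and role inference together might look like the sketch below. The data shapes are assumptions: rectangles as { x, y, w, h } and text items as { str, x, y } in the same coordinate space. The helper names (inferRole, contains, classifyBoxes) are illustrative and not taken from contextClassifier, and the page-frame guard is left out to keep the example short.

```javascript
// Role inference, mirroring the regex chain above.
function inferRole(text) {
  if (/warning|danger/i.test(text)) return "warning";
  if (/caution/i.test(text)) return "caution";
  if (/note|notice|important/i.test(text)) return "note";
  if (/tip|hint/i.test(text)) return "tip";
  return "generic";
}

// True when the text item's anchor point lies inside the rectangle.
const contains = (r, it) =>
  it.x >= r.x && it.x <= r.x + r.w && it.y >= r.y && it.y <= r.y + r.h;

function classifyBoxes(rects, items) {
  const boxes = [];
  for (const rect of rects) {
    // Ignore whitespace-only items when counting enclosed text.
    const inside = items.filter(it => it.str.trim() !== "" && contains(rect, it));
    if (inside.length < 2) continue; // need 2+ non-space text items
    const text = inside.map(it => it.str).join(" ");
    boxes.push({ type: "BOX", role: inferRole(text), rect, text });
  }
  return boxes;
}

const sampleRects = [{ x: 72, y: 500, w: 468, h: 80 }];
const sampleItems = [
  { str: "WARNING", x: 80, y: 560 },
  { str: "Do not operate without proper ventilation.", x: 80, y: 540 },
  { str: " ", x: 80, y: 520 },
  { str: "Body text outside the box.", x: 80, y: 100 },
];

console.log(classifyBoxes(sampleRects, sampleItems)[0].role); // "warning"
```

One caveat with unanchored patterns: they match substrings, so "multiple" matches /tip/ and "denote" matches /note/. Adding word boundaries (for example /\btip\b/i) is one possible way to tighten this.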

The Page-Frame Guard

Many PDF pages draw an outer border rectangle spanning nearly the full page width. Without a guard, it would match as a BOX and consume all text on the page.

const _isPageFrame = (bx, bw) =>
    (bx < vpW * 0.04 && bw > vpW * 0.65) || bw > vpW * 0.88;

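Any rectangle that starts within 4% of the left edge and spans more than 65% of the page width, or that spans more than 88% of the page width wherever it starts, is treated as the page border, not a content box.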
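Worked through with numbers, assuming a viewport width of 612 (US Letter in PDF points, chosen here only as an example), the guard behaves as follows. The factory function makePageFrameGuard is a hypothetical wrapper that makes vpW explicit.

```javascript
// vpW is the viewport (page) width in the same units as bx and bw.
const makePageFrameGuard = (vpW) => (bx, bw) =>
  (bx < vpW * 0.04 && bw > vpW * 0.65) || bw > vpW * 0.88;

const isPageFrame = makePageFrameGuard(612);

console.log(isPageFrame(18, 576)); // true: starts near the left edge, about 94% wide
console.log(isPageFrame(72, 468)); // false: inset box, about 76% wide
console.log(isPageFrame(10, 300)); // false: near the edge but under 65% wide
```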


Output

pageAssembler renders BOX regions as:

<aside class="pdf-box pdf-box--warning f2 ta-l" style="background:rgb(255,245,245)">
  <p><strong>WARNING</strong></p>
  <p>Do not operate without proper ventilation.</p>
</aside>
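A rendering step along these lines could produce that markup. This is a sketch under assumptions, not the pageAssembler source: the region shape ({ role, fill, heading, paragraphs }) and the names renderBox and escapeHtml are invented here, and the f2 and ta-l utility classes are omitted because the post does not say how they are chosen.

```javascript
// Escape text so it is safe to place inside HTML element content.
const escapeHtml = (s) =>
  s.replace(/&/g, "&amp;")
   .replace(/</g, "&lt;")
   .replace(/>/g, "&gt;")
   .replace(/"/g, "&quot;");

function renderBox({ role, fill, heading, paragraphs }) {
  // Convert 0..1 RGB components to 0..255 integers for CSS.
  const [r, g, b] = fill.map(c => Math.round(c * 255));
  const lines = [
    `<aside class="pdf-box pdf-box--${role}" style="background:rgb(${r},${g},${b})">`,
  ];
  if (heading) lines.push(`  <p><strong>${escapeHtml(heading)}</strong></p>`);
  for (const p of paragraphs) lines.push(`  <p>${escapeHtml(p)}</p>`);
  lines.push("</aside>");
  return lines.join("\n");
}

console.log(renderBox({
  role: "warning",
  fill: [1, 0.96, 0.96],
  heading: "WARNING",
  paragraphs: ["Do not operate without proper ventilation."],
}));
```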

No OCR. No ML. Pure vector geometry — available in any PDF.
