Detecting PDF Headers and Footers Without Relying on Page Margins
TLDR: Margin-position thresholds alone break on full-bleed titles and non-standard layouts. Within the top/bottom zone, require at least one of three signals: font size smaller than 0.78× the body average, a wide background fill rect covering the region's Y band, or a pattern match for page numbers or dates. Always reset columnIndex = -1 when reclassifying, because the post-pass runs after column detection.
Repo: tools/pdf-processor
The Problem
The obvious approach: if a region's Y center is within N% of the top or bottom of the page, it is a header or footer.
This breaks on pages where a full-bleed title block spans 25% of the page from the top, where a large figure caption sits near the bottom margin, or where the document has no header and body content starts immediately.
The Naive Position Threshold
if (region.yCenter < vpH * 0.08) region.type = RegionType.HEADER;
if (region.yCenter > vpH * 0.92) region.type = RegionType.FOOTER;
The threshold is arbitrary. An 8% threshold on an A4 page (842pt tall) at 1.5× scale is about 101px. A two-line header with a logo easily exceeds that.
Three Signals Instead of One
Signal 1: Relative font size
Headers and footers typically use smaller fonts than body text. If the region's average font size is less than 0.78× the page body font size, the signal fires. Position is checked separately when the signals are combined.
const smallFont = avgFontPt < bodyFontPt * 0.78;
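The post does not show how avgFontPt is derived, so the following is only a sketch under assumptions: the TextRun shape, the character-weighted mean, and the isSmallFont helper are hypothetical names, not the repo's actual code. Weighting by character count keeps a single oversized drop cap or a tiny superscript from skewing the average.

```typescript
// Hypothetical shape: the source does not show how text runs are stored.
interface TextRun {
  text: string;
  fontPt: number;
}

// Character-weighted mean font size, so long runs count more than short ones.
function weightedFontPt(runs: TextRun[]): number {
  let chars = 0;
  let total = 0;
  for (const run of runs) {
    const n = run.text.replace(/\s/g, '').length;
    chars += n;
    total += n * run.fontPt;
  }
  return chars === 0 ? 0 : total / chars;
}

// Signal 1: region font is below `ratio` times the page body font.
// Regions with no measurable text never fire the signal.
function isSmallFont(regionRuns: TextRun[], bodyFontPt: number, ratio = 0.78): boolean {
  const avgFontPt = weightedFontPt(regionRuns);
  return avgFontPt > 0 && avgFontPt < bodyFontPt * ratio;
}
```

With an 11pt body font the cutoff is 8.58pt, so an 8pt running footer fires the signal and a 10pt caption does not.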
Signal 2: Background color band
Many PDFs render headers and footers on colored bands: wide filled rectangles at the top or bottom of the page. We already track filledRects from ctmAdapter. A rect wider than 60% of the viewport whose vertical extent covers the region's Y range (within a tolerance tol) is strong evidence.
const inColoredBand = filledRects.some(r =>
r.w > vpW * 0.6 &&
r.y <= regionTop + tol &&
r.y + r.h >= regionBottom - tol
);
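For a concrete feel of the band check, here is a self-contained sketch. The Rect shape, the default tol of 2px, and the function name are assumptions for illustration. It also assumes viewport coordinates with y growing downward, so r.y is a rect's top edge, which is what the inequalities above imply.

```typescript
// Hypothetical rect shape in viewport pixels, origin at top-left.
interface Rect {
  x: number;
  y: number;
  w: number;
  h: number;
}

// Signal 2: some wide filled rect vertically covers the region, within `tol` px.
function coversRegion(
  filledRects: Rect[],
  regionTop: number,
  regionBottom: number,
  vpW: number,
  tol = 2,
): boolean {
  return filledRects.some(r =>
    r.w > vpW * 0.6 &&                 // wide enough to be a band, not a table cell
    r.y <= regionTop + tol &&          // band starts at or above the region
    r.y + r.h >= regionBottom - tol    // band ends at or below the region
  );
}
```

A 60px-tall band across a 918px-wide viewport covers a region spanning y=12 to y=40. A 200px-wide rect of the same height does not count, and neither does a full-width band that is only 20px tall.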
Signal 3: Pattern match
Page numbers, dates, and document references follow predictable patterns:
const patternMatch = /\bpage\s+\d+|\bpg\.?\s\d+|\b\d+\s+of\s+\d+|\b\d{4}-\d{2}-\d{2}|\brev\.\s\w+/i.test(regionText);
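It helps to see what the pattern does and does not catch. The sample strings below are made up for illustration; the regex is the one above, unchanged.

```typescript
const HEADER_FOOTER_PATTERN =
  /\bpage\s+\d+|\bpg\.?\s\d+|\b\d+\s+of\s+\d+|\b\d{4}-\d{2}-\d{2}|\brev\.\s\w+/i;

// [input, expected match] pairs; the inputs are invented examples.
const samples: Array<[string, boolean]> = [
  ['Page 12', true],
  ['pg. 4', true],
  ['3 of 27', true],
  ['Issued 2024-03-15', true],
  ['Rev. B', true],
  ['Figure 3: throughput by region', false],
  ['12', false], // a bare page number has no keyword to anchor on
];

for (const [text, expected] of samples) {
  const actual = HEADER_FOOTER_PATTERN.test(text);
  console.log(`${JSON.stringify(text)} -> ${actual} (expected ${expected})`);
}
```

Two things stand out. A bare number such as "12" does not match, so an unlabeled page number has to be caught by one of the other signals. And a body sentence containing something like "3 of 27" would match, which is one reason the pattern only counts inside the positional zone.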
Combining the Signals
A region is reclassified as HEADER or FOOTER if it is in the top/bottom positional zone AND at least one of (smallFont OR patternMatch OR inColoredBand) is true.
A region in the zone is rejected only when all three signals fail, which is what happens for the usual false positives: figure captions are not small-font, pull quotes have no page-number pattern, and content images generate no colored band signal.
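The combination rule can be sketched as a small pure function. This is an illustration under assumptions: the post does not state the zone size used by the post-pass, so zoneFraction (default 0.15) is a hypothetical parameter, and the Signals shape and the BODY enum member are invented here.

```typescript
// Only HEADER and FOOTER appear in the post; BODY is a placeholder member.
enum RegionType { BODY = 'BODY', HEADER = 'HEADER', FOOTER = 'FOOTER' }

interface Signals {
  smallFont: boolean;
  inColoredBand: boolean;
  patternMatch: boolean;
}

// Returns the new type, or null when the region should keep its current type.
function classifyEdgeRegion(
  yCenter: number,
  vpH: number,
  s: Signals,
  zoneFraction = 0.15, // assumed value, not taken from the repo
): RegionType | null {
  const inTop = yCenter < vpH * zoneFraction;
  const inBottom = yCenter > vpH * (1 - zoneFraction);
  if (!inTop && !inBottom) return null; // position is necessary

  const anySignal = s.smallFont || s.patternMatch || s.inColoredBand;
  if (!anySignal) return null;          // but position alone is not sufficient

  return inTop ? RegionType.HEADER : RegionType.FOOTER;
}
```

Because position is no longer the only test, the zone can afford to be more generous than the 8% cutoff in the naive version: a region that merely sits near the edge stays body content unless a signal confirms it.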
The Critical Bug: columnIndex Must Reset
The header/footer post-pass runs after column detection. Regions already had valid columnIndex values (0, 1, etc.) assigned from the multi-column pass. When a region was reclassified as HEADER or FOOTER, the type changed but columnIndex stayed at its assigned column.
The result: headers appeared inside column grid divs, rendered as .pdf-col--left children of the multi-column row.
The fix is one line, the columnIndex reset that follows the type assignment:
r.type = inTop ? RegionType.HEADER : RegionType.FOOTER;
r.columnIndex = -1; // MUST reset — post-pass runs after column detection
columnIndex = -1 is the assembler's signal to render the region full-width, outside any column container.
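To make the failure mode concrete, here is a sketch of how an assembler might split regions by columnIndex. The Region shape, splitForLayout, and the region ids are hypothetical; the real assembler is not shown in this post. Only the convention that -1 means full-width comes from the text above.

```typescript
// Only HEADER and FOOTER appear in the post; BODY is a placeholder member.
enum RegionType { BODY = 'BODY', HEADER = 'HEADER', FOOTER = 'FOOTER' }

interface Region {
  id: string;
  type: RegionType;
  columnIndex: number; // -1 means full-width, outside any column container
}

function reclassify(r: Region, inTop: boolean): void {
  r.type = inTop ? RegionType.HEADER : RegionType.FOOTER;
  r.columnIndex = -1; // without this, the region stays in its old column bucket
}

// Hypothetical assembler step: full-width regions vs. per-column buckets.
function splitForLayout(regions: Region[]): {
  fullWidth: Region[];
  columns: Map<number, Region[]>;
} {
  const fullWidth: Region[] = [];
  const columns = new Map<number, Region[]>();
  for (const r of regions) {
    if (r.columnIndex === -1) {
      fullWidth.push(r);
      continue;
    }
    const bucket = columns.get(r.columnIndex) ?? [];
    bucket.push(r);
    columns.set(r.columnIndex, bucket);
  }
  return { fullWidth, columns };
}

// Column detection has already run and put the running title in column 0.
const runningTitle: Region = { id: 'running-title', type: RegionType.BODY, columnIndex: 0 };
const pageRegions: Region[] = [
  runningTitle,
  { id: 'left-text', type: RegionType.BODY, columnIndex: 0 },
  { id: 'right-text', type: RegionType.BODY, columnIndex: 1 },
];

reclassify(runningTitle, true);
const layout = splitForLayout(pageRegions);
```

After the reset, the running title lands in fullWidth and column 0 holds only the left text. Removing the columnIndex line from reclassify reproduces the bug: the header has the right type but is still bucketed under column 0.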
Stray Glyph Filter
A single character, often the "I" from a page number like "I-1" that was split across lines, was matching the small-font heuristic and being classified as a HEADER. The fix rejects sparse regions:
const nonSpaceLen = regionText.replace(/\s/g, '').length;
if (nonSpaceLen < 2 && !patternMatch) continue;
A region with fewer than 2 non-space characters that does not match the page-number pattern is skipped entirely.
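Pulled out as a predicate, the filter is easy to check in isolation. The function name is an assumption; the logic is the two lines above.

```typescript
// True when the region is too sparse to be a meaningful header or footer.
function isStrayGlyph(regionText: string, patternMatch: boolean): boolean {
  const nonSpaceLen = regionText.replace(/\s/g, '').length;
  return nonSpaceLen < 2 && !patternMatch;
}

console.log(isStrayGlyph('I', false));     // lone glyph: skipped
console.log(isStrayGlyph('  I  ', false)); // whitespace does not count toward length
console.log(isStrayGlyph('12', false));    // two characters: kept for the other signals
```

One observation, assuming patternMatch is computed from the same regionText with the Signal 3 regex: every alternative in that regex needs more than one non-space character, so a region this short cannot match today. The !patternMatch clause then acts as a safeguard in case a shorter pattern is added later.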