Extracting Bold, Italic, and Underline from PDFs Without Guessing
TLDR: Build a fontStyleMap from page.commonObjs in the geometry worker. Each font name resolves to {bold, italic} flags from the actual parsed font descriptor. Merge onto textMeta items. Check transform[2] (shear component) for synthetic italic. textRebuilder wraps styled runs in <strong>/<em>/<u>.
Repo: tools/pdf-processor
The Problem
PDF.js gives each text item a fontName string. These look like ABCDEF+TimesNewRomanPS-BoldMT — a 6-character subset prefix followed by a PostScript-style variant name.
The prefix is regenerated on every export, so it carries no style information and has to be stripped before any name matching.
The Naive Attempt
Strip the prefix, then regex-match the remainder:
/bold|heavy|black/i.test(name.replace(/^[A-Z]{6}\+/, ''))
This works for well-named fonts. It fails silently for synthetic fonts named Font12 or F1, fonts with non-English variant names, and PDFs where the exporter normalized the font name.
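To make the failure mode concrete, here is a small sketch of the naive check run against a few font names. The names are illustrative examples made up for this demo, not taken from a real PDF, and naiveIsBold is a hypothetical helper wrapping the one-liner above.

```javascript
// Naive name-based detection: strip the subset prefix, then regex the rest.
function naiveIsBold(fontName) {
  const stripped = fontName.replace(/^[A-Z]{6}\+/, '');
  return /bold|heavy|black/i.test(stripped);
}

// Illustrative font names (assumptions for this demo only).
const samples = [
  'ABCDEF+TimesNewRomanPS-BoldMT', // descriptive name: detected
  'GHIJKL+Font12',                 // synthetic name: nothing to match
  'F1',                            // synthetic name, no prefix at all
  'MNOPQR+Helvetica-Fett',         // non-English variant name
];

const naiveResults = samples.map(name => ({ name, bold: naiveIsBold(name) }));
console.log(naiveResults);
```

Only the first name is reported as bold. The other three return false without any error, which is why the failure goes unnoticed.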
page.commonObjs: The Authoritative Source
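PDF.js exposes parsed font objects through page.commonObjs. Each font object can carry .bold and .italic boolean properties set by the PDF.js font parser rather than by a regex of ours, so they are the first signal to trust. Font objects are only resolved once the page's operator list has been loaded (for example after page.getOperatorList() or a render), so build the map after that step.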
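// Map each font id (item.fontName) to its style flags.
const fontStyleMap = {};
const uniqueFontNames = [...new Set(textContent.items.map(i => i.fontName).filter(Boolean))];
for (const fn of uniqueFontNames) {
  // commonObjs.get() throws for an id that is not resolved yet, so check has() first.
  if (!page.commonObjs.has(fn)) continue;
  const obj = page.commonObjs.get(fn);
  if (!obj) continue;
  // Strip the subset prefix (e.g. "ABCDEF+") before the name-based fallback.
  const cleaned = (obj.name || fn).replace(/^[A-Z]{6}\+/, '');
  fontStyleMap[fn] = {
    bold: !!obj.bold || /bold|heavy|black/i.test(cleaned),
    italic: !!obj.italic || /italic|oblique|slanted/i.test(cleaned),
  };
}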
The || fallback: trust the font object first, fall back to the cleaned name for fonts where .bold is not set by the parser.
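The TLDR mentions merging the map onto textMeta items. That step is not shown in this post, so the following is only a minimal sketch of what it could look like. applyFontStyles, the item shape, and the font ids are assumptions for illustration, not the repo's actual code.

```javascript
// Hypothetical helper: copy style flags from fontStyleMap onto each text item.
// Items whose font is missing from the map default to unstyled.
function applyFontStyles(items, fontStyleMap) {
  return items.map(item => {
    const style = fontStyleMap[item.fontName] || { bold: false, italic: false };
    // Return a copy so the original items are left untouched.
    return { ...item, bold: style.bold, italic: style.italic };
  });
}

// Illustrative data; the font ids are made up.
const exampleMap = { f_bold: { bold: true, italic: false } };
const exampleItems = [
  { str: 'WARNING', fontName: 'f_bold' },
  { str: ' Do not proceed', fontName: 'f_regular' },
];

const styledItems = applyFontStyles(exampleItems, exampleMap);
console.log(styledItems);
```

Defaulting unknown fonts to unstyled keeps a missing font object from turning into an exception further down the pipeline.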
Synthetic Italic: The Shear Transform
Some PDFs produce italic-looking text by applying a shear matrix to an upright font rather than loading an actual italic variant.
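The text item's transform array is [a, b, c, d, e, f]. For unrotated text, the c component is the horizontal shear: a non-zero c means the glyphs are slanted. Rotated text also has a non-zero c, so the check below assumes upright text.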
const syntheticItalic = Math.abs(item.transform[2]) > 0.01;
if (syntheticItalic) meta.italic = true;
This catches faux italic rendering that the font object alone would miss.
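One caveat: a plain rotation also puts a non-zero value in c, so the threshold check above would flag rotated text as italic. A rotation-tolerant variant, sketched below under the assumption that the transform is a standard 2D affine matrix, tests whether the two axis vectors (a, b) and (c, d) are still perpendicular. hasShear and its tolerance value are hypothetical and not part of the repo.

```javascript
// Detect shear independently of rotation and font size.
// (a, b) is the transformed x-axis and (c, d) the transformed y-axis.
// Rotation and scaling keep them perpendicular; shear does not.
function hasShear(transform, tolerance = 0.05) {
  const [a, b, c, d] = transform;
  const lenX = Math.hypot(a, b);
  const lenY = Math.hypot(c, d);
  if (lenX === 0 || lenY === 0) return false; // degenerate matrix
  // Cosine of the angle between the axes: 0 when they are perpendicular.
  const cosBetween = (a * c + b * d) / (lenX * lenY);
  return Math.abs(cosBetween) > tolerance;
}

const upright = [12, 0, 0, 12, 72, 700];
const sheared = [12, 0, 2.5, 12, 72, 700];
const angle = Math.PI / 6; // 30 degree rotation, no shear
const rotated = [
  12 * Math.cos(angle), 12 * Math.sin(angle),
  -12 * Math.sin(angle), 12 * Math.cos(angle),
  72, 700,
];

console.log(hasShear(upright), hasShear(sheared), hasShear(rotated));
```

The tolerance of 0.05 corresponds to roughly three degrees of slant, which is an arbitrary starting point and would need tuning against real documents.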
Underlines: Vector Segment Pairing
Underlines in PDFs are separate vector line segments drawn beneath text — not a font property. ctmAdapter.js classifies horizontal segments with a Y position within ~0.35× font size below a text item as underlines. The textMeta item receives underlined: true.
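The pairing rule can be sketched as a single predicate. This is not the code in ctmAdapter.js. The segment shape { x1, y1, x2, y2 }, the helper name isUnderlineFor, and the flatness tolerance are assumptions, and the sketch only handles unrotated text in PDF user space, where y grows upward.

```javascript
// Hypothetical check: does this line segment look like an underline for this text item?
// item = { transform: [a, b, c, d, x, y], width }, segment = { x1, y1, x2, y2 },
// both in the same coordinate space.
function isUnderlineFor(segment, item, maxDropRatio = 0.35) {
  const [, , c, d, x, baselineY] = item.transform;
  const fontSize = Math.hypot(c, d);
  if (fontSize === 0) return false;
  // Only (near-)horizontal segments qualify.
  if (Math.abs(segment.y1 - segment.y2) > 0.05 * fontSize) return false;
  // "Below" means a smaller y because y grows upward in PDF user space.
  const drop = baselineY - (segment.y1 + segment.y2) / 2;
  if (drop < 0 || drop > maxDropRatio * fontSize) return false;
  // Require horizontal overlap between the segment and the text item.
  const segLeft = Math.min(segment.x1, segment.x2);
  const segRight = Math.max(segment.x1, segment.x2);
  return segRight > x && segLeft < x + item.width;
}

const sampleItem = { transform: [12, 0, 0, 12, 100, 500], width: 60 };
const underline = { x1: 100, y1: 498.5, x2: 160, y2: 498.5 };
const strikeLike = { x1: 100, y1: 505, x2: 160, y2: 505 };

console.log(isUnderlineFor(underline, sampleItem), isUnderlineFor(strikeLike, sampleItem));
```

Rejecting segments above the baseline keeps strikethrough lines and table rules above the text from being classified as underlines.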
Propagating to the Renderer
Flags flow: geometryWorker → textMeta → _scopeItems → textItems passed to textRebuilder. The rebuilder groups consecutive same-style items into runs:
function _wrapInlineStyle(text, style) {
  let html = _escHtml(text);
  // Wrap innermost first: <u>, then <em>, then <strong>.
  if (style.underlined) html = `<u>${html}</u>`;
  if (style.italic) html = `<em>${html}</em>`;
  if (style.bold) html = `<strong>${html}</strong>`;
  return html;
}
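Nesting order: underline innermost, bold outermost, so a fully styled run becomes <strong><em><u>text</u></em></strong>. Browsers render any nesting order the same way; a fixed order simply keeps the output deterministic and easy to diff.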
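The grouping of consecutive same-style items into runs is not shown in this post. A minimal sketch of that step, assuming items carry str plus the bold, italic, and underlined flags described earlier, might look like this. groupStyledRuns and styleKey are hypothetical names.

```javascript
// Build a compact key such as "100" (bold only) to compare styles cheaply.
function styleKey(s) {
  return `${+!!s.bold}${+!!s.italic}${+!!s.underlined}`;
}

// Merge consecutive items that share the same style into one run.
function groupStyledRuns(items) {
  const runs = [];
  for (const item of items) {
    const last = runs[runs.length - 1];
    if (last && styleKey(last.style) === styleKey(item)) {
      last.text += item.str; // same style: extend the current run
    } else {
      runs.push({
        text: item.str,
        style: { bold: !!item.bold, italic: !!item.italic, underlined: !!item.underlined },
      });
    }
  }
  return runs;
}

const runs = groupStyledRuns([
  { str: 'WARNING', bold: true },
  { str: ' Do not ' },
  { str: 'proceed', bold: false },
  { str: ' now', italic: true },
]);
console.log(runs);
```

Each run can then be passed to _wrapInlineStyle(run.text, run.style), which emits one set of tags per run instead of one per text item.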
Result
A PDF line reading "WARNING Do not proceed without reading the SAFETY section", with WARNING set in bold and SAFETY in italic, might produce:
<strong>WARNING</strong> Do not proceed without reading the <em>SAFETY</em> section
No OCR pass. No ML font classifier. Just font metadata that was already inside the PDF.