Pdf Processor

Extracting Bold, Italic, and Underline from PDFs Without Guessing

2026-05-15

TLDR: Build a fontStyleMap from page.commonObjs in the geometry worker. Each font name resolves to {bold, italic} flags from the actual parsed font descriptor. Merge onto textMeta items. Check transform[2] (shear component) for synthetic italic. textRebuilder wraps styled runs in <strong>/<em>/<u>.

Repo: tools/pdf-processor

The Problem

PDF.js gives each text item a fontName string. These look like ABCDEF+TimesNewRomanPS-BoldMT, which is a 6-character subset prefix followed by a PostScript-style variant name.

The prefix is regenerated on every export, so any name-based match has to strip it before looking at the family or variant.

The Naive Attempt

Strip the prefix, then regex-match the remainder:

/bold|heavy|black/i.test(name.replace(/^[A-Z]{6}\+/, ''))

This works for well-named fonts. It fails silently for synthetic fonts named Font12 or F1, fonts with non-English variant names, and PDFs where the exporter normalized the font name.
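To make the failure mode concrete, here is a minimal sketch of the name-only approach. The helper name naiveStyleFromName and the sample font names are illustrative assumptions, not code from the repo.

```javascript
// Hypothetical helper: guess style flags from the font name alone.
function naiveStyleFromName(fontName) {
    // Drop a 6-letter subset prefix such as "ABCDEF+" if present.
    const cleaned = fontName.replace(/^[A-Z]{6}\+/, '');
    return {
        bold: /bold|heavy|black/i.test(cleaned),
        italic: /italic|oblique|slanted/i.test(cleaned),
    };
}

// A well-named font is classified correctly.
console.log(naiveStyleFromName('ABCDEF+TimesNewRomanPS-BoldMT'));

// An opaque name carries no style hint, so a bold face named "F1"
// is silently reported as regular.
console.log(naiveStyleFromName('F1'));
```

The second call returns both flags as false without raising an error, so a name-only classifier fails without any visible signal.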

page.commonObjs: The Authoritative Source

PDF.js exposes parsed font objects through page.commonObjs. Each font object has .bold and .italic boolean properties computed from the font's actual glyph metrics and descriptor, not its name. This is the ground truth.

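// Fonts land in page.commonObjs only once the page's operator list has been
// processed (for example after page.getOperatorList() or a render).
const fontStyleMap = {};
const uniqueFontNames = [...new Set(textContent.items.map(i => i.fontName).filter(Boolean))];
for (const fn of uniqueFontNames) {
    // commonObjs.get() throws for objects that are not resolved yet, so guard with has().
    if (!page.commonObjs.has(fn)) continue;
    const obj = page.commonObjs.get(fn);
    if (!obj) continue;
    // Strip the subset prefix before the name-based fallback.
    const cleaned = (obj.name || fn).replace(/^[A-Z]{6}\+/, '');
    fontStyleMap[fn] = {
        bold:   !!obj.bold   || /bold|heavy|black/i.test(cleaned),
        italic: !!obj.italic || /italic|oblique|slanted/i.test(cleaned),
    };
}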

The || fallback: trust the font object first, then fall back to the cleaned name for fonts where the parser did not set .bold or .italic.
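Merging the map onto the text items could look like the sketch below. The function name mergeFontStyles and the item shape ({ text, fontName }) are assumptions for illustration; the real textMeta items in the repo may carry different fields.

```javascript
// Hypothetical merge step: copy resolved style flags onto each text item.
// Items whose font is missing from the map default to regular (no flags).
function mergeFontStyles(textMetaItems, fontStyleMap) {
    return textMetaItems.map((item) => {
        const style = fontStyleMap[item.fontName];
        return {
            ...item,
            bold: Boolean(style && style.bold),
            italic: Boolean(style && style.italic),
        };
    });
}

// Usage with made-up data:
const styled = mergeFontStyles(
    [
        { text: 'WARNING', fontName: 'f1' },
        { text: 'Do not proceed', fontName: 'f2' },
    ],
    { f1: { bold: true, italic: false } }
);
console.log(styled);
```

Returning new objects rather than mutating the inputs keeps the step easy to test in isolation.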

Synthetic Italic: The Shear Transform

Some PDFs produce italic-looking text by applying a shear matrix to an upright font rather than loading an actual italic variant.

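The text item's transform array is [a, b, c, d, e, f]. The c component carries horizontal shear. For unrotated text, a non-zero c means the glyphs are slanted. Rotated text also has a non-zero c (together with a non-zero b), so the check below assumes upright lines.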

const syntheticItalic = Math.abs(item.transform[2]) > 0.01;
if (syntheticItalic) meta.italic = true;

This catches faux italic rendering that the font object alone would miss.
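A rotated line also produces a non-zero c, so a rotation-tolerant variant may be useful. The sketch below is an assumption, not the repo's code: it measures how far the two axes of the matrix are from perpendicular, which is zero for pure rotation and scaling and equals the tangent of the slant angle for a horizontal shear. The names shearTangent and isSyntheticItalic and the 0.05 threshold are hypothetical.

```javascript
// Tangent of the skew between the glyph axes of [a, b, c, d, e, f].
// Pure rotation/scale gives 0; a horizontal shear gives tan(slant angle).
function shearTangent(transform) {
    const [a, b, c, d] = transform;
    const det = a * d - b * c;
    if (Math.abs(det) < 1e-9) return 0; // degenerate matrix, no usable slant
    return (a * c + b * d) / det;
}

// Hypothetical threshold: 0.05 is roughly a 3 degree slant.
function isSyntheticItalic(transform, minTangent = 0.05) {
    return Math.abs(shearTangent(transform)) > minTangent;
}

// Usage with made-up matrices (12pt text):
console.log(isSyntheticItalic([12, 0, 0, 12, 72, 700]));    // upright
console.log(isSyntheticItalic([12, 0, 2.55, 12, 72, 700])); // sheared about 12 degrees
console.log(isSyntheticItalic([0, 12, -12, 0, 72, 700]));   // rotated 90 degrees, not sheared
```

Because the value is normalized by the determinant, it does not grow with font size the way a raw transform[2] comparison does.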

Underlines: Vector Segment Pairing

Underlines in PDFs are separate vector line segments drawn beneath text, not a font property. ctmAdapter.js classifies horizontal segments with a Y position within ~0.35× font size below a text item as underlines. The textMeta item receives underlined: true.
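The pairing rule can be sketched as follows. This is not the code in ctmAdapter.js; the function name markUnderlines, the item shape ({ x, y, width, fontSize } with y as the baseline in PDF coordinates, where y grows upward), the segment shape ({ x1, y1, x2, y2 }), and the 50% horizontal-overlap requirement are all assumptions for illustration.

```javascript
// Hypothetical pairing: flag a text item as underlined when a horizontal
// segment sits just below its baseline and overlaps it horizontally.
function markUnderlines(items, segments, maxDropRatio = 0.35) {
    // Keep only segments that are (nearly) horizontal.
    const horizontal = segments.filter((s) => Math.abs(s.y1 - s.y2) < 0.5);
    return items.map((item) => {
        const underlined = horizontal.some((s) => {
            const drop = item.y - s.y1; // positive when the segment is below the baseline
            const left = Math.min(s.x1, s.x2);
            const right = Math.max(s.x1, s.x2);
            const overlap = Math.min(right, item.x + item.width) - Math.max(left, item.x);
            return drop >= 0
                && drop <= maxDropRatio * item.fontSize
                && overlap >= 0.5 * item.width;
        });
        return { ...item, underlined };
    });
}

// Usage with made-up geometry:
const marked = markUnderlines(
    [{ text: 'SAFETY', x: 100, y: 700, width: 50, fontSize: 12 }],
    [{ x1: 100, y1: 698, x2: 150, y2: 698 }]
);
console.log(marked[0].underlined);
```

The overlap requirement is there so that table rules or separators that merely pass near a word are less likely to be mistaken for underlines.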

Propagating to the Renderer

Flags flow: geometryWorker → textMeta → _scopeItems → textItems passed to textRebuilder. The rebuilder groups consecutive same-style items into runs:

function _wrapInlineStyle(text, style) {
    let html = _escHtml(text);
    if (style.underlined) html = `<u>${html}</u>`;
    if (style.italic)     html = `<em>${html}</em>`;
    if (style.bold)       html = `<strong>${html}</strong>`;
    return html;
}

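Nesting order: each wrap encloses the previous one, so underline ends up innermost and bold outermost. All three are inline elements that render the same in any nesting order; a fixed order simply keeps the output deterministic.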
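The grouping step that feeds the wrapper could be sketched like this. The function name groupStyleRuns and the item shape ({ text, bold, italic, underlined }) are assumptions; the rebuilder in the repo may track more state per item.

```javascript
// Hypothetical grouping: merge consecutive items that share the same
// bold/italic/underlined flags into a single run.
function groupStyleRuns(items) {
    const runs = [];
    for (const item of items) {
        const style = {
            bold: Boolean(item.bold),
            italic: Boolean(item.italic),
            underlined: Boolean(item.underlined),
        };
        const last = runs[runs.length - 1];
        const sameStyle = last
            && last.style.bold === style.bold
            && last.style.italic === style.italic
            && last.style.underlined === style.underlined;
        if (sameStyle) {
            last.text += item.text; // extend the current run
        } else {
            runs.push({ text: item.text, style }); // start a new run
        }
    }
    return runs;
}

// Usage with made-up items:
const runs = groupStyleRuns([
    { text: 'WARNING', bold: true },
    { text: ' Do not' },
    { text: ' proceed' },
    { text: ' SAFETY', italic: true },
]);
console.log(runs.map((r) => r.text));
```

Each run can then be passed to a wrapper such as _wrapInlineStyle once, which avoids emitting a separate tag pair for every text item.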

Result

A PDF line reading "WARNING Do not proceed without reading the SAFETY section", with WARNING set in bold and SAFETY in italic, might produce:

<strong>WARNING</strong> Do not proceed without reading the <em>SAFETY</em> section

No OCR pass. No ML font classifier. Just font metadata that was already inside the PDF.
