Engineering Journal
Pdf Processor

Extracting Bold, Italic, and Underline from PDFs Without Guessing

2026-05-15

TLDR: Build a fontStyleMap from page.commonObjs in the geometry worker. Each font name resolves to {bold, italic} flags from the actual parsed font descriptor. Merge onto textMeta items. Check transform[2] (shear component) for synthetic italic. textRebuilder wraps styled runs in <strong>/<em>/<u>.

Repo: tools/pdf-processor

The Problem

PDF.js gives each text item a fontName string. These look like ABCDEF+TimesNewRomanPS-BoldMT — a 6-character subset prefix followed by a PostScript-style variant name.

The prefix changes on every export, so any pattern match against the name has to strip it first.


The Naive Attempt

Strip the prefix, then regex-match the remainder:

/bold|heavy|black/i.test(name.replace(/^[A-Z]{6}\+/, ''))

This works for well-named fonts. It fails silently for synthetic fonts named Font12 or F1, fonts with non-English variant names, and PDFs where the exporter normalized the font name.
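To make the failure mode concrete, here is a small sketch of the name-only check. The helper name naiveIsBold and the sample font names are illustrative assumptions, not code from the repo.

```javascript
// Name-only bold detection, as in the naive attempt above.
// naiveIsBold is a hypothetical helper name used for illustration.
function naiveIsBold(name) {
    const cleaned = name.replace(/^[A-Z]{6}\+/, ''); // drop the subset prefix
    return /bold|heavy|black/i.test(cleaned);
}

// Illustrative inputs: one well-named font and two opaque names.
const samples = ['ABCDEF+TimesNewRomanPS-BoldMT', 'Font12', 'F1'];
for (const name of samples) {
    console.log(name, naiveIsBold(name));
}
```

The opaque names report false whether or not their glyphs are actually bold, and nothing signals that the check had no information to work with.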


page.commonObjs: The Authoritative Source

PDF.js exposes parsed font objects through page.commonObjs. Each font object has .bold and .italic boolean properties computed from the font's actual glyph metrics and descriptor — not its name. This is the ground truth.

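// Fonts are typically present in commonObjs only after page.getOperatorList()
// or a render has completed for this page.
const fontStyleMap = {};
const uniqueFontNames = [...new Set(textContent.items.map(i => i.fontName).filter(Boolean))];
for (const fn of uniqueFontNames) {
    // get() throws on an unresolved object, so guard with has() first.
    if (!page.commonObjs.has(fn)) continue;
    const obj = page.commonObjs.get(fn);
    if (!obj) continue;
    const cleaned = (obj.name || fn).replace(/^[A-Z]{6}\+/, '');
    fontStyleMap[fn] = {
        // Prefer the parsed flags; fall back to the cleaned name.
        bold:   !!obj.bold   || /bold|heavy|black/i.test(cleaned),
        italic: !!obj.italic || /italic|oblique|slanted/i.test(cleaned),
    };
}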

The || fallback: trust the font object first, fall back to the cleaned name for fonts where .bold is not set by the parser.
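The merge onto textMeta items can be sketched as below. This is a minimal illustration, assuming items are plain objects with a fontName field; mergeFontStyles and the sample data are hypothetical names, not the repo's actual code.

```javascript
// Hypothetical merge step: copy style flags from fontStyleMap onto each item.
// Items whose font is missing from the map default to unstyled.
function mergeFontStyles(items, fontStyleMap) {
    return items.map(item => {
        const style = fontStyleMap[item.fontName] || {};
        return { ...item, bold: !!style.bold, italic: !!style.italic };
    });
}

// Illustrative data: one font resolved to bold, one font absent from the map.
const exampleMap = {
    'ABCDEF+TimesNewRomanPS-BoldMT': { bold: true, italic: false },
};
const exampleItems = [
    { str: 'WARNING', fontName: 'ABCDEF+TimesNewRomanPS-BoldMT' },
    { str: ' Do not proceed', fontName: 'Font12' },
];
const merged = mergeFontStyles(exampleItems, exampleMap);
console.log(merged);
```

Returning new objects rather than mutating the input keeps the original text items intact for any later pass that needs them.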


Synthetic Italic: The Shear Transform

Some PDFs produce italic-looking text by applying a shear matrix to an upright font rather than loading an actual italic variant.

The text item's transform array is [a, b, c, d, e, f]. The c component is horizontal shear. A non-zero c means the glyphs are slanted.

const syntheticItalic = Math.abs(item.transform[2]) > 0.01;
if (syntheticItalic) meta.italic = true;

This catches faux italic rendering that the font object alone would miss.
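One caveat with a raw threshold on c: the transform is also scaled by font size, and rotated text has a non-zero c as well. A more defensive variant, sketched here under those assumptions, normalizes the shear and skips rotated text. isSyntheticItalic and its thresholds are hypothetical, not the repo's code.

```javascript
// Sketch: distinguish a shear (faux italic) from rotation and font-size scaling.
// isSyntheticItalic and its thresholds are assumptions, not the repo's code.
function isSyntheticItalic(transform, minSlant = 0.1) {
    const [a, b, c, d] = transform;
    if (a === 0 || d === 0) return false;    // degenerate or 90-degree rotated text
    const rotated = Math.abs(b / a) > 0.01;  // rotation puts a non-zero value in b
    const slant = Math.abs(c / d);           // shear relative to the vertical scale
    return !rotated && slant >= minSlant;
}

// Illustrative 12pt transforms: upright, sheared by about 12 degrees, rotated by 30 degrees.
const upright = [12, 0, 0, 12, 72, 700];
const sheared = [12, 0, 2.55, 12, 72, 700];
const rotated30 = [10.392, 6, -6, 10.392, 72, 700];
console.log(isSyntheticItalic(upright), isSyntheticItalic(sheared), isSyntheticItalic(rotated30));
```

The default minSlant of 0.1 corresponds to a slant of roughly 6 degrees; the right value depends on the documents being processed.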


Underlines: Vector Segment Pairing

Underlines in PDFs are separate vector line segments drawn beneath text — not a font property. ctmAdapter.js classifies horizontal segments with a Y position within ~0.35× font size below a text item as underlines. The textMeta item receives underlined: true.
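The pairing rule can be sketched as follows. This is not the code in ctmAdapter.js; the segment and item shapes, the function names, and the horizontal-overlap requirement are all assumptions made for illustration.

```javascript
// Sketch of underline pairing. The shapes below are assumptions:
//   segment: { x1, y1, x2, y2 } in PDF user space (y grows upward)
//   item:    { x, y, width, fontSize } where y is the text baseline
function isUnderlineFor(segment, item, maxDropRatio = 0.35) {
    const horizontal = Math.abs(segment.y1 - segment.y2) < 0.5;
    if (!horizontal) return false;
    const drop = item.y - segment.y1; // distance below the baseline
    if (drop < 0 || drop > maxDropRatio * item.fontSize) return false;
    // Require horizontal overlap between the segment and the text box.
    const segLeft = Math.min(segment.x1, segment.x2);
    const segRight = Math.max(segment.x1, segment.x2);
    return segLeft < item.x + item.width && segRight > item.x;
}

// Flag every item that has at least one qualifying segment beneath it.
function markUnderlines(items, segments) {
    return items.map(item => ({
        ...item,
        underlined: segments.some(seg => isUnderlineFor(seg, item)),
    }));
}
```

Rejecting segments above the baseline keeps strikethrough lines and table rules that sit over the text from being classified as underlines.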


Propagating to the Renderer

Flags flow: geometryWorker → textMeta → _scopeItems → textItems passed to textRebuilder. The rebuilder groups consecutive same-style items into runs:

function _wrapInlineStyle(text, style) {
    let html = _escHtml(text);
    if (style.underlined) html = `<u>${html}</u>`;
    if (style.italic)     html = `<em>${html}</em>`;
    if (style.bold)       html = `<strong>${html}</strong>`;
    return html;
}

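Nesting order: each wrap encloses the previous one, so bold ends up outermost and underline innermost. The order does not change how browsers render the text, but keeping it fixed makes the output deterministic.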
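The run-grouping step that feeds the wrapper could look like the sketch below. groupStyledRuns is a hypothetical name, and the item shape ({ str, bold, italic, underlined }) is an assumption about what textRebuilder receives.

```javascript
// Sketch: group consecutive items that share the same style flags into runs.
// groupStyledRuns is a hypothetical name; items carry { str, bold, italic, underlined }.
function groupStyledRuns(items) {
    const runs = [];
    for (const item of items) {
        const style = {
            bold: !!item.bold,
            italic: !!item.italic,
            underlined: !!item.underlined,
        };
        const last = runs[runs.length - 1];
        const sameStyle = last
            && last.style.bold === style.bold
            && last.style.italic === style.italic
            && last.style.underlined === style.underlined;
        if (sameStyle) {
            last.text += item.str; // extend the current run
        } else {
            runs.push({ text: item.str, style });
        }
    }
    return runs;
}
```

Each run's text and style can then be passed to the wrapper once, which avoids emitting a separate tag pair for every text item in a styled phrase.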


Result

A PDF line reading "WARNING Do not proceed without reading the SAFETY section", with WARNING set in bold and SAFETY in italic, might produce:

<strong>WARNING</strong> Do not proceed without reading the <em>SAFETY</em> section

No OCR pass. No ML font classifier. Just font metadata that was already inside the PDF.
