Pdf Processor

Most Diff UIs Are Just Colored Spans. That Is Not a Diff.

2026-06-03

TLDR

Most browser diff implementations color changed text with CSS spans and call it done. The result looks like a diff but does not work like one: no line numbers, no row-level context, no synchronized panes, no collapsed unchanged blocks. Building a real line-by-line diff requires a different algorithm structure, not better CSS.

Repo: tools/pdf-processor

What the Industry Does

Most diff UIs use jsdiff or a similar library, call diffWords() or diffChars() on the full document text, and wrap each part in a <span class="added"> or <span class="removed">. Paste the result into two panes. Ship it.
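A minimal sketch of that pattern, assuming the parts have the { value, added, removed } shape that jsdiff's diffWords() returns. The renderSpans and escapeHtml helpers are hypothetical names for illustration, and the parts are written by hand so the snippet runs without the library.

```javascript
// Hypothetical helper: escape text before placing it inside HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: wrap each change part in a colored span.
// `parts` is assumed to look like the output of diffWords():
// an array of { value, added?, removed? } objects.
function renderSpans(parts) {
  return parts
    .map((part) => {
      const text = escapeHtml(part.value);
      if (part.added) return `<span class="added">${text}</span>`;
      if (part.removed) return `<span class="removed">${text}</span>`;
      return text;
    })
    .join('');
}

// Hand-written parts standing in for a word diff of
// 'the old title' against 'the new title'.
const wordParts = [
  { value: 'the ' },
  { value: 'old', removed: true },
  { value: 'new', added: true },
  { value: ' title' },
];

const spanHtml = renderSpans(wordParts);
```

The output is one flat string of text and spans. Nothing in it records where a line starts or which line on the other side it belongs to.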

This works for short inputs. For multi-line documents it breaks down: the two panes get out of sync because removed lines in the left pane do not have corresponding empty rows in the right pane, so removed text on one side and its replacement on the other sit at different vertical positions. Line numbers are impossible to add. The user cannot scan corresponding lines across the two panes because there is no correspondence.
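The drift is easy to see by counting rows. The sketch below assumes the same { value, added, removed } part shape and a hypothetical paneRowCount helper: the left pane shows unchanged plus removed text, the right pane unchanged plus added text.

```javascript
// Hypothetical helper: count how many visual rows one pane ends up with.
// 'left' shows unchanged + removed parts; 'right' shows unchanged + added parts.
function paneRowCount(parts, side) {
  const text = parts
    .filter((part) => (side === 'left' ? !part.added : !part.removed))
    .map((part) => part.value)
    .join('');
  return text.split('\n').length;
}

// Hand-written parts: two lines were deleted between "intro" and "outro".
const deletionParts = [
  { value: 'intro\n' },
  { value: 'old line 1\nold line 2\n', removed: true },
  { value: 'outro' },
];

const leftRows = paneRowCount(deletionParts, 'left');   // intro, old line 1, old line 2, outro
const rightRows = paneRowCount(deletionParts, 'right'); // intro, outro
```

Here the left pane has four rows and the right pane has two, so everything after the deletion sits two rows lower on the left than on the right.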

Why It Fails for Document Diff

The span-based approach treats the diff as a flat sequence of additions and removals. That model is correct for the algorithm but wrong for the display.

A document diff must be line-oriented. Readers scan documents line by line. When line 47 changed, the reader needs to see line 47 on both sides at the same vertical position, with line numbers confirming it is the same line.

A span-based renderer cannot produce this. It does not know about lines. It produces a stream of colored tokens that happens to contain newlines.

The Better Approach

Two passes instead of one.

Pass 1: line-level diff. Use diffArrays on the split lines of both documents. This gives you three kinds of output: equal lines (same on both sides), removed lines (only on the left), and added lines (only on the right).
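To make the shape of that output concrete, here is a small LCS-based stand-in that returns parts in the same { value, added, removed } shape, with value holding an array of lines. It is not jsdiff's implementation, and diffLinesSketch is a hypothetical name; in practice a diffArrays call on the two split-line arrays would take its place.

```javascript
// Stand-in for a line-level diff, built on a plain LCS table (O(n*m) memory).
// Returns parts shaped like { value: string[], added?: true, removed?: true }.
function diffLinesSketch(oldText, newText) {
  const a = oldText.split('\n');
  const b = newText.split('\n');

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  const parts = [];
  // Append a line to the last part if it has the same kind, else start a new part.
  const push = (kind, line) => {
    const last = parts[parts.length - 1];
    if (last && last.kind === kind) {
      last.value.push(line);
    } else {
      parts.push({ kind, value: [line] });
    }
  };

  let i = 0;
  let j = 0;
  while (i < a.length || j < b.length) {
    if (i < a.length && j < b.length && a[i] === b[j]) {
      push('equal', a[i]);
      i++;
      j++;
    } else if (i < a.length && (j === b.length || lcs[i + 1][j] >= lcs[i][j + 1])) {
      // On ties, emit the removal first so a removed block precedes its added block.
      push('removed', a[i]);
      i++;
    } else {
      push('added', b[j]);
      j++;
    }
  }

  // Convert the internal `kind` tag into jsdiff-style boolean flags.
  return parts.map(({ kind, value }) => ({
    value,
    ...(kind === 'added' ? { added: true } : {}),
    ...(kind === 'removed' ? { removed: true } : {}),
  }));
}

const lineParts = diffLinesSketch('a\nb\nc', 'a\nx\nc');
```

For this input the result has four parts: an equal part holding "a", a removed part holding "b", an added part holding "x", and an equal part holding "c". A removed block directly followed by an added block is the pattern Pass 2 looks for.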

Pass 2: pair the changed lines. A block of removed lines followed immediately by a block of added lines represents a change. Pair them 1-to-1 by index. Line 1 of the removed block pairs with line 1 of the added block. This pairing is what makes vertical alignment possible.

function buildPairs(lineDiffs) {
  const pairs = [];
  // Removed block held back until we know whether an added block follows it
  let pending = null;

  // Emit a held-back removed block as left-only rows
  const flushPending = () => {
    if (!pending) return;
    for (const line of pending) {
      pairs.push({ left: line, right: null, type: 'remove' });
    }
    pending = null;
  };

  for (const part of lineDiffs) {
    if (part.removed) {
      flushPending();
      pending = part.value;
    } else if (part.added) {
      // Pair removed and added lines 1-to-1; a missing side becomes null
      const removed = pending ?? [];
      const added = part.value;
      pending = null;
      const maxLen = Math.max(removed.length, added.length);
      for (let i = 0; i < maxLen; i++) {
        const hasLeft = i < removed.length;
        const hasRight = i < added.length;
        pairs.push({
          left: hasLeft ? removed[i] : null,
          right: hasRight ? added[i] : null,
          type: hasLeft && hasRight ? 'change' : hasLeft ? 'remove' : 'add',
        });
      }
    } else {
      // An equal block ends any pending removal: those lines were pure deletions
      flushPending();
      for (const line of part.value) {
        pairs.push({ left: line, right: line, type: 'equal' });
      }
    }
  }
  // A removed block at the very end of the diff has nothing after it
  flushPending();
  return pairs;
}

Each pair renders as one row: a line number gutter, a glyph (+, -, or blank), and the line content. Removed lines get a red row background on the left; the corresponding right cell is empty. Added lines get a green row background on the right; the corresponding left cell is empty. Changed lines get red left and green right, with an additional intra-line pass running diffWords on just that pair to mark the specific changed tokens inside the line.
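One way to derive the gutter numbers and glyphs from the pairs, as a sketch. It assumes the { left, right, type } pair shape produced by buildPairs above; toRows and the row field names are hypothetical.

```javascript
// Hypothetical helper: turn pairs into display rows with an independent
// line counter for each pane. An empty cell gets no line number.
function toRows(pairs) {
  let leftNo = 0;
  let rightNo = 0;
  return pairs.map((pair) => {
    const hasLeft = pair.left !== null;
    const hasRight = pair.right !== null;
    const changed = pair.type !== 'equal';
    return {
      type: pair.type,
      left: {
        lineNo: hasLeft ? ++leftNo : null,
        glyph: hasLeft && changed ? '-' : ' ',
        text: hasLeft ? pair.left : '',
      },
      right: {
        lineNo: hasRight ? ++rightNo : null,
        glyph: hasRight && changed ? '+' : ' ',
        text: hasRight ? pair.right : '',
      },
    };
  });
}

const rows = toRows([
  { left: 'title', right: 'title', type: 'equal' },
  { left: 'old body', right: 'new body', type: 'change' },
  { left: 'dropped', right: null, type: 'remove' },
  { left: 'footer', right: 'footer', type: 'equal' },
]);
```

In this example "footer" is line 4 on the left and line 3 on the right, yet both cells share one row. The numbering diverges while the vertical position stays aligned.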

What You Give Up

The pair-based model aligns changed lines by index, not by semantic meaning. If a block of 5 lines was removed and 3 lines were added, the last 2 pairs have null on the right side. The algorithm does not try to determine which removed lines best correspond to which added lines. That kind of similarity matching is more expensive and still not always correct.

For document-level comparison this is acceptable. The user can see what was removed and what replaced it. Closer semantic alignment requires a similarity-scoring pass over each changed block, which comes with its own set of tradeoffs.

When the Common Pattern Is Right

Span-based word diff is correct when you are comparing short strings, not documents. Single-cell editing, version history for a text field, autocomplete suggestion highlighting: all of these are flat string comparisons where line structure is irrelevant. The jsdiff span approach is exactly right for these cases. The problem is applying it to full documents where line alignment is the entire point of the comparison.
