Glossa

Pathological term names

This fixture stresses the longest-first rule, term-name characters that collide with the surrounding markup, and overlap cases where one term is a prefix of another.

Overlapping prefixes

The three Tier* terms (Tier, Tier 1, and Tier 1 endpoint) must each occur in the prose without one shadowing another. When the walker sees "Tier 1 endpoint" in the text, the Tier 1 endpoint mark must win over both shorter candidates. A reader scanning down should see all three marks at the appropriate moments: "Tier" alone in one paragraph, "Tier 1" in another (without being absorbed by "Tier"), and "Tier 1 endpoint" in a third (without being absorbed by either shorter form).
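One way to get the longest-first behaviour is to sort the term list by length before building a single alternation, so the regex engine tries "Tier 1 endpoint" before "Tier 1" and "Tier". The sketch below is illustrative only: `escapeTerm`, `buildLongestFirst`, and `findTerms` are hypothetical names, and the letter-or-digit lookarounds are an assumed boundary rule, not necessarily the one the walker uses.

```typescript
// Escape regex metacharacters so a term is matched literally.
function escapeTerm(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical builder: longer terms come first in the alternation,
// so a longer term wins when several terms start at the same position.
function buildLongestFirst(terms: string[]): RegExp {
  const sorted = terms.slice().sort((a, b) => b.length - a.length);
  const body = sorted.map(escapeTerm).join("|");
  // Assumed boundary rule: no ASCII letter or digit on either side of a match.
  return new RegExp("(?<![A-Za-z0-9])(?:" + body + ")(?![A-Za-z0-9])", "g");
}

// Return the matched term text for every hit, in document order.
function findTerms(text: string, terms: string[]): string[] {
  return text.match(buildLongestFirst(terms)) || [];
}

const tierTerms = ["Tier", "Tier 1", "Tier 1 endpoint"];
const tierHits = findTerms(
  "Tier 1 endpoint customers, Tier 1 customers, Tier customers",
  tierTerms
);
// tierHits is ["Tier 1 endpoint", "Tier 1", "Tier"]
```

Sorting once at build time keeps the per-match cost low. An alternative would be to collect every candidate match and resolve overlaps afterwards, which is more flexible but needs an explicit tie-breaking step.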

Similarly, stop and stopword overlap. set and Set-Cookie overlap, with case sensitivity in play. key-value and key-value store overlap, with the whitespace boundary in play. JSON-RPC and JSON-RPC 2.0 overlap, with the version-suffix boundary in play. ML pipeline and ML pipeline v2 overlap, again on the version suffix.

A paragraph mentioning all of them: when you ship a key-value store backed by JSON-RPC 2.0, you'll want to stop the request at the edge if it carries a stopword, and you'll want to set the Set-Cookie header carefully. The ML pipeline v2 ingests this stream; the ML pipeline (the predecessor) does not. Tier 1 endpoint customers see the new pipeline; Tier 1 customers see the old one; Tier customers — that is, all customers in any tier — see nothing here.

Terms that collide with HTML element names

Two terms in this fixture are named after HTML elements: table and code.

The term table appears in prose like "the data table format" and "set the table". The walker must mark the word "table" in surrounding text without re-entering an actual <table> element. Below is an actual table that contains the word "table" in a cell — the cell text should still get marked, but only on its first wrap.

| Type | Notes about the table | Example |
| --- | --- | --- |
| flat | The table is a flat layout. | rows × columns |
| nested | A table inside a table cell. | not used in this fixture |
| empty | The table has no rows. | edge case |

The term code appears in prose like "below is a code example" and "the code review process". A fenced code block follows; text inside the fence should NOT be marked (it's <code> content):


```
This is a code block. The word "code" appears here three times: code, code, code.
The walker should skip every word inside this fence.
```

Inline code with the word in it: the `code` variable holds a string. Inline <code> is also a no-mark zone.

After the code fence, prose resumes — "Now we are back to normal prose where the word code should be marked again."
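The no-mark zones described above can be sketched as a recursive walk that refuses to descend into certain elements. The tree types here are a minimal stand-in for whatever DOM or syntax tree the walker really sees, and `SKIP_TAGS` and `countMarkable` are hypothetical names. Adding `mark` to the skip list is an assumption about how a second pass avoids wrapping an already-marked word; plain substring counting replaces real boundary matching to keep the example short.

```typescript
// Illustrative tree shape only; the real walker's node types may differ.
type TextNode = { kind: "text"; value: string };
type ElementNode = { kind: "element"; tag: string; children: TreeNode[] };
type TreeNode = TextNode | ElementNode;

// Assumed skip list: code content, plus text that is already inside a mark.
const SKIP_TAGS: string[] = ["code", "pre", "mark"];

// Count occurrences of `term` in text nodes that lie outside every skip zone.
function countMarkable(node: TreeNode, term: string): number {
  if (node.kind === "text") {
    return node.value.split(term).length - 1;
  }
  if (SKIP_TAGS.indexOf(node.tag) !== -1) {
    return 0; // do not descend: everything below is a no-mark zone
  }
  let total = 0;
  for (const child of node.children) {
    total += countMarkable(child, term);
  }
  return total;
}

const sampleDoc: TreeNode = {
  kind: "element",
  tag: "div",
  children: [
    { kind: "text", value: "below is a code example" },
    {
      kind: "element",
      tag: "pre",
      children: [
        {
          kind: "element",
          tag: "code",
          children: [{ kind: "text", value: "code, code, code" }],
        },
      ],
    },
    { kind: "element", tag: "code", children: [{ kind: "text", value: "code" }] },
    { kind: "text", value: "the word code should be marked again" },
  ],
};

const markableCount = countMarkable(sampleDoc, "code");
// markableCount is 2: one hit before the fence, one after, none inside.
```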

Regex-meta and punctuation in term names

C# is a term with a # in it. F' has an apostrophe. The walker's regex builder must escape these characters when constructing the pattern. Naive concatenation produces broken regexes that either fail to match or match too much.
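A minimal sketch of the escaping step, assuming the walker builds JavaScript regexes. `escapeRegExp` and `termPattern` are hypothetical names. In JavaScript specifically, neither `#` nor `'` is a regex metacharacter, so for these two terms the more likely failure is the `\b` word boundary: it never matches between `#` and a following space, because both are non-word characters. The helper still escapes true metacharacters, shown here with C++, a term that is not in this fixture and is used only for illustration.

```typescript
// Escape every character that has special meaning in a JavaScript regex.
function escapeRegExp(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical pattern builder. Lookarounds replace \b so that terms
// ending in punctuation, such as C# or F', can still match.
function termPattern(term: string): RegExp {
  return new RegExp(
    "(?<![A-Za-z0-9])" + escapeRegExp(term) + "(?![A-Za-z0-9])",
    "g"
  );
}

const sentence = "developers writing C# bindings for the F' framework";

const cSharpHits = sentence.match(termPattern("C#")) || [];  // ["C#"]
const fPrimeHits = sentence.match(termPattern("F'")) || [];  // ["F'"]

// A true metacharacter case (term not part of this fixture).
const cPlusPlusHits = "a C++ binding".match(termPattern("C++")) || [];  // ["C++"]

// The naive \b version finds nothing for C#.
const naiveMatches = new RegExp("\\bC#\\b").test(sentence);  // false
```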

In a sentence: developers writing C# bindings for the F' framework occasionally complain that the symbol C# clashes with the URL fragment syntax in MDX rendering.

Single-character terms

K is a single-character term. Single-character terms are dangerous — every K-shaped letter risks getting marked. The fixture deliberately puts K in many positions: K, k, Kk, kK, OK, KOA, KKK. Policy in plan/00_processor.md says the walker uses word-boundary matching for single-char terms; this fixture exercises that.

In a sentence: the constant K (uppercase only) is the cosine-similarity threshold. The lowercase k is just a letter. The word OK contains a K but is not the term K. KKK is three Ks in a row; it should be marked three times only if the word-boundary policy allows it.
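To make the single-character cases concrete, here is a sketch of one possible word-boundary policy: a case-sensitive match that is rejected whenever an ASCII letter or digit sits on either side. `markSingleChar` is a hypothetical name, and the real policy in plan/00_processor.md may define boundaries differently.

```typescript
// Hypothetical matcher for a single-character, case-sensitive term.
// Assumes `term` is a single letter or digit, so it needs no escaping.
function markSingleChar(text: string, term: string): number[] {
  const pattern = new RegExp(
    "(?<![A-Za-z0-9])" + term + "(?![A-Za-z0-9])",
    "g"
  );
  const offsets: number[] = [];
  let m: RegExpExecArray | null;
  // Collect the start offset of every accepted match.
  while ((m = pattern.exec(text)) !== null) {
    offsets.push(m.index);
  }
  return offsets;
}

const kOffsets = markSingleChar("K, k, Kk, kK, OK, KOA, KKK.", "K");
// kOffsets is [0]: only the standalone uppercase K qualifies.
```

Under this particular boundary definition KKK yields no marks at all, because every K in it touches another letter. A policy that treats each K as its own token would behave differently.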

Common English words as terms

data is a common English word. The fixture says "data" many times: the data ingestion pipeline, the data lake, your data is your data, what data does data hold? Marking every occurrence would drown the rail. Policy in plan/02reviewui.md says common-word terms get a "review needed" flag at extraction time but, once accepted, are marked normally. This fixture exercises the marking, not the flagging.

Hyphenated and version-suffixed names

JSON-RPC vs JSON-RPC 2.0. ML pipeline vs ML pipeline v2. key-value vs key-value store. stop vs stopword. These pairs each test that the longer form beats the shorter form when both are candidates at the same text position.

ASCII art and verbatim content

ASCII art appears in this paragraph. It also appears inside a fenced block below, where it should NOT be marked:


```
ASCII art is fun: ___
                 /   \   ASCII art
                 \___/
```

After the fence, ASCII art prose resumes and should mark again.

Density wrap-up

This fixture is short on word count but heavy on adversarial term-name patterns. The 20 terms in the glossary should each get at least one mark in the post-hydration DOM. The expectedTerms array in manifest.json lists them by id; the assertion suite walks the rail rows and confirms each id is present with at least one matched <mark> in the prose.
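The final check could look roughly like the sketch below. The `Manifest` and `RailRow` shapes are assumptions made for the example (only the `expectedTerms` array is named in this fixture), `missingTerms` is a hypothetical helper, and the term ids are invented.

```typescript
// Assumed shapes: the real manifest.json and rail rows may differ.
interface Manifest {
  expectedTerms: string[];
}
interface RailRow {
  termId: string;
  markCount: number; // number of matched <mark> elements for this term
}

// Return every expected id that has no rail row with at least one mark.
function missingTerms(manifest: Manifest, rows: RailRow[]): string[] {
  const counts: { [id: string]: number } = {};
  for (const row of rows) {
    counts[row.termId] = (counts[row.termId] || 0) + row.markCount;
  }
  return manifest.expectedTerms.filter((id) => !(counts[id] > 0));
}

// Invented ids for illustration only.
const sampleManifest: Manifest = { expectedTerms: ["tier", "tier-1", "k"] };
const sampleRows: RailRow[] = [
  { termId: "tier", markCount: 2 },
  { termId: "tier-1", markCount: 0 },
];

const missing = missingTerms(sampleManifest, sampleRows);
// missing is ["tier-1", "k"]: one row has zero marks, one id has no row.
```

Returning the list of missing ids, rather than a boolean, lets a failing assertion name exactly which terms lost their marks.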