mxbai-embed-large thinks “Hokkaidō” and “Éire” are the same word

A top-ranked MTEB embedding model collapses nearly every word containing a diacritical mark to a single point in vector space, but only when deployed via Ollama. The upstream HuggingFace tokenizer strips accents correctly; the Ollama gguf conversion pipeline silently drops the preprocessing step, and the defect class affects every BERT-derived embedding model we tested in the local-inference stack. We confirmed 147,687 cross-entity collisions on Wikidata.
Last updated: 2026-04-11 23:43 UTC · regenerated from scripts/verify_tokenizer_divergence.py

The short version

mxbai-embed-large is a top-ranked open-source embedding model on MTEB. When deployed via Ollama — the dominant local-inference stack for embedding models — it has a silent tokenizer defect that collapses any short text containing diacritical marks (ō, é, ü, ł, ṣ, …) into a single point in embedding space.

Result: “Hokkaidō” (a Japanese island) and “Éire” (Ireland in Irish) produce identical embeddings, with cosine similarity 1.000000 (equal to six decimal places). Meanwhile “Hokkaidō” has cosine similarity only 0.51 to its own ASCII spelling “Hokkaido”. The diacritical form of a word is closer to an unrelated diacritical word than to its own ASCII spelling.

But the defect does not reproduce with the upstream HuggingFace tokenizer: HF's BertTokenizer accent-stripping preprocessing handles diacritics correctly. The bug lives in the conversion pipeline from HuggingFace to gguf / Ollama, not in the upstream mxbai weights. That makes it a defect class rather than a single-model bug: all three BERT-derived embedding models we tested via Ollama are affected.

147,687 · Cross-entity collisions (Wikidata)
0.904 · Cosine (unrelated diacritic pairs)
0.511 · Cosine (same word, diacritic vs ASCII)
3/3 · BERT embed models affected via Ollama

About this page

Every number on this page is regenerated automatically. The pipeline is two scripts: scripts/verify_tokenizer_divergence.py runs the upstream HuggingFace tokenizer and the archived gguf via Ollama side-by-side and writes verification/tokenizer_divergence.json. Then scripts/generate_defect_page.py rebuilds the figures and this page from that JSON. The “archived” gguf refers to model/mxbai-embed-large-v1.gguf, shipped in this repo so the result is reproducible even if the upstream mxbai weights are patched.

1. The Collision Heatmap

Every cell shows the cosine similarity between two words containing diacritical marks. In a working embedding model, unrelated words should have cosine around 0.3–0.5. Instead, nearly every pair is at 1.00. The one exception, at roughly 0.5, is "São Paulo": its internal space splits it into two words, and the second, ASCII-only word ("Paulo") is tokenized normally.

[Image: collision heatmap]
Figure 1: 361 of 380 off-diagonal cells are at cosine ≈ 1.00.
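
The denominator in Figure 1 follows from the sample size: 20 words give 20 × 19 = 380 ordered off-diagonal cells. A minimal sketch of how such a tally could be computed is shown here, using made-up 2-D vectors instead of real embeddings. The function names and the 0.995 cutoff for "≈ 1.00" are illustrative assumptions, not taken from the repo's scripts.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def count_collided_cells(vectors, threshold=0.995):
    """Return (collided, total) over the ordered off-diagonal cells of the cosine matrix."""
    n = len(vectors)
    collided = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and cosine(vectors[i], vectors[j]) >= threshold
    )
    return collided, n * (n - 1)

# Toy data: three vectors collapsed onto one point, plus one distinct vector.
toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(count_collided_cells(toy))  # (6, 12)
```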

2. The Paradox: “Hokkaidō” is closer to “Éire” than to “Hokkaido”

Compare two cosines for each word in the 20-word sample: the word vs its own ASCII spelling (blue), and the word vs a different diacritical word (terracotta). A working model should put the blue bars near 1.0 and the terracotta bars near 0.5. The measured values are inverted: blue averages 0.511 and terracotta averages 0.904.

[Image: bar chart, inverted: unrelated diacritical words score higher than matched ASCII pairs]
Figure 2: Blue = same word (diacritic vs plain), should be ≈1.0. Terracotta = different word (diacritic vs diacritic), should be ≈0.5. The actual values are inverted.

3. This is a hard collapse, not gradual degradation

If the failure mode were a soft blur, with diacritical marks adding noise that gradually erodes similarity, the number of colliding pairs would drop smoothly as the threshold is raised. It doesn't. The curve is nearly flat from 0.80 all the way up to 0.99.

[Image: threshold sweep curve, flat at the maximum]
Figure 3: Collision count is nearly constant from cosine 0.80 to 0.99. The collapse is binary.
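
The difference between a binary collapse and a gradual blur can be illustrated with a small threshold sweep. This is only a sketch: both lists of cosines are hypothetical values chosen for illustration, not measurements, and the function name `sweep` is not taken from the repo's scripts.

```python
def sweep(similarities, thresholds):
    """For each threshold, count how many pairwise cosines are at or above it."""
    return [sum(1 for s in similarities if s >= t) for t in thresholds]

thresholds = [0.80, 0.85, 0.90, 0.95, 0.99]

# Hypothetical pairwise cosines, for illustration only (not measured values).
hard_collapse = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5]    # pairs are either fully collided or not
soft_blur = [0.99, 0.96, 0.91, 0.86, 0.81, 0.5]   # similarity erodes gradually

print(sweep(hard_collapse, thresholds))  # [4, 4, 4, 4, 4]
print(sweep(soft_blur, thresholds))      # [5, 4, 3, 2, 1]
```

A flat count across thresholds, as in the first list, is the signature reported in Figure 3.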

4. Distribution analysis: the [UNK] cluster

Three similarity distributions side by side. The diacritical-vs-diacritical panel (left) is a spike at 1.0, not a bell curve — compare to the control panel (middle), which shows what a healthy embedding space actually produces.

[Image: three histograms; leftmost shows a spike at cosine 1.0, middle and right are normal distributions around 0.5]
Figure 4: Diacritical-vs-diacritical (left) is a spike at 1.0. Control-vs-control (middle) is a normal distribution around 0.5. This is a tokenizer defect, not embedding noise.

5. Why it happens: the HF-vs-Ollama divergence

Step 1 — upstream HF BertTokenizer strips accents

The upstream HuggingFace BertTokenizer strips accents for 9 of 10 test pairs: the diacritical form and the ASCII form produce identical token IDs. (The config is do_lower_case=True with strip_accents=None, and an unset strip_accents defaults to accent-stripping when lower-casing is on.) The one non-identical pair involves Ł, which is a distinct Latin letter rather than a base letter plus a combining mark, so NFD normalization leaves it alone.

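| Diacritical | Match | ASCII | HF tokens (diacritic) | HF tokens (ASCII) |
|---|---|---|---|---|
| Hokkaidō | = | Hokkaido | hokkaido | hokkaido |
| Éire | = | Eire | e ##ire | e ##ire |
| Zürich | = | Zurich | zurich | zurich |
| café | = | cafe | cafe | cafe |
| Dvořák | = | Dvorak | d ##vor ##ak | d ##vor ##ak |
| naïve | = | naive | naive | naive |
| São Paulo | = | Sao Paulo | sao paulo | sao paulo |
| Malmö | = | Malmo | malmo | malmo |
| Gdańsk | = | Gdansk | gdansk | gdansk |
| Łódź | ≠ | Lodz | łodz | lo ##d ##z |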
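
The accent-stripping step can be approximated with the standard library alone. The sketch below lower-cases, applies NFD, and drops combining marks (Unicode category Mn). It is a simplified stand-in for what BasicTokenizer does, not a copy of the HuggingFace code, and the function name is illustrative.

```python
import unicodedata

def strip_accents(text):
    """Lower-case, NFD-decompose, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for word in ["Hokkaidō", "Éire", "Dvořák", "Łódź"]:
    print(word, "->", strip_accents(word))
# Hokkaidō -> hokkaido
# Éire -> eire
# Dvořák -> dvorak
# Łódź -> łodz   (ł has no canonical decomposition, so it survives)
```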

Step 2 — the archived gguf via Ollama does NOT strip accents

When the same mxbai weights are loaded through Ollama (from model/mxbai-embed-large-v1.gguf, registered as mxbai-archived via this repo's model/Modelfile), diacritical characters are preserved all the way into the WordPiece step. Since those characters are not in the WordPiece vocab and the gguf tokenizer has no character-level fallback, the whole whitespace-delimited token becomes [UNK]. For short inputs this single [UNK] dominates the mean-pooled embedding, and every diacritical string ends up at the same point:

"Hokkaidō"  →  [CLS]  [UNK]  [SEP]
"Éire"      →  [CLS]  [UNK]  [SEP]
"Zürich"    →  [CLS]  [UNK]  [SEP]
"café"      →  [CLS]  [UNK]  [SEP]
"Dvořák"    →  [CLS]  [UNK]  [SEP]

Empirically, on the archived gguf via Ollama, diacritical-vs-ASCII same-word cosine is 0.511 (should be ≈1.0 if the tokenizer is clean) and diacritical-vs-diacritical different-word cosine is 0.904 (should be ≈0.513, the ASCII control baseline).
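
The whole-word [UNK] behaviour described above can be sketched with a toy greedy, longest-match-first WordPiece. This is an illustrative reimplementation under stated assumptions, not the gguf tokenizer's code: the six-entry vocab is made up (the real vocab is far larger), and inputs are already lower-cased so that only the accent handling differs.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece for one whitespace-delimited word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # one unmatched span turns the whole word into [UNK]
        pieces.append(match)
        start = end
    return pieces

def tokenize(text, vocab):
    """Split on whitespace, WordPiece each word, and add [CLS] / [SEP]."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens + ["[SEP]"]

# Tiny illustrative vocab (an assumption, not the model's real vocab).
vocab = {"hokkaido", "e", "##ire", "sao", "paulo", "cafe"}

print(tokenize("hokkaido", vocab))   # ['[CLS]', 'hokkaido', '[SEP]']
print(tokenize("hokkaidō", vocab))   # ['[CLS]', '[UNK]', '[SEP]']
print(tokenize("éire", vocab))       # ['[CLS]', '[UNK]', '[SEP]']
print(tokenize("são paulo", vocab))  # ['[CLS]', '[UNK]', 'paulo', '[SEP]']
```

In this toy, "hokkaidō" and "éire" yield the same token sequence, and "são paulo" keeps one real token, matching the one exception in the heatmap.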

The table below lists the top cross-diacritic cosine similarities measured on the archived mxbai gguf via Ollama, i.e. on the frozen weights shipped in this repo. A working model should score unrelated words around 0.3–0.5.

| Diacritical word A | Diacritical word B | Cosine similarity |
|---|---|---|
| Hokkaidō | Éire | 1.0000 |
| Hokkaidō | Zürich | 1.0000 |
| Hokkaidō | café | 1.0000 |
| Hokkaidō | Dvořák | 1.0000 |
| Hokkaidō | naïve | 1.0000 |
| Hokkaidō | Malmö | 1.0000 |
| Hokkaidō | Gdańsk | 1.0000 |
| Hokkaidō | Łódź | 1.0000 |
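
A single cell of this table could be spot-checked by hand with something like the sketch below. It assumes a local Ollama server on the default port exposing the /api/embeddings endpoint, and a model registered as mxbai-archived. The helper names are illustrative and this is not the repo's verification script.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # assumed default local endpoint

def embed(text, model="mxbai-archived"):
    """Request one embedding vector from a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embedding"]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Usage (requires a running Ollama server, so it is left commented out):
# print(round(cosine(embed("Hokkaidō"), embed("Éire")), 4))
# print(round(cosine(embed("Hokkaidō"), embed("Hokkaido")), 4))
```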

Step 3 — the root cause: a dropped preprocessing step

The mechanism is not a WordPiece limitation. HF's BertTokenizer applies BasicTokenizer's accent-stripping (via NFD normalization plus combining-mark removal) before WordPiece sees the string. That preprocessing is wired in when do_lower_case=True. The gguf conversion pipeline that produces the Ollama model drops this preprocessing step: the gguf tokenizer sees raw Unicode diacritics, has no way to match them to its WordPiece vocab, and emits [UNK].

Because the preprocessing step is a function of the BERT tokenizer config (not of any model-specific training), the same defect class is expected to affect every BERT-derived embedding model exported to gguf via the same conversion pipeline. The next section measures that.

6. Scope: this affects every BERT-derived embedding model we tested in Ollama

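The verification script runs the same diacritic-vs-ASCII probe against every BERT-family embedding model registered in local Ollama. Each row reports three mean cosine similarities:

  S: a diacritical word vs its own ASCII spelling (same word)
  D: a diacritical word vs a different diacritical word
  C: the ASCII control baseline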

A healthy model has S ≈ 1 and D ≈ C. A model with a diacritic attractor has D >> C. A model with an “[UNK] collapse” additionally has S ≈ C (the same word's ASCII form is no more similar than an unrelated word).

| Model (via Ollama) | S: diac↔ASCII, same word | D: diac↔diac, different words | C: ASCII control baseline | Severity | Failure mode |
|---|---|---|---|---|---|
| mxbai-archived | 0.511 | 0.904 | 0.513 | 0.78 | Full collapse |
| nomic-embed-text | 0.888 | 0.992 | 0.426 | 0.58 | Diacritic attractor |
| all-minilm | 0.240 | 0.875 | 0.214 | 1.32 | Full collapse |

Every BERT-derived embedding model we tested via Ollama has a failure mode on diacritical text. mxbai-archived and all-minilm exhibit the full [UNK] collapse; nomic-embed-text has a softer but still-severe diacritic attractor (its unrelated diacritic pairs cluster at cosine ~0.99, even though it recognizes same-word ASCII equivalents). This is not a one-off bug in one model — it's a systemic defect class at the deployment-tooling layer.
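
The S / D / C reading rule can be written down as a small classifier. The two gap thresholds below are illustrative assumptions chosen so that the three measured rows land on their reported failure modes. They are not values from the verification script, and this sketch does not reproduce the Severity column.

```python
def classify(s, d, c, attractor_gap=0.2, collapse_gap=0.1):
    """Label a model from its three mean cosines.

    s: diacritic vs own ASCII form, d: diacritic vs other diacritic, c: ASCII control.
    Both gap thresholds are hypothetical, for illustration only.
    """
    if d - c <= attractor_gap:
        return "Healthy"              # D is close to C: no diacritic attractor
    if s - c <= collapse_gap:
        return "Full collapse"        # D >> C and S is about C
    return "Diacritic attractor"      # D >> C but S is still well above C

rows = {
    "mxbai-archived": (0.511, 0.904, 0.513),
    "nomic-embed-text": (0.888, 0.992, 0.426),
    "all-minilm": (0.240, 0.875, 0.214),
}
for name, (s, d, c) in rows.items():
    print(name, "->", classify(s, d, c))
# mxbai-archived -> Full collapse
# nomic-embed-text -> Diacritic attractor
# all-minilm -> Full collapse
```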

7. Who is affected

| Domain | Impact | Example |
|---|---|---|
| Multilingual NLP via Ollama | Critical | Any language with diacritics (French, German, Japanese romaji, Polish, Czech, Arabic transliteration…) |
| Knowledge graphs via Ollama | Critical | Wikidata entity labels with non-ASCII characters become indistinguishable |
| RAG / retrieval via Ollama | High | Documents about “Malmö” match queries about “Dvořák” |
| Semantic search via Ollama | High | Any product/person/place name with accented characters |
| Upstream HF transformers | Unaffected | HF's BertTokenizer strips accents in preprocessing; the bug lives below this layer |
| English-only ASCII workloads | Unaffected | Standard ASCII text works fine regardless of deployment layer |

8. Reproducing this

Everything on this page regenerates from two scripts, using the frozen gguf shipped in this repository so the result is stable even if the upstream mxbai weights are patched:

# 1. Register the archived gguf in Ollama
cd model/
ollama create mxbai-archived -f Modelfile
cd ..

# 2. Run the verification script (writes verification/tokenizer_divergence.json)
pip install transformers   # for the upstream HF probe
python scripts/verify_tokenizer_divergence.py

# 3. Regenerate figures and this page from the JSON artifact
python scripts/generate_defect_page.py

The older single-file demo (scripts/demo_collisions.py) still works and is faster if you just want to see the defect: it embeds 25 pairs via Ollama and writes collisions.csv. GitHub Actions reruns that script daily, and the CSV it produces is deterministic (the collisions do not drift between runs), which is itself part of the result.

9. What should be done

  1. Ollama / gguf conversion: BERT-derived models with do_lower_case=True need their BasicTokenizer preprocessing (NFD normalization + combining-mark strip) carried through gguf conversion. Without this, WordPiece sees raw diacritics and emits [UNK].
  2. Benchmark gap: MTEB should include diacritical / non-Latin string pairs as a robustness check — no current task in the suite surfaces this class of defect.
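  3. User workaround: If you cannot switch away from Ollama, NFD-normalize and strip combining marks on the client side before embedding. This loses linguistic information but prevents the collisions for decomposable diacritics (letters such as Ł, which NFD leaves alone, need a separate mapping).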
  4. Model choice: Models whose tokenizers do not depend on accent stripping, such as byte-level BPE or SentencePiece-based multilingual models (and nomic-embed-text on some versions), are less exposed, though our probe shows even nomic-embed-text via Ollama has a softer diacritic attractor.
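
The client-side workaround in item 3 could look roughly like the sketch below. It NFD-normalizes, drops combining marks, and then applies a small fold map for letters that have no canonical decomposition. The fold map is a hand-written illustration (an assumption, to be extended for your data), and the function name is hypothetical.

```python
import unicodedata

# Illustrative fold map for letters that NFD does not decompose (an assumption).
EXTRA_FOLDS = str.maketrans(
    {"Ł": "L", "ł": "l", "Ø": "O", "ø": "o", "Đ": "D", "đ": "d"}
)

def ascii_fold(text):
    """NFD-normalize, drop combining marks (category Mn), then apply the fold map."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.translate(EXTRA_FOLDS)

for word in ["Malmö", "Dvořák", "Łódź", "Hokkaido"]:
    print(word, "->", ascii_fold(word))
# Malmö -> Malmo
# Dvořák -> Dvorak
# Łódź -> Lodz
# Hokkaido -> Hokkaido
```

Applying `ascii_fold` to every string before it is sent for embedding trades away the accent information in exchange for inputs the WordPiece vocab can match.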

10. Context

This defect was discovered during the Latent Space Cartography project, which applies Vector Symbolic Architecture (VSA) analysis to frozen text embeddings. When probing Wikidata entity embeddings for relational structure, the collision pattern was unmistakable: 147,687 cross-entity pairs at cosine ≥ 0.95, all involving diacritical text. The full analysis is documented in our paper “Latent Space Cartography Applied to Wikidata: Relational Displacement Analysis Reveals a Silent Tokenizer Defect in mxbai-embed-large” (clawRxiv 2604.00648).