[it] Regression: EditTreeLemmatizer produces wrong lemmas (spaCy 3.3+ worse than 3.2) #13913

@francescamilanini

Bug Description

This is a REGRESSION bug: spaCy 3.3+'s EditTreeLemmatizer produces incorrect lemmas for Italian verbs that were correctly lemmatized in spaCy 3.2 and earlier with the lookup-based lemmatizer.

Examples:

| Token | spaCy ≤3.2 (lookup) | spaCy 3.3+ (EditTree) | Correct | Corpus Evidence |
| --- | --- | --- | --- | --- |
| chiese | chiedere ✅ | chiedere ✅ / chiese ❌ / chiendere ❌ | chiedere | 31% correct, 65% invented, 2% not lemmatized |
| chiesi | chiedere ✅ | chiesi ❌ | chiedere | 100% not lemmatized |
| sedette | sedere ✅ | sedere ✅ / sedette ❌ / sedettere ❌ | sedere | 57% correct, 22% invented, 22% not lemmatized |
| sedetti | (missing) | sedettare ❌ | sedere | 100% invented |
| risedette | risedere ✅ | risedette ❌ | risedere | 100% not lemmatized |

Critical:

  • "chiendere", "sedettere", "sedettare" are non-existent Italian verbs
  • Correct lemmas exist in spacy-lookups-data (except "sedetti" which is missing) but are ignored or inconsistently applied by EditTreeLemmatizer
  • Context-dependent behavior: Same token produces different lemmas unpredictably

Evidence of Regression

spaCy 3.2 and earlier (correct data exists):

Four of the five correct lemmas are present in the official lookup data ("sedetti" is missing):
https://raw.githubusercontent.com/explosion/spacy-lookups-data/refs/heads/master/spacy_lookups_data/data/it_lemma_lookup_verb.json

Verified on 16 January 2025:

$ curl -s https://raw.githubusercontent.com/explosion/spacy-lookups-data/refs/heads/master/spacy_lookups_data/data/it_lemma_lookup_verb.json | jq 'to_entries | map(select(.key == "chiese" or .key == "chiesi" or .key == "sedette" or .key == "sedetti" or .key == "risedette")) | map({key, value})'

Output:

[
  {
    "key": "chiese",
    "value": "chiedere" # ✅ Correct
  },
  {
    "key": "chiesi",
    "value": "chiedere" # ✅ Correct
  },
  {
    "key": "risedette",
    "value": "risedere" # ✅ Correct (non-reflexive form; "risedersi" is the reflexive variant)
  },
  {
    "key": "sedette",
    "value": "sedere" # ✅ Correct
  }
]

"sedetti": MISSING ❌

Key findings:

  1. ✅ "chiese" → "chiedere" (correct, present in lookup)
  2. ✅ "chiesi" → "chiedere" (correct, present in lookup)
  3. ✅ "risedette" → "risedere" (correct, present in lookup)
  4. ✅ "sedette" → "sedere" (correct, present in lookup)
  5. ❌ "sedetti" β†’ MISSING from lookup data

This reveals two distinct issues:

  • Regression (chiese, chiesi, sedette, risedette): Correct lookup data exists but is ignored by EditTreeLemmatizer
  • Incomplete data (sedetti): Form missing from lookup + EditTreeLemmatizer produces invented lemma
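
The same presence check can be run in Python once the JSON file is downloaded. This is a minimal sketch, assuming the table is a flat form-to-lemma mapping as the jq output above suggests. `check_forms` is a hypothetical helper, and `sample_table` is a stand-in containing only the entries quoted in this issue (in practice, load the real file with `json.load`).

```python
from typing import Dict, List, Optional


def check_forms(table: Dict[str, str], forms: List[str]) -> Dict[str, Optional[str]]:
    """Map each form to its lemma in the lookup table, or None if it is absent."""
    return {form: table.get(form) for form in forms}


# Stand-in for it_lemma_lookup_verb.json, limited to the entries quoted in this issue.
sample_table = {
    "chiese": "chiedere",
    "chiesi": "chiedere",
    "risedette": "risedere",
    "sedette": "sedere",
}

forms = ["chiese", "chiesi", "sedette", "sedetti", "risedette"]
result = check_forms(sample_table, forms)

# Forms that have no entry in the table.
missing = [form for form, lemma in result.items() if lemma is None]
print(missing)
```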

How to reproduce the behaviour

spaCy 3.8.11 (regression):

Important note: Single-token tests may produce different results than
real corpus contexts due to EditTreeLemmatizer's context-dependent behavior
(for "chiese", "sedette", "risedette").
The corpus evidence below shows actual distribution in a literary text.

`import spacy

nlp = spacy.load("it_core_news_lg")  # v3.8.11

# Test 1: chiese → chiendere (invented verb)
doc = nlp('chiese')
print(f'Lemma of chiese: {doc[0].lemma_}')
# Output: "Lemma of chiese: chiendere"
# ❌ INVENTED VERB; in the corpus evidence 65% of occurrences are wrong (context-dependency bug)

# Test 2: chiesi → not lemmatized
doc = nlp('chiesi')
print(f'Lemma of chiesi: {doc[0].lemma_}')
# Output: "Lemma of chiesi: chiesi"
# ❌ NOT LEMMATIZED; in the corpus evidence it is never lemmatized

# Test 3: risedette → risedere (CORRECT)
doc = nlp('risedette')
print(f'Lemma of risedette: {doc[0].lemma_}')
# Output: "Lemma of risedette: risedere"
# ✅ CORRECT here, but ❌ not lemmatized in the corpus evidence (context-dependency bug)

# Test 4: sedette → not lemmatized
doc = nlp('sedette')
print(f'Lemma of sedette: {doc[0].lemma_}')
# Output: "Lemma of sedette: sedette"
# ❌ NOT LEMMATIZED; in the corpus evidence the behavior is inconsistent, with 22% invented (context-dependency bug)

# Test 5: sedetti → sedettare (invented verb)
doc = nlp('sedetti')
print(f'Lemma of sedetti: {doc[0].lemma_}')
# Output: "Lemma of sedetti: sedettare"
# ❌ INVENTED VERB; in the corpus evidence 100% of occurrences are wrong

# Lemmatizer info
lemmatizer = nlp.get_pipe("lemmatizer")
print(type(lemmatizer).__name__)  # EditTreeLemmatizer
print(hasattr(lemmatizer, 'lookups'))  # False`

Root Cause

Training Data Corruption

The EditTreeLemmatizer was trained on a corpus with incorrect annotations:

chiese β†’ chiendere ❌ (typo)
sedetti β†’ sedettare ❌ (wrong transformation)

Consequence: The model learned wrong edit trees from corrupted training data.

Lookup Data Ignored

The correct lemmas exist in spacy-lookups-data repository (stable for 3+ years based on git history), but:

  1. EditTreeLemmatizer does not use lookup tables
  2. Training corpus was not validated against correct lookup data
  3. No fallback to lookup for problematic verbs

Result: Regression from spaCy 3.2 (which used lookup tables correctly).

Impact

Severity: HIGH (Regression)

  1. Users upgrading from spaCy 3.2 or earlier get WORSE lemmatization
  2. Data corruption: Produces non-existent Italian verbs
  3. Common verbs affected:
    • "chiedere" (to ask) - extremely frequent
    • "sedere" and its reflexive form "sedersi" (to sit) - common in narrative texts
  4. Corpus linguistics broken: Almost all lemma-dependent analysis corrupted

Real Corpus Evidence

Tested on an Italian literary corpus of 200K+ alphabetic tokens (Niccolò Ammaniti, Fango, 73K words, and others) (*) (c):

Detailed Breakdown (details in table below):

"chiese" (n=55, of which 54 tagged VERB; percentages are of all 55):

  • Wrong lemma "chiendere": 36 occurrences (65%) ❌ INVENTED VERB
  • Correct lemma "chiedere": 17 (31%) ✅
  • Not lemmatized "chiese": 1 (2%) ❌
  • POS error: 1 occurrence (2%) where the capitalized verb "Chiese" was tagged NOUN with lemma "chiesa" (excluded from VERB analysis) ❌

"chiesi" (n=6):

  • Not lemmatized "chiesi": 6 (100%) ❌
  • Expected: "chiedere" ✅ (confirmed present in lookup data)

"sedette" (n=23) - INCONSISTENT BEHAVIOR:

  • Correct "sedere": 13 (57%) ✅
  • Invented "sedettere": 5 (22%) ❌
  • Not lemmatized "sedette": 5 (22%) ❌

Note on inconsistency: The same token produces three different lemmas. This suggests context-dependent edit tree selection, where the correct transformation is applied only in specific syntactic contexts (e.g., with reflexive pronouns or specific subjects). This unpredictability breaks the fundamental assumption of corpus linguistics that the same token should always have the same lemma.
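
This kind of inconsistency can be detected mechanically by grouping lemmas per form. The sketch below assumes rows have been exported as (text, pos, lemma) tuples. `lemmas_per_form` is a hypothetical helper, and the sample rows are a small subset copied from the corpus table in this issue.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def lemmas_per_form(rows: Iterable[Tuple[str, str, str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Collect the set of lemmas seen for each (lowercased text, POS) pair."""
    grouped: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for text, pos, lemma in rows:
        grouped[(text.lower(), pos)].add(lemma)
    return dict(grouped)


# A few (text, pos, lemma) rows taken from the corpus table.
rows = [
    ("sedette", "VERB", "sedere"),
    ("sedette", "VERB", "sedette"),
    ("sedette", "VERB", "sedettere"),
    ("chiesi", "VERB", "chiesi"),
    ("chiesi", "VERB", "chiesi"),
]

# Keep only forms that received more than one distinct lemma.
inconsistent = {
    key: lemmas for key, lemmas in lemmas_per_form(rows).items() if len(lemmas) > 1
}
print(inconsistent)
```

Note that "chiesi" does not show up here: it is consistently wrong, so a check of this kind finds unstable forms, not incorrect ones.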

"sedetti" (n=1):

  • Invented "sedettare": 1 (100%) ❌
  • Lookup data: "sedetti" is MISSING (should be added)

"risedette" (n=1) - compound form:

  • Lemma: "risedette" (not lemmatized, verified from corpus)
  • Expected: "risedere" (verb with prefix "ri-")

Total target tokens: 86

  • Invented verbs: 42 (49%) ← SEVERE DATA CORRUPTION
  • Not lemmatized: 13 (15%)
  • Correct: 30 (35%)
  • POS errors: 1 (1%)

49% of target tokens produce non-existent Italian verbs.
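
As a cross-check, the totals can be recomputed from the per-form counts in the breakdown. The counts below are read off the corpus table in this issue; the outcome labels are names chosen here for illustration.

```python
from collections import Counter

# Per-form outcome counts, as listed in the detailed breakdown and the corpus table.
counts = {
    "chiese": {"invented": 36, "correct": 17, "not_lemmatized": 1, "pos_error": 1},
    "chiesi": {"not_lemmatized": 6},
    "sedette": {"correct": 13, "invented": 5, "not_lemmatized": 5},
    "sedetti": {"invented": 1},
    "risedette": {"not_lemmatized": 1},
}

# Sum each outcome across all forms.
totals = Counter()
for per_form in counts.values():
    totals.update(per_form)

total = sum(totals.values())

# Share of each outcome, rounded to whole percent.
shares = {outcome: round(100 * n / total) for outcome, n in totals.items()}

print(total, dict(totals), shares)
```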

My Environment

Operating System macOS-15.6.1-arm64-arm-64bit
Python version: 3.9.6
spaCy version: 3.8.11
Pipelines: it_core_news_sm (3.8.0), it_core_news_lg (3.8.0)

Real Corpus Evidence - table (*):

book i text lemma pos is_alpha error MOST SEVERE
Fango 1629 chiese chiendere VERB 1 ❌ ❌
Fango 1904 chiese chiendere VERB 1 ❌ ❌
Fango 4163 chiese chiendere VERB 1 ❌ ❌
Fango 4622 chiese chiendere VERB 1 ❌ ❌
Fango 7419 chiese chiedere VERB 1
Fango 7637 chiese chiedere VERB 1
Fango 8167 chiese chiendere VERB 1 ❌ ❌
Fango 8796 chiese chiendere VERB 1 ❌ ❌
Fango 12589 chiese chiendere VERB 1 ❌ ❌
Fango 12798 chiese chiendere VERB 1 ❌ ❌
Fango 12875 chiese chiedere VERB 1
Fango 14062 chiese chiendere VERB 1 ❌ ❌
Fango 14222 chiese chiendere VERB 1 ❌ ❌
Fango 16461 chiese chiedere VERB 1
Fango 18119 chiese chiendere VERB 1 ❌ ❌
Fango 18423 chiese chiedere VERB 1
Fango 19226 chiese chiendere VERB 1 ❌ ❌
Fango 19734 chiese chiese VERB 1 ❌
Fango 20218 chiese chiendere VERB 1 ❌ ❌
Fango 22402 chiese chiedere VERB 1
Fango 22960 chiese chiendere VERB 1 ❌ ❌
Fango 24677 chiese chiedere VERB 1
Fango 24770 chiese chiendere VERB 1 ❌ ❌
Fango 26222 chiese chiendere VERB 1 ❌ ❌
Fango 28286 chiese chiedere VERB 1
Fango 30916 chiese chiendere VERB 1 ❌ ❌
Fango 31474 chiese chiedere VERB 1
Fango 31585 chiese chiedere VERB 1
Fango 32135 chiese chiendere VERB 1 ❌ ❌
Fango 34126 chiese chiendere VERB 1 ❌ ❌
Fango 38020 chiese chiendere VERB 1 ❌ ❌
Fango 48416 chiese chiendere VERB 1 ❌ ❌
Fango 48944 chiese chiendere VERB 1 ❌ ❌
Fango 49414 chiese chiedere VERB 1
Fango 50065 chiese chiendere VERB 1 ❌ ❌
Fango 57050 chiese chiendere VERB 1 ❌ ❌
Fango 59638 chiese chiendere VERB 1 ❌ ❌
Fango 60217 chiese chiedere VERB 1
Fango 63766 Chiese chiesa NOUN 1 ❌NOT NOUN
Fango 65414 chiese chiedere VERB 1
Fango 65482 chiese chiedere VERB 1
Fango 66464 chiese chiendere VERB 1 ❌ ❌
Fango 67194 chiese chiendere VERB 1 ❌ ❌
Fango 69993 chiese chiendere VERB 1 ❌ ❌
Fango 70124 chiese chiendere VERB 1 ❌ ❌
Fango 72276 chiese chiendere VERB 1 ❌ ❌
Fango 77247 chiese chiendere VERB 1 ❌ ❌
Fango 77412 chiese chiendere VERB 1 ❌ ❌
Fango 77754 chiese chiendere VERB 1 ❌ ❌
Fango 78507 chiese chiedere VERB 1
Fango 79339 chiese chiedere VERB 1
Fango 80081 chiese chiedere VERB 1
Fango 81777 chiese chiendere VERB 1 ❌ ❌
Fango 83309 chiese chiendere VERB 1 ❌ ❌
Fango 84066 chiese chiendere VERB 1 ❌ ❌
Fango 88184 chiesi chiesi VERB 1 ❌
Fango 88297 chiesi chiesi VERB 1 ❌
Fango 90262 chiesi chiesi VERB 1 ❌
Fango 90301 chiesi chiesi VERB 1 ❌
Fango 90853 chiesi chiesi VERB 1 ❌
Fango 91177 chiesi chiesi VERB 1 ❌
Fango 86753 risedette risedette VERB 1 ❌
Fango 2877 sedette sedere VERB 1
Fango 6158 sedette sedette VERB 1 ❌
Fango 15611 sedette sedere VERB 1
Fango 19210 sedette sedette VERB 1 ❌
Fango 29600 sedette sedettere VERB 1 ❌ ❌
Fango 35957 sedette sedette VERB 1 ❌
Fango 46882 sedette sedette VERB 1 ❌
Fango 50724 sedette sedere VERB 1
Fango 53211 sedette sedettere VERB 1 ❌ ❌
Fango 54928 sedette sedere VERB 1
Fango 55401 sedette sedere VERB 1
Fango 55563 sedette sedere VERB 1
Fango 56300 sedette sedere VERB 1
Fango 66436 sedette sedettere VERB 1 ❌ ❌
Fango 68023 sedette sedere VERB 1
Fango 68232 sedette sedettere VERB 1 ❌ ❌
Fango 69387 sedette sedette VERB 1 ❌
Fango 71478 sedette sedere VERB 1
Fango 72944 sedette sedere VERB 1
Fango 77500 sedette sedere VERB 1
Fango 81073 sedette sedere VERB 1
Fango 82497 sedette sedere VERB 1
Fango 85486 sedette sedettere VERB 1 ❌ ❌
Fango 88212 sedetti sedettare VERB 1 ❌ ❌

(*) The original texts are the property of the author and publisher; this excerpt contains exclusively derived data and statistical analyses produced for university research purposes (Master in Digital Humanities).

Possible fixes

Fix 1: Validate Training Data Against Lookup Tables (switching back to the lookup lemmatizer would be a downgrade and is not suggested for Italian)

Before training EditTreeLemmatizer:

  1. Extract lemma annotations from training corpus
  2. Validate against official "spacy-lookups-data"
  3. Flag mismatches for review
  4. Retrain with corrected annotations
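
Steps 1-3 could be sketched roughly as follows. This is only an illustration under stated assumptions: `flag_mismatches` is a hypothetical helper, the lookup dictionary holds just the entries quoted in this issue, and the annotation pairs are made-up examples mirroring the lemmas reported above, not actual training data.

```python
from typing import Iterable, List, Mapping, Tuple


def flag_mismatches(
    annotations: Iterable[Tuple[str, str]], lookup: Mapping[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (form, annotated_lemma, lookup_lemma) where the two disagree.

    Forms that are absent from the lookup table cannot be validated and are skipped.
    """
    mismatches = []
    for form, lemma in annotations:
        expected = lookup.get(form.lower())
        if expected is not None and expected != lemma:
            mismatches.append((form, lemma, expected))
    return mismatches


# Entries quoted from the lookup data in this issue.
lookup = {"chiese": "chiedere", "chiesi": "chiedere", "sedette": "sedere"}

# Illustrative (form, lemma) annotations; NOT taken from the real training corpus.
annotations = [
    ("chiese", "chiendere"),
    ("sedette", "sedere"),
    ("sedetti", "sedettare"),
]

flagged = flag_mismatches(annotations, lookup)
print(flagged)
```

In this sketch "sedetti" is not flagged because it has no lookup entry, which shows why the missing form would also need to be added to the lookup data.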

Fix 2: Hybrid Lemmatizer (Short-term)

Add fallback to lookup tables for known problematic verbs
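
One possible shape for such a fallback, written as a plain function so it does not depend on any spaCy internals. `lemma_with_fallback` and `VERB_TAGS` are hypothetical names, and the lookup dictionary again contains only the entries quoted in this issue.

```python
from typing import Mapping

# POS tags for which the verb lookup table is consulted (an assumption of this sketch).
VERB_TAGS = {"VERB", "AUX"}


def lemma_with_fallback(
    text: str, pos: str, predicted: str, verb_lookup: Mapping[str, str]
) -> str:
    """Prefer the lookup lemma for verb forms in the table; otherwise keep the prediction."""
    if pos in VERB_TAGS:
        return verb_lookup.get(text.lower(), predicted)
    return predicted


verb_lookup = {
    "chiese": "chiedere",
    "chiesi": "chiedere",
    "risedette": "risedere",
    "sedette": "sedere",
}

print(lemma_with_fallback("chiese", "VERB", "chiendere", verb_lookup))
print(lemma_with_fallback("sedetti", "VERB", "sedettare", verb_lookup))
```

The second call still returns the invented lemma, because "sedetti" is missing from the lookup data. In a pipeline, logic like this could presumably run in a custom component placed after the lemmatizer and write the result to `token.lemma_`; that wiring is not shown here.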

Fix 3: Post-Training Validation

Add validation layer that checks if output lemma is a valid Italian verb

Fix 4: Allow Lookup Override (User-facing)

Provide config option to use lookup tables for specific POS

Related Issues

  • Previous issue about imperative verbs: similar suggestion to switch to lookup-based lemmatizer
  • This seems to confirm lookup data is MORE RELIABLE than current EditTreeLemmatizer for Italian

References

Correct lookup data (apparently 3-4 years old):
https://raw.githubusercontent.com/explosion/spacy-lookups-data/refs/heads/master/spacy_lookups_data/data/it_lemma_lookup_verb.json

spaCy docs on switching lemmatizers:
https://spacy.io/models/#design-modify (not suggested for Italian)

Workaround (temporary)

Users must implement post-processing:

LEMMA_FIXES = {
    'chiendere': 'chiedere',
    'sedettere': 'sedere',
    'sedettare': 'sedere',
    # ... more manual corrections
}

This should not be necessary: some of the correct data already exists in spacy-lookups-data.
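
For completeness, a sketch of how the mapping might be applied to (text, lemma) pairs exported from a processed document. `fix_lemmas` is a hypothetical helper, and the sample pairs mirror outputs reported in this issue.

```python
from typing import Iterable, List, Tuple

LEMMA_FIXES = {
    "chiendere": "chiedere",
    "sedettere": "sedere",
    "sedettare": "sedere",
}


def fix_lemmas(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Replace known-bad lemmas; anything not listed in LEMMA_FIXES is left unchanged."""
    return [(text, LEMMA_FIXES.get(lemma, lemma)) for text, lemma in pairs]


# (text, lemma) pairs as reported for spaCy 3.8.11 in this issue.
pairs = [("chiese", "chiendere"), ("sedetti", "sedettare"), ("chiesi", "chiesi")]
fixed = fix_lemmas(pairs)
print(fixed)
```

A lemma-keyed mapping like this only repairs invented lemmas. Unlemmatized forms such as "chiesi" pass through unchanged and would need a form-keyed table instead.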

Conclusion: EditTreeLemmatizer (introduced in spaCy 3.3) ignores correct lookup data during training, causing a regression from spaCy 3.2 and earlier versions.

(c)
The data were extracted and processed in accordance with art. 70-quater L. 633/1941 and art. 3 Directive (EU) 2019/790, regulating the exception for text and data mining for scientific research purposes.
