Bug Description
This is a REGRESSION bug: the EditTreeLemmatizer in spaCy 3.3+ produces incorrect lemmas for Italian verbs that spaCy 3.2 and earlier lemmatized correctly with the lookup-based lemmatizer.
Examples:
| Token | spaCy ≤3.2 (lookup) | spaCy 3.3+ (EditTree) | Correct | Corpus Evidence |
|---|---|---|---|---|
| chiese | chiedere ✅ | chiedere ✅ / chiese ❌ / chiendere ❌ | chiedere | 31% correct, 65% invented, 2% not lemmatized |
| chiesi | chiedere ✅ | chiesi ❌ | chiedere | 100% not lemmatized |
| sedette | sedere ✅ | sedere ✅ / sedette ❌ / sedettere ❌ | sedere | 57% correct, 22% invented, 22% not lemmatized |
| sedetti | (missing) | sedettare ❌ | sedere | 100% invented |
| risedette | risedere ✅ | risedette ❌ | risedere | 100% not lemmatized |
Critical:
- "chiendere", "sedettere", "sedettare" are non-existent Italian verbs
- Correct lemmas exist in spacy-lookups-data (except "sedetti", which is missing) but are ignored or inconsistently applied by EditTreeLemmatizer
- Context-dependent behavior: the same token produces different lemmas unpredictably
Evidence of Regression
spaCy 3.2 and earlier (correct data exists):
Four of the five correct lemmas are present in the official lookup data:
https://raw.githubusercontent.com/explosion/spacy-lookups-data/refs/heads/master/spacy_lookups_data/data/it_lemma_lookup_verb.json
Verified on 16 January 2025:
$ curl -s https://raw.githubusercontent.com/explosion/spacy-lookups-data/refs/heads/master/spacy_lookups_data/data/it_lemma_lookup_verb.json | jq 'to_entries | map(select(.key == "chiese" or .key == "chiesi" or .key == "sedette" or .key == "sedetti" or .key == "risedette")) | map({key, value})'
Output:
[
  {
    "key": "chiese",
    "value": "chiedere"  # ✅ Correct
  },
  {
    "key": "chiesi",
    "value": "chiedere"  # ✅ Correct
  },
  {
    "key": "risedette",
    "value": "risedere"  # ✅ Correct (non-reflexive form; "risedersi" is the reflexive variant)
  },
  {
    "key": "sedette",
    "value": "sedere"  # ✅ Correct
  }
]
"sedetti": MISSING ❌
Key findings:
- ✅ "chiese" → "chiedere" (correct, present in lookup)
- ✅ "chiesi" → "chiedere" (correct, present in lookup)
- ✅ "risedette" → "risedere" (correct, present in lookup)
- ✅ "sedette" → "sedere" (correct, present in lookup)
- ❌ "sedetti" → MISSING from lookup data
This reveals two distinct issues:
- Regression (chiese, chiesi, sedette, risedette): correct lookup data exists but is ignored by EditTreeLemmatizer
- Incomplete data (sedetti): the form is missing from the lookup data, and EditTreeLemmatizer produces an invented lemma
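The same presence check can be scripted in Python. This is a minimal sketch that assumes the JSON file has already been downloaded and parsed into a dict. `sample_table` is a stand-in containing only the four entries shown in the output above, and `check_forms` is an illustrative helper name, not part of spaCy.

```python
from typing import Dict, Iterable, Optional


def check_forms(table: Dict[str, str], forms: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each form to its lemma in the lookup table, or None if it is absent."""
    return {form: table.get(form) for form in forms}


# Stand-in for json.load() on it_lemma_lookup_verb.json: only the entries
# shown in the jq output above (the real file is much larger).
sample_table = {
    "chiese": "chiedere",
    "chiesi": "chiedere",
    "risedette": "risedere",
    "sedette": "sedere",
}

forms = ["chiese", "chiesi", "sedette", "sedetti", "risedette"]
report = check_forms(sample_table, forms)
for form, lemma in report.items():
    print(f"{form}: {lemma if lemma is not None else 'MISSING'}")
```

With the real file, `sample_table` would be replaced by the parsed JSON; the rest stays the same.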
How to reproduce the behaviour
spaCy 3.8.11 (regression):
Important note: Single-token tests may produce different results than
real corpus contexts due to EditTreeLemmatizer's context-dependent behavior
(for "chiese", "sedette", "risedette").
The corpus evidence below shows actual distribution in a literary text.
`import spacy
nlp = spacy.load("it_core_news_lg")  # v3.8.11
# Test 1: chiese → chiendere (invented verb)
doc = nlp('chiese')
print(f'Lemma of chiese: {doc[0].lemma_}')
# Output: "Lemma of chiese: chiendere"
# ❌ INVENTED VERB and ❌ 65% errors in the corpus evidence (context-dependency bug)
# Test 2: chiesi → not lemmatized
doc = nlp('chiesi')
print(f'Lemma of chiesi: {doc[0].lemma_}')
# Output: "Lemma of chiesi: chiesi"
# ❌ NOT LEMMATIZED and ❌ never lemmatized in the corpus evidence
# Test 3: risedette → risedere (CORRECT)
doc = nlp('risedette')
print(f'Lemma of risedette: {doc[0].lemma_}')
# Output: "Lemma of risedette: risedere"
# ✅ CORRECT but ❌ not lemmatized in the corpus evidence (context-dependency bug)
# Test 4: sedette → not lemmatized
doc = nlp('sedette')
print(f'Lemma of sedette: {doc[0].lemma_}')
# Output: "Lemma of sedette: sedette"
# ❌ NOT LEMMATIZED and ❌ INCONSISTENT BEHAVIOR in the corpus evidence, with 22% INVENTED (context-dependency bug)
# Test 5: sedetti → sedettare (invented verb)
doc = nlp('sedetti')
print(f'Lemma of sedetti: {doc[0].lemma_}')
# Output: "Lemma of sedetti: sedettare"
# ❌ INVENTED VERB and ❌ 100% error in the corpus evidence
# Lemmatizer info
lemmatizer = nlp.get_pipe("lemmatizer")
print(type(lemmatizer).__name__)  # EditTreeLemmatizer
print(hasattr(lemmatizer, 'lookups'))  # False`
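Because single-token input can behave differently from running text, the forms are better checked inside full sentences. The helper below is a rough sketch: `lemmas_in_context` is a hypothetical name, and it only assumes that `nlp` is a callable returning tokens with `.text` and `.lemma_` attributes, as a loaded spaCy pipeline does.

```python
from collections import Counter
from typing import Callable, Iterable


def lemmas_in_context(nlp: Callable, sentences: Iterable[str], target: str) -> Counter:
    """Count the lemmas assigned to `target` (case-insensitive) across sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        for token in nlp(sentence):
            if token.text.lower() == target:
                counts[token.lemma_] += 1
    return counts


# Usage with the real pipeline (requires spaCy and the model to be installed):
#   import spacy
#   nlp = spacy.load("it_core_news_lg")
#   sentences = [...]  # sentences from your own corpus containing the form
#   print(lemmas_in_context(nlp, sentences, "chiese"))
```

A result with more than one key for the same form would reproduce the context-dependent behavior described in the note above.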
Root Cause
Training Data Corruption
The EditTreeLemmatizer was trained on a corpus with incorrect annotations:
chiese → chiendere ❌ (typo)
sedetti → sedettare ❌ (wrong transformation)
Consequence: The model learned wrong edit trees from corrupted training data.
Lookup Data Ignored
The correct lemmas exist in spacy-lookups-data repository (stable for 3+ years based on git history), but:
- EditTreeLemmatizer does not use lookup tables
- Training corpus was not validated against correct lookup data
- No fallback to lookup for problematic verbs
Result: Regression from spaCy 3.2 (which used lookup tables correctly).
Impact
Severity: HIGH (Regression)
- Users upgrading from spaCy 3.2 or earlier get WORSE lemmatization
- Data corruption: Produces non-existent Italian verbs
- Common verbs affected:
- "chiedere" (to ask) - extremely frequent
- "sedere" and its reflexive form "sedersi" (to sit) - common in narrative texts
- Corpus linguistics broken: Almost all lemma-dependent analysis corrupted
Real Corpus Evidence
Tested on an Italian literary corpus of 200K+ alphabetic tokens (Niccolò Ammaniti, Fango, 73K words, and others) (*) (c):
Detailed Breakdown (details in table below):
"chiese" (n=55, of which 54 tagged VERB; percentages are over all 55):
- Wrong lemma "chiendere": 36 occurrences (65%) ❌ INVENTED VERB
- Correct lemma "chiedere": 17 (31%) ✅
- Not lemmatized "chiese": 1 (2%) ❌
- POS error (the capitalized verb "Chiese" is tagged NOUN with lemma "chiesa"): 1 occurrence (excluded from the VERB analysis) ❌
"chiesi" (n=6):
- Not lemmatized "chiesi": 6 (100%) ❌
- Expected: "chiedere" ✅ (confirmed present in lookup data)
"sedette" (n=23) - INCONSISTENT BEHAVIOR:
- Correct "sedere": 13 (57%) ✅
- Invented "sedettere": 5 (22%) ❌
- Not lemmatized "sedette": 5 (22%) ❌
Note on inconsistency: the same token produces 3 different lemmas. This suggests context-dependent edit tree selection, where the correct transformation is applied only in specific syntactic contexts (e.g., with reflexive pronouns or specific subjects). This unpredictability breaks a fundamental assumption of corpus linguistics: that the same token should always receive the same lemma.
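A check like this can be automated by grouping the extracted (token, lemma) pairs and listing every token that receives more than one lemma. The sketch below uses the "sedette" counts reported above; `lemma_distribution` is an illustrative helper, not a spaCy function.

```python
from collections import Counter, defaultdict


def lemma_distribution(rows):
    """Group (token, lemma) pairs by token and count each lemma."""
    dist = defaultdict(Counter)
    for token, lemma in rows:
        dist[token][lemma] += 1
    return dist


# Counts for "sedette" taken from the breakdown above.
rows = (
    [("sedette", "sedere")] * 13
    + [("sedette", "sedettere")] * 5
    + [("sedette", "sedette")] * 5
)

dist = lemma_distribution(rows)
# Tokens mapped to more than one lemma are the inconsistent ones.
inconsistent = {tok: dict(c) for tok, c in dist.items() if len(c) > 1}
for tok, counts in inconsistent.items():
    total = sum(counts.values())
    for lemma, n in counts.items():
        print(f"{tok} -> {lemma}: {n}/{total} ({n / total:.0%})")
```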
"sedetti" (n=1):
- Invented "sedettare": 1 (100%) ❌
- Lookup data: "sedetti" is MISSING (should be added)
"risedette" (n=1) - compound form:
- Lemma: "risedette" (not lemmatized, verified from corpus)
- Expected: "risedere" (verb with prefix "ri-")
Total target tokens: 86
- Invented verbs: 42 (49%) ❌ SEVERE DATA CORRUPTION
- Not lemmatized: 13 (15%)
- Correct: 30 (35%)
- POS errors: 1 (1%)
49% of target tokens produce non-existent Italian verbs.
My Environment
Operating System: macOS-15.6.1-arm64-arm-64bit
Python version: 3.9.6
spaCy version: 3.8.11
Pipelines: it_core_news_sm (3.8.0), it_core_news_lg (3.8.0)
Real Corpus Evidence - table (*):
| book | i | text | lemma | pos | is_alpha | error | MOST SEVERE |
|---|---|---|---|---|---|---|---|
| Fango | 1629 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 1904 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 4163 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 4622 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 7419 | chiese | chiedere | VERB | 1 | | |
| Fango | 7637 | chiese | chiedere | VERB | 1 | | |
| Fango | 8167 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 8796 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 12589 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 12798 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 12875 | chiese | chiedere | VERB | 1 | | |
| Fango | 14062 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 14222 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 16461 | chiese | chiedere | VERB | 1 | | |
| Fango | 18119 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 18423 | chiese | chiedere | VERB | 1 | | |
| Fango | 19226 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 19734 | chiese | chiese | VERB | 1 | | |
| Fango | 20218 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 22402 | chiese | chiedere | VERB | 1 | | |
| Fango | 22960 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 24677 | chiese | chiedere | VERB | 1 | | |
| Fango | 24770 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 26222 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 28286 | chiese | chiedere | VERB | 1 | | |
| Fango | 30916 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 31474 | chiese | chiedere | VERB | 1 | | |
| Fango | 31585 | chiese | chiedere | VERB | 1 | | |
| Fango | 32135 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 34126 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 38020 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 48416 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 48944 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 49414 | chiese | chiedere | VERB | 1 | | |
| Fango | 50065 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 57050 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 59638 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 60217 | chiese | chiedere | VERB | 1 | | |
| Fango | 63766 | Chiese | chiesa | NOUN | 1 | ❌ NOT NOUN | |
| Fango | 65414 | chiese | chiedere | VERB | 1 | | |
| Fango | 65482 | chiese | chiedere | VERB | 1 | | |
| Fango | 66464 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 67194 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 69993 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 70124 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 72276 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 77247 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 77412 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 77754 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 78507 | chiese | chiedere | VERB | 1 | | |
| Fango | 79339 | chiese | chiedere | VERB | 1 | | |
| Fango | 80081 | chiese | chiedere | VERB | 1 | | |
| Fango | 81777 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 83309 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 84066 | chiese | chiendere | VERB | 1 | ❌ | ❌ |
| Fango | 88184 | chiesi | chiesi | VERB | 1 | | |
| Fango | 88297 | chiesi | chiesi | VERB | 1 | | |
| Fango | 90262 | chiesi | chiesi | VERB | 1 | | |
| Fango | 90301 | chiesi | chiesi | VERB | 1 | | |
| Fango | 90853 | chiesi | chiesi | VERB | 1 | | |
| Fango | 91177 | chiesi | chiesi | VERB | 1 | | |
| Fango | 86753 | risedette | risedette | VERB | 1 | ❌ | |
| Fango | 2877 | sedette | sedere | VERB | 1 | | |
| Fango | 6158 | sedette | sedette | VERB | 1 | ❌ | |
| Fango | 15611 | sedette | sedere | VERB | 1 | | |
| Fango | 19210 | sedette | sedette | VERB | 1 | ❌ | |
| Fango | 29600 | sedette | sedettere | VERB | 1 | ❌ | ❌ |
| Fango | 35957 | sedette | sedette | VERB | 1 | ❌ | |
| Fango | 46882 | sedette | sedette | VERB | 1 | ❌ | |
| Fango | 50724 | sedette | sedere | VERB | 1 | | |
| Fango | 53211 | sedette | sedettere | VERB | 1 | ❌ | ❌ |
| Fango | 54928 | sedette | sedere | VERB | 1 | | |
| Fango | 55401 | sedette | sedere | VERB | 1 | | |
| Fango | 55563 | sedette | sedere | VERB | 1 | | |
| Fango | 56300 | sedette | sedere | VERB | 1 | | |
| Fango | 66436 | sedette | sedettere | VERB | 1 | ❌ | ❌ |
| Fango | 68023 | sedette | sedere | VERB | 1 | | |
| Fango | 68232 | sedette | sedettere | VERB | 1 | ❌ | ❌ |
| Fango | 69387 | sedette | sedette | VERB | 1 | ❌ | |
| Fango | 71478 | sedette | sedere | VERB | 1 | | |
| Fango | 72944 | sedette | sedere | VERB | 1 | | |
| Fango | 77500 | sedette | sedere | VERB | 1 | | |
| Fango | 81073 | sedette | sedere | VERB | 1 | | |
| Fango | 82497 | sedette | sedere | VERB | 1 | | |
| Fango | 85486 | sedette | sedettere | VERB | 1 | ❌ | ❌ |
| Fango | 88212 | sedetti | sedettare | VERB | 1 | ❌ | ❌ |
(*) The original texts are the property of the author and publisher; this excerpt contains exclusively derived data and statistical analyses produced for university research purposes (Master in Digital Humanities).
Possible fixes
Fix 1: Validate Training Data Against Lookup Tables (a downgrade, and not suggested for the Italian language)
Before training EditTreeLemmatizer:
- Extract lemma annotations from training corpus
- Validate against official "spacy-lookups-data"
- Flag mismatches for review
- Retrain with corrected annotations
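The validation step of Fix 1 could look roughly like the sketch below. It assumes the training annotations are available as (form, lemma) pairs for VERB tokens and that the lookup table has been loaded into a dict. `find_mismatches` is a hypothetical helper, and the sample annotations are invented for illustration; they are not taken from the actual training corpus.

```python
def find_mismatches(annotations, lookup):
    """Return (form, annotated_lemma, lookup_lemma) for every disagreement.

    Forms absent from the lookup table are skipped, since there is
    nothing to compare them against.
    """
    mismatches = []
    for form, lemma in annotations:
        expected = lookup.get(form.lower())
        if expected is not None and expected != lemma:
            mismatches.append((form, lemma, expected))
    return mismatches


# Two entries from the verb lookup table quoted earlier in this report.
lookup = {"chiese": "chiedere", "sedette": "sedere"}

# Hypothetical VERB annotations, for illustration only.
annotations = [
    ("chiese", "chiendere"),
    ("sedette", "sedere"),
    ("sedetti", "sedettare"),  # not in the lookup table, so not flagged
]

mismatches = find_mismatches(annotations, lookup)
print(mismatches)
```

Restricting the comparison to VERB tokens matters here, because a form such as "chiese" is also a legitimate noun plural with a different lemma.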
Fix 2: Hybrid Lemmatizer (Short-term)
Add fallback to lookup tables for known problematic verbs
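One way such a fallback might work, as a sketch only: for tokens tagged VERB, prefer the lookup lemma when the surface form is in the table, and keep the model's prediction otherwise. `hybrid_lemma` and `make_override` are hypothetical names. The override only relies on tokens exposing `text`, `pos_` and a writable `lemma_`, which spaCy tokens do.

```python
def hybrid_lemma(text, pos, predicted, verb_lookup):
    """Prefer the lookup lemma for VERB tokens; otherwise keep the prediction."""
    if pos == "VERB":
        return verb_lookup.get(text.lower(), predicted)
    return predicted


def make_override(verb_lookup):
    """Build a function that rewrites lemmas on a Doc-like sequence of tokens."""
    def lookup_override(doc):
        for token in doc:
            token.lemma_ = hybrid_lemma(token.text, token.pos_, token.lemma_, verb_lookup)
        return doc
    return lookup_override


# Entries from the lookup data quoted earlier in this report.
verb_lookup = {"chiese": "chiedere", "chiesi": "chiedere", "sedette": "sedere"}
print(hybrid_lemma("chiese", "VERB", "chiendere", verb_lookup))
print(hybrid_lemma("Chiese", "NOUN", "chiesa", verb_lookup))
```

In spaCy v3, a function like `lookup_override` would typically be registered with `Language.component` and added with `nlp.add_pipe(..., after="lemmatizer")`; that wiring is left out here. Forms missing from the table, such as "sedetti", still fall through to the model's prediction.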
Fix 3: Post-Training Validation
Add validation layer that checks if output lemma is a valid Italian verb
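A validation layer needs some inventory of valid verb lemmas. The sketch below assumes, purely for illustration, that the set of values in the verb lookup table is an acceptable approximation of that inventory. `flag_unknown_verb_lemmas` is a hypothetical helper.

```python
def flag_unknown_verb_lemmas(predictions, known_lemmas):
    """Return the (form, lemma) pairs whose lemma is not in the known inventory."""
    return [(form, lemma) for form, lemma in predictions if lemma not in known_lemmas]


# Entries from the lookup data quoted earlier in this report.
verb_lookup = {
    "chiese": "chiedere",
    "chiesi": "chiedere",
    "sedette": "sedere",
    "risedette": "risedere",
}
# Assumption: the lookup values approximate the inventory of valid verb lemmas.
known_lemmas = set(verb_lookup.values())

# (form, predicted lemma) pairs as reported in the corpus evidence.
predictions = [
    ("chiese", "chiendere"),
    ("sedette", "sedere"),
    ("sedetti", "sedettare"),
]

flagged = flag_unknown_verb_lemmas(predictions, known_lemmas)
print(flagged)
```

This only flags suspicious output. Deciding what to substitute for a flagged lemma is a separate step.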
Fix 4: Allow Lookup Override (User-facing)
Provide config option to use lookup tables for specific POS
Related Issues
- Previous issue about imperative verbs: similar suggestion to switch to lookup-based lemmatizer
- This seems to confirm lookup data is MORE RELIABLE than current EditTreeLemmatizer for Italian
References
Correct lookup data (3-4 years old?):
https://raw.githubusercontent.com/explosion/spacy-lookups-data/refs/heads/master/spacy_lookups_data/data/it_lemma_lookup_verb.json
spaCy docs on switching lemmatizers (not suggested for the Italian language):
https://spacy.io/models/#design-modify
Workaround (temporary)
Users must implement post-processing:
LEMMA_FIXES = {
'chiendere': 'chiedere',
'sedettere': 'sedere',
'sedettare': 'sedere',
# ... more manual corrections
}
This should not be necessary, since some correct data already exists in spacy-lookups-data.
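For completeness, applying such a mapping could look like the sketch below. `fix_lemma` is a hypothetical helper, and the dictionary holds only the three invented lemmas identified in this report.

```python
# Invented lemmas observed in this report, mapped to the correct lemma.
LEMMA_FIXES = {
    "chiendere": "chiedere",
    "sedettere": "sedere",
    "sedettare": "sedere",
}


def fix_lemma(lemma: str) -> str:
    """Replace a known-bad lemma with its correction; pass everything else through."""
    return LEMMA_FIXES.get(lemma, lemma)


lemmas = ["chiendere", "sedere", "sedettare", "chiesi"]
fixed = [fix_lemma(lemma) for lemma in lemmas]
print(fixed)
```

A lemma-keyed mapping like this leaves unlemmatized forms such as "chiesi" untouched unless they are added as keys too, which is one more reason it is only a stopgap.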
Conclusion: EditTreeLemmatizer (introduced in spaCy 3.3) ignores correct lookup data during training, causing a regression from spaCy 3.2 and earlier versions.
(c) The data were extracted and processed in accordance with art. 70-quater L. 633/1941 and art. 3 Directive (EU) 2019/790, regulating the exception for text and data mining for scientific research purposes.