
Update dependencies to support python 3.10 to 3.13 #20

Merged

gsaint merged 15 commits into main from
perso/gsaint/chore/dss14-sc-290422-make-nlp-text-preparation-plugin-python312
Jan 29, 2026

Update dependencies to support python 3.10 to 3.13#20
gsaint merged 15 commits intomainfrom
perso/gsaint/chore/dss14-sc-290422-make-nlp-text-preparation-plugin-python312

Conversation

@gsaint (Contributor) commented Jan 27, 2026

Add Python 3.10 to 3.13 support by introducing version-conditional dependencies for packages that have breaking changes or lack support across Python versions.

Changes

  • Replace pycld3 with pycld2 for language detection on Python >= 3.10
    • Very similar API and compatible behavior
    • Reuse the pycld3 language ID mapping
  • Add spaCy 3.x support
    • Add explicit lemmatizer component initialization for blank language models, an API change introduced in spaCy 3.0
    • Use spaCy 3.x for Python >= 3.10
  • Fix pandas 2.0 compatibility: iteritems() → items()
  • Fix NumPy 2.0 compatibility: np.NaN → np.nan
  • Testing
    • Expand unit test coverage to make sure backward compatibility is preserved
    • Add a make command to run unit tests inside Debian containers across Python 3.6-3.13
    • Add a GitHub Actions workflow to run unit tests across Python 3.8-3.13
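The pandas rename above is mechanical, but where both pandas 1.x and 2.x must be supported at once, a feature-detecting shim avoids version sniffing. The helper below is an illustrative sketch (`iter_items` is a hypothetical name, not code from this PR):

```python
# Hypothetical helper illustrating the pandas 2.0 rename described above:
# pandas 2.0 removed Series.iteritems() in favor of Series.items().
# Feature-detect the new method instead of checking the pandas version.

def iter_items(series):
    """Yield (index, value) pairs on both pandas 1.x and 2.x APIs."""
    items = getattr(series, "items", None)
    if items is None:
        items = series.iteritems  # pandas < 2.0 fallback
    return items()
```

On pandas >= 2.0 only the first branch ever runs, so once 1.x support is dropped the shim collapses to a plain `series.items()` call.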

Local unit tests

Run unit tests on all supported Python versions:

    make docker-test-all

Run unit tests on a single Python version:

    make docker-test-py312

Verification

  • Language detection
    • pycld3 and pycld2 are only used for text longer than 140 characters; otherwise the existing langid-based code path is used
  • Spell checking and text cleaning
    • Verify that spaCy tokenization and lemmatization work on Python 3.9 and on Python 3.10 to 3.13
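The 140-character threshold described above amounts to a small dispatcher. The sketch below is an illustration of that routing rule only; `detect_cld` and `detect_langid` are hypothetical stand-ins for the real pycld2/langid calls, not the plugin's actual code:

```python
# Illustrative dispatch: CLD backends only for long texts, langid otherwise.
CLD_MIN_CHARS = 140  # threshold stated in the PR description

def detect_language(doc, detect_cld, detect_langid):
    """Route documents longer than the threshold to CLD, the rest to langid."""
    if len(doc) > CLD_MIN_CHARS:
        return detect_cld(doc)
    return detect_langid(doc)
```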

@gsaint gsaint added the dependencies label (Pull requests that update a dependency file) Jan 27, 2026
        lang_probability = float(language_detection_object[1])
        return (lang_id, lang_probability)

    def _cld_detection(self, doc: AnyStr) -> (AnyStr, float):

@gsaint (Contributor, Author) commented:

Core change for CLD

    nlp = spacy.blank(language)  # spaCy language without models (https://spacy.io/usage/models)
    # spacy 3.x requires explicit lemmatizer component for blank languages
    # Not all languages have lookup data, so we wrap in try/except
    if spacy.about.__version__.startswith("3"):

@gsaint (Contributor, Author) commented:

Core change for spacy

@gsaint gsaint marked this pull request as ready for review January 28, 2026 13:41
@nicolasdalsass (Contributor) left a comment:

Python 3.6 needs fixing.

Great job on resurrecting unit tests with a proper setup for both local runs and GitHub Actions 👍

Comment on lines +172 to +181

    # spacy 3.x requires explicit lemmatizer component for blank languages
    # Not all languages have lookup data, so we wrap in try/except
    if spacy.about.__version__.startswith("3"):
        try:
            nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
            nlp.initialize()
        except Exception:
            # Language doesn't support lookup lemmatization, continue without it
            if "lemmatizer" in nlp.pipe_names:
                nlp.remove_pipe("lemmatizer")

@nicolasdalsass (Contributor) commented:

I don't understand the logic behind "spacy requires a lemmatizer, but on the other hand, we can just remove it if something goes wrong"?

@gsaint (Contributor, Author) replied:

It's a bit tricky, and it was raised by the unit tests.

This is only required for blank languages (languages without a specific pre-trained model that we can load), or when use_models=False is set to keep memory usage light.

We use lookup mode to automatically find lemmatization data for the language; however, not all languages have lookup data, and when they don't, an exception is raised, so we fall back to no lemmatizer.
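The try-then-remove pattern being discussed can be restated generically, decoupled from spaCy. The helper below is a sketch of the control flow only; `pipeline` is any object exposing `add_pipe`/`initialize`/`remove_pipe`/`pipe_names`, mirroring (but not being) the spaCy 3.x `Language` API:

```python
# Generic sketch of the fallback discussed above: try to attach an optional
# pipeline component; if initialization fails (e.g. no lookup data for the
# language), detach it again and continue without it.

def add_optional_component(pipeline, name, **config):
    """Return True if the component was attached and initialized, else False."""
    try:
        pipeline.add_pipe(name, config=config)
        pipeline.initialize()
        return True
    except Exception:
        # Clean up the half-added component so the pipeline stays usable.
        if name in pipeline.pipe_names:
            pipeline.remove_pipe(name)
        return False
```

The point the review settles on: the lemmatizer is wanted when available, but its absence must degrade gracefully rather than break text preparation for that language.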

    spacy[lookups,ja,th]==3.8.11; python_version >= '3.10'
    symspellpy==6.7.0
    tqdm==4.60.0
    tqdm==4.66.3
@nicolasdalsass (Contributor) commented:

The wheel doesn't install for Python 3.6. Let's add a conditional requirement so that it actually installs properly on 3.6, since we keep supporting it.
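The conditional requirement being requested can be expressed with PEP 508 environment markers, as the spacy line in this file already does. The shape below is illustrative only; the boundary version is an assumption, not the pin chosen in the actual fix:

```
tqdm==4.60.0; python_version < '3.7'
tqdm==4.66.3; python_version >= '3.7'
```

pip evaluates the marker at install time, so each interpreter resolves exactly one of the two pins.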

@gsaint (Contributor, Author) replied on Jan 29, 2026:

Fixed in a013070:

    ============================= 40 passed in 41.93s ==============================
    [DONE] Python 3.6 tests completed

Good catch.

@nicolasdalsass nicolasdalsass self-requested a review January 29, 2026 10:58
@nicolasdalsass (Contributor) left a comment:

LGTM

@gsaint gsaint merged commit 256c39b into main Jan 29, 2026
8 checks passed
@gsaint gsaint deleted the perso/gsaint/chore/dss14-sc-290422-make-nlp-text-preparation-plugin-python312 branch January 29, 2026 13:09
@gsaint gsaint restored the perso/gsaint/chore/dss14-sc-290422-make-nlp-text-preparation-plugin-python312 branch January 29, 2026 13:10
@nicolasdalsass nicolasdalsass deleted the perso/gsaint/chore/dss14-sc-290422-make-nlp-text-preparation-plugin-python312 branch January 29, 2026 14:05
