[DO NOT MERGE] Experimental Support for translating scraped WARC files before ZIMing #546

Draft
ItzCobaltboy wants to merge 1 commit into openzim:main from ItzCobaltboy:translation-poc

@ItzCobaltboy

Reference to #525

This pull request adds experimental support for translating WARC-archived HTML content to a target language (using Argos Translate), with integration into the main workflow and new dependencies. The translation is applied before the WARC-to-ZIM conversion, and the process can be triggered via a new command-line argument. Additionally, automated test commands are provided for both translated and untranslated workflows.
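
The ordering described above (optional translation first, conversion second) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only the `--translate` flag name comes from the PR, while the positional `warcs` argument and the `translate_warc` and `convert_to_zim` callables are hypothetical placeholders.

```python
import argparse
from typing import Callable, List, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="zimit-sketch")
    # --translate is the flag named in this PR; metavar and help are illustrative.
    parser.add_argument("--translate", metavar="LANG", default=None,
                        help="target language code, for example 'fr'")
    # Hypothetical: the real tool does not necessarily take WARC paths like this.
    parser.add_argument("warcs", nargs="+", help="WARC files to process")
    return parser


def run(
    argv: Sequence[str],
    translate_warc: Callable[[str, str], None],
    convert_to_zim: Callable[[List[str]], None],
) -> None:
    args = build_parser().parse_args(argv)
    if args.translate:
        # Translation happens in place, before any conversion starts.
        for warc in args.warcs:
            translate_warc(warc, args.translate)
    # Conversion always runs, whether or not translation was requested.
    convert_to_zim(args.warcs)
```

Passing the two steps in as callables keeps the sketch independent of Argos Translate and of the actual conversion code, so the ordering can be checked in isolation.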

Translation feature integration:

  • Added a new module src/zimit/translate.py that uses Argos Translate and BeautifulSoup to translate HTML content in WARC files to a specified language, updating HTTP headers and handling encoding.
  • Integrated the translation step into the main workflow in src/zimit/zimit.py, allowing users to specify a --translate argument to translate all WARC files to a target language before conversion.
  • Added a utility function iter_warc_files to recursively yield WARC files from directories or file lists, supporting translation and processing.
  • Imported the new translation functionality into the main script.
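
For illustration, a helper like iter_warc_files could look roughly like the sketch below. It is an assumption about the behaviour described in the list above, not the code in this PR: the signature, the accepted suffixes (`.warc`, `.warc.gz`), and the sorted traversal order are all guesses.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable, Iterator

# Assumed suffixes; the PR does not state which extensions are matched.
WARC_SUFFIXES = (".warc", ".warc.gz")


def iter_warc_files(paths: Iterable[str | Path]) -> Iterator[Path]:
    """Yield WARC files from a mix of files and directories."""
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Walk the directory tree; sorting makes the order deterministic.
            for child in sorted(path.rglob("*")):
                if child.is_file() and child.name.endswith(WARC_SUFFIXES):
                    yield child
        elif path.is_file() and path.name.endswith(WARC_SUFFIXES):
            yield path
        # Anything else (missing path, non-WARC file) is silently skipped.
```

A generator keeps memory use flat for large crawl directories, since each file can be translated as soon as it is discovered.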

Dependencies and environment:

  • Added beautifulsoup4, warcio, and argostranslate as dependencies in pyproject.toml to support translation and WARC processing.
  • Changed the Python version in the configuration from 3.14 to 3.11 for compatibility with new dependencies.
  • Dependency issue: argostranslate requires Pydantic v1, which is not supported on Python >= 3.14, so the workflow fails at runtime on that version.

Testing and automation:

  • Added a new script test_commands.sh to automate running the workflow with and without translation, verifying outputs.

@ItzCobaltboy
Author

@benoit74, I tried implementing a rough experimental design, but I ran into a version/dependency clash.

OpenZIM works on Python 3.14; however, argos-translate uses Pydantic v1, which is not supported on Python >= 3.14, leading to a runtime clash.

Any ideas what can be done?

@benoit74
Collaborator

Pydantic v1 is too old. I suspect they use it to keep support for old Python versions (Pydantic v2 requires 3.9, while argos-translate still supports 3.5).

No real suggestion on my end; this is typically a dead end until argos-translate agrees to update to Pydantic v2 (and drop support for older Python versions). You can leave them a comment here: argosopentech/argos-translate#501. Maybe they will agree to update to Pydantic v2; at least it is worth asking.
