[DO NOT MERGE] Experimental support for translating scraped WARC files before ZIMing #546
Draft
ItzCobaltboy wants to merge 1 commit into openzim:main
Author
@benoit74, I tried implementing a rough experimental design, but I ran into version/dependency clashes. OpenZIM works on Python 3.14, however […] Any ideas what can be done?
Collaborator
Pydantic v1 is too old; I suspect they use it to keep support for old Python versions (Pydantic v2 requires 3.9, while […]). No real suggestion on my end; this is typically a dead end until […]
Reference to #525
This pull request adds experimental support for translating WARC-archived HTML content to a target language (using Argos Translate), with integration into the main workflow and new dependencies. The translation is applied before the WARC-to-ZIM conversion, and the process can be triggered via a new command-line argument. Additionally, automated test commands are provided for both translated and untranslated workflows.
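As a rough illustration of the flow described above, the sketch below shows an optional `--translate` argument gating a translation step that runs before the WARC-to-ZIM conversion. Only the flag name comes from this description; `build_parser`, `plan_steps`, the `warcs` positional argument, and the step labels are hypothetical and are not taken from `src/zimit/zimit.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with an optional --translate argument (hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Sketch: optional translation before WARC-to-ZIM conversion"
    )
    # Assumption: the flag takes a target language code such as "fr".
    parser.add_argument(
        "--translate",
        metavar="LANG",
        default=None,
        help="translate WARC HTML content to LANG before conversion",
    )
    # Assumption: WARC inputs are passed as positional arguments.
    parser.add_argument("warcs", nargs="*", help="WARC files or directories")
    return parser


def plan_steps(argv: list[str]) -> list[str]:
    """Return the ordered step labels implied by the command line."""
    args = build_parser().parse_args(argv)
    steps = []
    if args.translate:
        # Translation, when requested, must come before conversion.
        steps.append(f"translate:{args.translate}")
    steps.append("warc2zim")
    return steps


if __name__ == "__main__":
    print(plan_steps(["--translate", "fr", "crawl.warc.gz"]))
    print(plan_steps(["crawl.warc.gz"]))
```

Keeping translation as a separate, optional step in front of the conversion means the untranslated workflow is unchanged when the flag is absent.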
Translation feature integration:
* `src/zimit/translate.py`: uses Argos Translate and BeautifulSoup to translate HTML content in WARC files to a specified language, updating HTTP headers and handling encoding.
* `src/zimit/zimit.py`: allows users to specify a `--translate` argument to translate all WARC files to a target language before conversion.
* `iter_warc_files`: recursively yields WARC files from directories or file lists, supporting translation and processing.

Dependencies and environment:
* `beautifulsoup4`, `warcio`, and `argostranslate` as dependencies in `pyproject.toml` to support translation and WARC processing.
* […] `3.14>=`, thus the runtime is failing.

Testing and automation:
* `test_commands.sh`: automates running the workflow with and without translation, verifying outputs.
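
For readers unfamiliar with the helper mentioned above, here is a minimal sketch of what a recursive WARC file iterator could look like. It reuses the name `iter_warc_files` from the description, but the signature, the `WARC_SUFFIXES` constant, and the sorting behaviour are assumptions; this is not the implementation in this PR.

```python
from pathlib import Path
from typing import Iterable, Iterator, Union

# Assumption: WARC files are recognised by these filename suffixes.
WARC_SUFFIXES = (".warc", ".warc.gz")


def iter_warc_files(inputs: Iterable[Union[str, Path]]) -> Iterator[Path]:
    """Yield WARC files from a mix of files and directories (hypothetical signature)."""
    for item in inputs:
        path = Path(item)
        if path.is_dir():
            # Walk the directory tree; sorting keeps the order deterministic.
            for child in sorted(path.rglob("*")):
                if child.is_file() and child.name.endswith(WARC_SUFFIXES):
                    yield child
        elif path.is_file() and path.name.endswith(WARC_SUFFIXES):
            yield path
        # Anything else (missing path, non-WARC file) is silently skipped.
```

A generator like this lets the translation step and the conversion step share the same input discovery logic without loading a full file list into memory.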