UnicodeDecodeError for multiple models #100

@nikit-srivastava

Hello,

I am getting the following UnicodeDecodeError:

  File "/usr/src/app/server.py", line 188, in <module>
    application = make_app(args)
  File "/usr/src/app/server.py", line 166, in make_app
    worker_pool = initialize_workers(services)
  File "/usr/src/app/server.py", line 147, in initialize_workers
    worker_pool[lang_pair] = TranslatorInterface(
  File "/usr/src/app/server.py", line 17, in __init__
    self.contentprocessor = ContentProcessor(
  File "/usr/src/app/content_processor.py", line 18, in __init__
    self.bpe_source = BPE(BPEcodes)
  File "/usr/src/app/apply_bpe.py", line 37, in __init__
    firstline = codes.readline()
  File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 54: invalid start byte

It occurs with the following models:

"it-en" : "https://object.pouta.csc.fi/OPUS-MT-models/it-en/opus-2019-12-18.zip" # SentencePiece
"ja-en" : "https://object.pouta.csc.fi/OPUS-MT-models/ja-en/opus-2019-12-18.zip" # SentencePiece
"id-en" : "https://object.pouta.csc.fi/OPUS-MT-models/id-en/opus-2019-12-18.zip" # SentencePiece
"bn-en" : "https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2020-02-11.zip" # SentencePiece
"et-en" : "https://object.pouta.csc.fi/OPUS-MT-models/et-en/opus-2019-12-18.zip" # SentencePiece
"lv-en" : "https://object.pouta.csc.fi/OPUS-MT-models/lv-en/opus-2019-12-18.zip" # SentencePiece
"th-en" : "https://object.pouta.csc.fi/OPUS-MT-models/th-en/opus-2020-01-16.zip" # SentencePiece
"uk-en" : "https://object.pouta.csc.fi/OPUS-MT-models/uk-en/opus-2020-01-16.zip" # SentencePiece

For most of them (all except "lv-en"), the error goes away when I switch to the BPE model. However, according to the shared metrics, the SentencePiece models are the ones with better translation performance.
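The traceback shows the failure happening while `apply_bpe.py` reads the first line of the codes file as UTF-8 text. One possible explanation, which I have not confirmed, is that the SentencePiece packages ship a binary model file where the server expects a plain-text BPE codes file. Below is a minimal sketch of how that could be checked; the function name `looks_like_utf8_text` and the example file name are hypothetical and not part of the server code.

```python
import codecs


def looks_like_utf8_text(path, probe_bytes=4096):
    """Return True if the start of the file decodes cleanly as UTF-8 text.

    Hypothetical helper: it only inspects the first `probe_bytes` bytes,
    so it is a heuristic, not a full validation of the file.
    """
    with open(path, "rb") as handle:
        data = handle.read(probe_bytes)

    # NUL bytes are valid UTF-8 but practically never appear in text files.
    if b"\x00" in data:
        return False

    # If we read the whole file, an incomplete trailing sequence is an error;
    # otherwise the probe may simply have cut a multi-byte character in half.
    reached_end = len(data) < probe_bytes
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data, final=reached_end)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    import sys

    # Usage (file name is an assumption): python check_codes.py source.spm
    for name in sys.argv[1:]:
        kind = "text" if looks_like_utf8_text(name) else "binary"
        print(f"{name}: {kind}")
```

If the file handed to `BPE(...)` is reported as binary, the `'utf-8' codec can't decode byte 0xc0` error would be expected, since `codes.readline()` assumes text input.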

Please let me know if I am doing something wrong.
