Invalid utf-8 sequence in mecab-jumandic/AuxV.csv #81

@gooosedev

Hello,

I wanted to make a copy of the mecab-jumandic and mecab-ipadic dictionaries with English labels so that the output is easier for me to read. Of all the CSV files, AuxV.csv from mecab-jumandic seems to have an encoding issue:
the last 6 rows contain invalid (truncated) UTF-8 characters.

I am using mecab v0.996.

Here are the raw bytes, grabbed with the following:

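# Open in binary mode so the malformed bytes are printed as-is.
with open('AuxV.csv', 'rb') as f:
    f.seek(0x104dB)  # jump to the affected rows at the end of the file
    dat = f.read()
    for elt in dat.split(b'\n'):
        print(elt)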

which should give:

b'\xe3\x81\xa7\xe3\x81,627,627,10239,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xa7\xe3\x81,\xe3\x81\xa7\xe3\x81,*'
b'\xe3\x81\xa7\xe3\x81,624,624,10239,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xa7\xe3\x81,\xe3\x81\xa7\xe3\x81,*'
b'\xe3\x81\xbe\xe3\x81,628,628,13429,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xbe\xe3\x81,\xe3\x81\xbe\xe3\x81,*'
b'\xe3\x81\xbe\xe3\x81,625,625,13429,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xbe\xe3\x81,\xe3\x81\xbe\xe3\x81,*'
b'\xe3\x81\x93\xe3\x81\xa8\xe3\x81,626,626,9656,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,*'
b'\xe3\x81\x93\xe3\x81\xa8\xe3\x81,623,623,9656,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,*'

Notice that on each row the last character of field 0 (as well as fields 8 and 9) of the CSV has only 2 bytes (e.g. \xe3\x81\xa7\xe3\x81 on the first line), whereas these Japanese characters are encoded on 3 bytes in UTF-8.
Here are the problematic lines "decoded", with the 'Replacement Character' standing in for the malformed character.

で�,627,627,10239,助動詞,*,無活用型,語幹,で�,で�,*
で�,624,624,10239,助動詞,*,無活用型,基本形,で�,で�,*
ま�,628,628,13429,助動詞,*,無活用型,語幹,ま�,ま�,*
ま�,625,625,13429,助動詞,*,無活用型,基本形,ま�,ま�,*
こと�,626,626,9656,助動詞,*,無活用型,語幹,こと�,こと�,*
こと�,623,623,9656,助動詞,*,無活用型,基本形,こと�,こと�,*
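
The truncated sequences shown above can also be located programmatically. The following is only a minimal sketch: `find_invalid_utf8` is a hypothetical helper name (not part of MeCab), and `row` is a shortened copy of the first affected row. It relies on the `start` and `end` attributes of Python's `UnicodeDecodeError` to report where each malformed sequence sits.

```python
def find_invalid_utf8(data: bytes):
    """Yield (offset, bad_bytes) for every malformed UTF-8 sequence in data.

    Hypothetical helper for illustration; not part of MeCab.
    """
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return  # the rest of the buffer is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice being decoded.
            start, end = pos + err.start, pos + err.end
            yield start, data[start:end]
            pos = end  # resume right after the malformed bytes


# A complete hiragana character takes 3 bytes in UTF-8.
assert len("で".encode("utf-8")) == 3

# Shortened copy of the first affected row of AuxV.csv (assumption: the
# remaining fields are omitted here only to keep the example short).
row = b'\xe3\x81\xa7\xe3\x81,627,627,10239,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e'

for offset, bad in find_invalid_utf8(row):
    print(f"offset {offset}: {bad!r}")
```

On this row the helper reports the two-byte fragment that follows the first, complete character, which matches the manual reading of the bytes.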

Edit: it seems that the files left-id.def and right-id.def in mecab-jumandic are also affected
(the same lines in both files, at offset 0x992c in both).

b'623 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\x93\xe3\x81\xa8\xe3\x81'
b'624 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xa7\xe3\x81'
b'625 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xbe\xe3\x81'
b'626 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\x93\xe3\x81\xa8\xe3\x81'
b'627 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xa7\xe3\x81'
b'628 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xbe\xe3\x81'

The same lines decoded:

623 助動詞,*,無活用型,基本形,こと��
624 助動詞,*,無活用型,基本形,で��
625 助動詞,*,無活用型,基本形,ま��
626 助動詞,*,無活用型,語幹,こと��
627 助動詞,*,無活用型,語幹,で��
628 助動詞,*,無活用型,語幹,ま��

model.def is affected as well.
Note that the following is only a subset of the affected lines:

-0.1929270437042460     B61:助動詞,*,無活用型,基本形,こと��/特殊,記号
-0.1929270437042460     B61:助動詞,*,無活用型,基本形,こと��/特殊,記号,*
-0.3165235222457874     B61:助動詞,*,無活用型,基本形,で��/特殊,記号
-0.3165235222457874     B61:助動詞,*,無活用型,基本形,で��/特殊,記号,*
-0.8171902874249221     B61:助動詞,*,無活用型,基本形,ま��/特殊,記号
-0.8171902874249221     B61:助動詞,*,無活用型,基本形,ま��/特殊,記号,*
-0.0518953821051652     B61:助動詞,*,無活用型,語幹,こと��/特殊,記号
-0.0518953821051652     B61:助動詞,*,無活用型,語幹,こと��/特殊,記号,*
-0.1104600796336064     B61:助動詞,*,無活用型,語幹,で��/特殊,記号
-0.1104600796336064     B61:助動詞,*,無活用型,語幹,で��/特殊,記号,*
-0.6066947509404275     B61:助動詞,*,無活用型,語幹,ま��/特殊,記号
-0.6066947509404275     B61:助動詞,*,無活用型,語幹,ま��/特殊,記号,*

The complete list can be found with the following snippet:

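# Read raw bytes and decode line by line, so that a bad line
# cannot disturb the reading of the lines that follow it.
with open('model.def', 'rb') as f:
    for i, raw in enumerate(f, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError:
            print(f"line {i}")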
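
The same per-line check can be run over every file of a dictionary directory instead of one file at a time. This is a sketch under assumptions: the function names `invalid_utf8_lines` and `scan_dictionary` are hypothetical, and so is the idea that the `*.csv` and `*.def` files in the directory are the ones worth checking.

```python
from pathlib import Path


def invalid_utf8_lines(path):
    """Return the 1-based numbers of the lines in `path` that are not valid UTF-8."""
    bad = []
    # Binary mode: each line is decoded on its own, so a malformed
    # line does not affect how the following lines are read.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(lineno)
    return bad


def scan_dictionary(dic_dir, patterns=("*.csv", "*.def")):
    """Map each affected file name to its list of invalid line numbers.

    `patterns` is an assumption about which files to check.
    """
    report = {}
    for pattern in patterns:
        for path in sorted(Path(dic_dir).glob(pattern)):
            bad = invalid_utf8_lines(path)
            if bad:
                report[path.name] = bad
    return report
```

Calling `scan_dictionary("mecab-jumandic")` (the directory path is an assumption about the local layout) would return a dictionary that lists only the files containing at least one undecodable line.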

If you need more information, please let me know.
