Invalid utf-8 sequence in mecab-jumandic/AuxV.csv

Hello,

I wanted to make a copy of the mecab-jumandic and mecab-ipadic dictionaries with English labels so that the output is easier for me to read. Of all the CSV files, ``AuxV.csv`` from mecab-jumandic seems to have an encoding issue:
the last 6 rows contain invalid (truncated) UTF-8 sequences.

I am using mecab v0.996.

Here are the bytes, grabbed with the following:
```Python
with open('AuxV.csv', 'rb') as a:
    a.seek(0x104dB)
    dat = a.read()
    for elt in dat.split(b'\n'):
        print(elt)
```
which should give:
```
b'\xe3\x81\xa7\xe3\x81,627,627,10239,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xa7\xe3\x81,\xe3\x81\xa7\xe3\x81,*'
b'\xe3\x81\xa7\xe3\x81,624,624,10239,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xa7\xe3\x81,\xe3\x81\xa7\xe3\x81,*'
b'\xe3\x81\xbe\xe3\x81,628,628,13429,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xbe\xe3\x81,\xe3\x81\xbe\xe3\x81,*'
b'\xe3\x81\xbe\xe3\x81,625,625,13429,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xbe\xe3\x81,\xe3\x81\xbe\xe3\x81,*'
b'\xe3\x81\x93\xe3\x81\xa8\xe3\x81,626,626,9656,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,*'
b'\xe3\x81\x93\xe3\x81\xa8\xe3\x81,623,623,9656,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,*'
```
Notice that on each row the last character of field 0 (as well as fields 8 and 9) of the CSV has only 2 bytes (e.g. on the first line, ``\xe3\x81\xa7\xe3\x81`` ends with the incomplete ``\xe3\x81``), whereas these Japanese characters are encoded on 3 bytes in UTF-8.
Here are the problematic lines "decoded", with the Replacement Character (U+FFFD) standing in for the malformed character.
```
で�,627,627,10239,助動詞,*,無活用型,語幹,で�,で�,*
で�,624,624,10239,助動詞,*,無活用型,基本形,で�,で�,*
ま�,628,628,13429,助動詞,*,無活用型,語幹,ま�,ま�,*
ま�,625,625,13429,助動詞,*,無活用型,基本形,ま�,ま�,*
こと�,626,626,9656,助動詞,*,無活用型,語幹,こと�,こと�,*
こと�,623,623,9656,助動詞,*,無活用型,基本形,こと�,こと�,*
```
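For reference, a decoded view like the one above can be produced with Python's built-in `errors="replace"` handler. This is only a minimal sketch: the helper name `decode_lossy` is made up for illustration, and the sample bytes are the start of the first problematic row shown earlier.

```python
def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each malformed sequence."""
    return raw.decode("utf-8", errors="replace")


# Start of the first problematic row from AuxV.csv (copied from the dump above)
sample = b"\xe3\x81\xa7\xe3\x81,627,627,10239"
print(decode_lossy(sample))
```

The truncated two-byte sequence before the comma is replaced, while the valid three-byte character in front of it decodes normally.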

Edit: it seems that the files ``left-id.def`` and ``right-id.def`` in mecab-jumandic are also affected
(the same lines in both, starting at offset 0x992c in both files).
```
b'623 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\x93\xe3\x81\xa8\xe3\x81'
b'624 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xa7\xe3\x81'
b'625 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xbe\xe3\x81'
b'626 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\x93\xe3\x81\xa8\xe3\x81'
b'627 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xa7\xe3\x81'
b'628 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xbe\xe3\x81'
```
```
623 助動詞,*,無活用型,基本形,こと��
624 助動詞,*,無活用型,基本形,で��
625 助動詞,*,無活用型,基本形,ま��
626 助動詞,*,無活用型,語幹,こと��
627 助動詞,*,無活用型,語幹,で��
628 助動詞,*,無活用型,語幹,ま��
```

``model.def`` is affected as well.
_Note: this is only a subset of the affected lines._
```
-0.1929270437042460     B61:助動詞,*,無活用型,基本形,こと��/特殊,記号
-0.1929270437042460     B61:助動詞,*,無活用型,基本形,こと��/特殊,記号,*
-0.3165235222457874     B61:助動詞,*,無活用型,基本形,で��/特殊,記号
-0.3165235222457874     B61:助動詞,*,無活用型,基本形,で��/特殊,記号,*
-0.8171902874249221     B61:助動詞,*,無活用型,基本形,ま��/特殊,記号
-0.8171902874249221     B61:助動詞,*,無活用型,基本形,ま��/特殊,記号,*
-0.0518953821051652     B61:助動詞,*,無活用型,語幹,こと��/特殊,記号
-0.0518953821051652     B61:助動詞,*,無活用型,語幹,こと��/特殊,記号,*
-0.1104600796336064     B61:助動詞,*,無活用型,語幹,で��/特殊,記号
-0.1104600796336064     B61:助動詞,*,無活用型,語幹,で��/特殊,記号,*
-0.6066947509404275     B61:助動詞,*,無活用型,語幹,ま��/特殊,記号
-0.6066947509404275     B61:助動詞,*,無活用型,語幹,ま��/特殊,記号,*
```
The complete list can be found with the following snippet:
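```Python
# Read raw bytes and decode one line at a time, so that a bad line
# cannot disturb the decoder state for the lines that follow it.
with open('model.def', 'rb') as f:
    for i, raw in enumerate(f, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError:
            print(f"line {i}")
```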
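The same check can be run over all the files mentioned in this issue while also reporting the byte offset of each bad sequence. This is a rough sketch rather than a definitive tool: the function name `find_invalid_utf8` is hypothetical, and it assumes the files are in the current directory.

```python
from pathlib import Path


def find_invalid_utf8(path):
    """Return (line_number, byte_offset_in_file) for each line that fails to decode.

    Only the first malformed sequence on each line is reported.
    """
    hits = []
    offset = 0  # byte offset of the current line's first byte
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the index of the bad byte within this line
                hits.append((lineno, offset + exc.start))
            offset += len(raw)
    return hits


if __name__ == "__main__":
    # File names taken from this issue; missing files are skipped.
    for name in ("AuxV.csv", "left-id.def", "right-id.def", "model.def"):
        if Path(name).exists():
            for lineno, pos in find_invalid_utf8(name):
                print(f"{name}: line {lineno}, offset {pos:#x}")
```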
If you need more information, please let me know.
