Hello,
I wanted to make a copy of the mecab-jumandic and mecab-ipadic dictionaries with English labels so that the output is easier for me to read. Of all the CSV files, AuxV.csv from mecab-juman seems to have slight encoding issues: the last 6 rows seem to contain invalid or truncated UTF-8 characters.
I am using mecab v0.996.
Here are the bytes, grabbed with the following:
with open('AuxV.csv', 'rb') as a:
    a.seek(0x104dB)  # jump to the start of the affected rows
    dat = a.read()
for elt in dat.split(b'\n'):
    print(elt)

which should give:
b'\xe3\x81\xa7\xe3\x81,627,627,10239,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xa7\xe3\x81,\xe3\x81\xa7\xe3\x81,*'
b'\xe3\x81\xa7\xe3\x81,624,624,10239,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xa7\xe3\x81,\xe3\x81\xa7\xe3\x81,*'
b'\xe3\x81\xbe\xe3\x81,628,628,13429,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xbe\xe3\x81,\xe3\x81\xbe\xe3\x81,*'
b'\xe3\x81\xbe\xe3\x81,625,625,13429,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xbe\xe3\x81,\xe3\x81\xbe\xe3\x81,*'
b'\xe3\x81\x93\xe3\x81\xa8\xe3\x81,626,626,9656,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,*'
b'\xe3\x81\x93\xe3\x81\xa8\xe3\x81,623,623,9656,\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,\xe3\x81\x93\xe3\x81\xa8\xe3\x81,*'
Notice that on each row the last character of field 0 (as well as fields 8 and 9) of the CSV has only 2 bytes (e.g. the trailing \xe3\x81 in \xe3\x81\xa7\xe3\x81 on the first line), whereas these Japanese characters are encoded on 3 bytes in UTF-8.
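To make the byte-count argument concrete, here is a minimal sketch that compares the length announced by a UTF-8 lead byte with the number of bytes actually present. The helper names `expected_length` and `truncated_tail` are made up for this illustration; only `bytes.decode` and the `start`/`end` attributes of `UnicodeDecodeError` come from Python itself.

```python
def expected_length(lead: int) -> int:
    """Length of a UTF-8 sequence as announced by its lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {lead:#x}")


def truncated_tail(field: bytes):
    """Return (offending bytes, expected length), or None if the field decodes cleanly."""
    try:
        field.decode('utf-8')
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the bytes the decoder rejected.
        bad = field[err.start:err.end]
        return bad, expected_length(bad[0])
    return None


# Field 0 of the first row shown above.
bad, wanted = truncated_tail(b'\xe3\x81\xa7\xe3\x81')
print(f"{bad!r}: {len(bad)} bytes present, {wanted} expected")
```

The lead byte 0xE3 announces a 3-byte sequence, but only 2 bytes are present before the field ends.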
Here are the problematic lines, decoded with the replacement character (U+FFFD) standing in for each malformed character:
で�,627,627,10239,助動詞,*,無活用型,語幹,で�,で�,*
で�,624,624,10239,助動詞,*,無活用型,基本形,で�,で�,*
ま�,628,628,13429,助動詞,*,無活用型,語幹,ま�,ま�,*
ま�,625,625,13429,助動詞,*,無活用型,基本形,ま�,ま�,*
こと�,626,626,9656,助動詞,*,無活用型,語幹,こと�,こと�,*
こと�,623,623,9656,助動詞,*,無活用型,基本形,こと�,こと�,*
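For reference, a decoded view like the one above can be reproduced with Python's `errors='replace'` mode. This is only a sketch using the bytes of the first row quoted earlier; other tools may render the damaged bytes differently.

```python
# Bytes of the first affected row, copied from the dump above.
row = (
    b'\xe3\x81\xa7\xe3\x81,627,627,10239,'
    b'\xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,'
    b'\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,'
    b'\xe8\xaa\x9e\xe5\xb9\xb9,'
    b'\xe3\x81\xa7\xe3\x81,\xe3\x81\xa7\xe3\x81,*'
)

# errors='replace' substitutes U+FFFD for each undecodable sequence
# instead of raising UnicodeDecodeError.
decoded = row.decode('utf-8', errors='replace')
fields = decoded.split(',')

print(decoded)
# Indices of the fields that contain a replacement character.
print([i for i, f in enumerate(fields) if '\ufffd' in f])
```

The second print lists the damaged fields, which should be the same slots mentioned above (0, 8 and 9).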
Edit: it seems that the files left-id.def and right-id.def in mecab-juman are also affected (the same lines in both, starting at offset 0x992c in both files):
b'623 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\x93\xe3\x81\xa8\xe3\x81'
b'624 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xa7\xe3\x81'
b'625 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe5\x9f\xba\xe6\x9c\xac\xe5\xbd\xa2,\xe3\x81\xbe\xe3\x81'
b'626 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\x93\xe3\x81\xa8\xe3\x81'
b'627 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xa7\xe3\x81'
b'628 \xe5\x8a\xa9\xe5\x8b\x95\xe8\xa9\x9e,*,\xe7\x84\xa1\xe6\xb4\xbb\xe7\x94\xa8\xe5\x9e\x8b,\xe8\xaa\x9e\xe5\xb9\xb9,\xe3\x81\xbe\xe3\x81'
623 助動詞,*,無活用型,基本形,こと��
624 助動詞,*,無活用型,基本形,で��
625 助動詞,*,無活用型,基本形,ま��
626 助動詞,*,無活用型,語幹,こと��
627 助動詞,*,無活用型,語幹,で��
628 助動詞,*,無活用型,語幹,ま��
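A rough way to find where the damage starts in each file is to decode the whole file strictly and read the position from the exception. This is a sketch: `first_invalid_offset` is a hypothetical helper, and the file names are assumed to be in the current directory. Note that it reports the offset of the first invalid byte, not the start of the line containing it, so the value will be slightly larger than the line offsets quoted in this report.

```python
from pathlib import Path


def first_invalid_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        return err.start
    return None


# File names as mentioned in this report; adjust the paths to your copy.
for name in ('AuxV.csv', 'left-id.def', 'right-id.def'):
    path = Path(name)
    if not path.exists():
        continue  # skip files that are not present
    offset = first_invalid_offset(path.read_bytes())
    print(name, 'clean' if offset is None else hex(offset))
```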
model.def is affected as well. Note that the lines below are only a subset of the affected lines:
-0.1929270437042460 B61:助動詞,*,無活用型,基本形,こと��/特殊,記号
-0.1929270437042460 B61:助動詞,*,無活用型,基本形,こと��/特殊,記号,*
-0.3165235222457874 B61:助動詞,*,無活用型,基本形,で��/特殊,記号
-0.3165235222457874 B61:助動詞,*,無活用型,基本形,で��/特殊,記号,*
-0.8171902874249221 B61:助動詞,*,無活用型,基本形,ま��/特殊,記号
-0.8171902874249221 B61:助動詞,*,無活用型,基本形,ま��/特殊,記号,*
-0.0518953821051652 B61:助動詞,*,無活用型,語幹,こと��/特殊,記号
-0.0518953821051652 B61:助動詞,*,無活用型,語幹,こと��/特殊,記号,*
-0.1104600796336064 B61:助動詞,*,無活用型,語幹,で��/特殊,記号
-0.1104600796336064 B61:助動詞,*,無活用型,語幹,で��/特殊,記号,*
-0.6066947509404275 B61:助動詞,*,無活用型,語幹,ま��/特殊,記号
-0.6066947509404275 B61:助動詞,*,無活用型,語幹,ま��/特殊,記号,*
The complete list can be found with the following snippet:
# Read raw bytes and decode each line separately, so that one bad line
# does not disturb the reading of the lines after it.
with open('model.def', 'rb') as a:
    for i, raw in enumerate(a, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError:
            print(f"line {i}")

If you need more information, please let me know.