Problem
When using the UniDic dictionary and attempting to estimate the cost of user dictionaries, a validation error occurs at the following location.
mecab/mecab/src/dictionary.cpp
Lines 182 to 189 in 05481e7
    CHECK_DIE(cid.left_size() == matrix.left_size() &&
              cid.right_size() == matrix.right_size())
        << "Context ID files("
        << left_id_file
        << " or "
        << right_id_file << " may be broken: "
        << cid.left_size() << " " << matrix.left_size() << " "
        << cid.right_size() << " " << matrix.right_size();
dictionary.cpp(184) [cid.left_size() == matrix.left_size() && cid.right_size() == matrix.right_size()] Context ID files(C:/Program Files/MeCab/dic/unidic-csj-3.1.1-full\left-id.def or C:/Program Files/MeCab/dic/unidic-csj-3.1.1-full\right-id.def may be broken: 18552 15629 20859 15389
Causes and Solutions
The cause is that the context_id is not unique per line in the left_id_file (or right_id_file): several lines can share the same ID. For instance, the left_id_file of unidic-csj-3.1.1-full contains the following:
7845 名詞,固有名詞,人名,姓,*,*,*,*,固,ツ促,促音形,*,1,*,*
7845 名詞,固有名詞,人名,姓,*,*,*,*,固,ツ促,基本形,*,1,*,*
Therefore, at the above-mentioned location, validation must be performed using the number of unique context_ids, not cid.left_size() (the number of lines in the left_id_file).
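To make the difference between line count and unique ID count concrete, here is a minimal sketch of counting distinct context IDs. It assumes each line of the ID file starts with an integer ID followed by whitespace and the feature string, as in the excerpt above. `count_unique_context_ids` is a hypothetical helper written for illustration and is not part of MeCab.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not a MeCab function): counts the distinct context IDs
// in the text of a left-id.def / right-id.def style file.
// Assumption: every entry line begins with an integer ID.
std::size_t count_unique_context_ids(const std::string &id_file_text) {
  std::istringstream in(id_file_text);
  std::set<int> ids;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    int id = 0;
    if (fields >> id) {  // skip lines that do not start with an integer
      ids.insert(id);
    }
  }
  return ids.size();
}
```

For the two lines quoted above, which both carry ID 7845, this would count one ID where a line count gives two.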
The left and right sides also appear to be reversed. Ideally, I believe it should be as follows:
    CHECK_DIE(cid.right_context_id_unique_size() == matrix.left_size() &&
              cid.left_context_id_unique_size() == matrix.right_size())
A workaround for estimating the cost of user dictionaries involves only rewriting the first line of matrix.def and then rebuilding the user dictionary after cost estimation (pointed out in https://zenn.dev/zagvym/articles/28056236903369).
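The header rewrite in that workaround could be scripted roughly as below. This is only a sketch: it assumes the first line of matrix.def holds two whitespace-separated sizes, and `rewrite_matrix_header` is a hypothetical helper, not a MeCab tool. Which values to write, and when to restore the original line, should be taken from the linked article.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper (not a MeCab function): returns the matrix.def text
// with only its first line replaced by "<first_size> <second_size>".
// Assumption: the first line of matrix.def holds the two matrix sizes.
// All remaining lines are copied through unchanged.
std::string rewrite_matrix_header(const std::string &matrix_text,
                                  std::size_t first_size,
                                  std::size_t second_size) {
  std::istringstream in(matrix_text);
  std::ostringstream out;
  std::string line;
  if (std::getline(in, line)) {  // drop the old header, write the new one
    out << first_size << ' ' << second_size << '\n';
  }
  while (std::getline(in, line)) {  // copy the connection-cost lines as-is
    out << line << '\n';
  }
  return out.str();
}
```

Keeping a copy of the original first line makes it easy to restore matrix.def once cost estimation has finished.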
However, I believe that fixing the aforementioned validation location is the fundamental solution.