The validation of Dictionary::assignUserDictionaryCosts() is inappropriate #76

@CookieBox26

Problem

When using the UniDic dictionary and attempting to estimate the cost of user dictionaries, a validation error occurs at the following location.

  CHECK_DIE(cid.left_size() == matrix.left_size() &&
            cid.right_size() == matrix.right_size())
      << "Context ID files("
      << left_id_file
      << " or "
      << right_id_file << " may be broken: "
      << cid.left_size() << " " << matrix.left_size() << " "
      << cid.right_size() << " " << matrix.right_size();

dictionary.cpp(184) [cid.left_size() == matrix.left_size() && cid.right_size() == matrix.right_size()] Context ID files(C:/Program Files/MeCab/dic/unidic-csj-3.1.1-full\left-id.def or C:/Program Files/MeCab/dic/unidic-csj-3.1.1-full\right-id.def may be broken: 18552 15629 20859 15389

Causes and Solutions

This happens because the context_id is not unique across the lines of the left_id_file (or the right_id_file). For instance, the left_id_file of unidic-csj-3.1.1-full contains the following two lines, which share the same context_id:

7845 名詞,固有名詞,人名,姓,*,*,*,*,固,ツ促,促音形,*,1,*,*
7845 名詞,固有名詞,人名,姓,*,*,*,*,固,ツ促,基本形,*,1,*,*

Therefore, the check shown above must compare against the number of unique context_ids, not cid.left_size(), which is the number of lines in the left_id_file.

The left and right sides also appear to be reversed. I believe the check should be as follows:

  CHECK_DIE(cid.right_context_id_unique_size() == matrix.left_size() &&
            cid.left_context_id_unique_size()  == matrix.right_size())
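
For illustration, the unique context_id count proposed above could be computed roughly like this. This is only a sketch, not MeCab code: `ContextIdStats` and `count_context_ids` are hypothetical names, and it assumes each line of the id file starts with an integer context_id followed by whitespace and the feature string, as in the two lines quoted earlier.

```cpp
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not part of MeCab): summarizes a stream laid out
// like left-id.def / right-id.def, i.e. "<context_id> <features>" per line.
struct ContextIdStats {
  std::size_t line_count = 0;  // lines that start with an integer context_id
  std::size_t unique_ids = 0;  // distinct context_ids among those lines
};

ContextIdStats count_context_ids(std::istream& in) {
  ContextIdStats stats;
  std::set<int> seen;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    int id = 0;
    if (!(fields >> id)) continue;  // skip blank or malformed lines
    ++stats.line_count;
    seen.insert(id);                // duplicates are ignored by std::set
  }
  stats.unique_ids = seen.size();
  return stats;
}
```

For the two sample lines above, both of which carry context_id 7845, this yields a line_count of 2 but a unique_ids of 1, which is the kind of gap between the line count and the matrix size that trips the current check.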

As a workaround for estimating the cost of user dictionaries, you can rewrite only the first line of matrix.def, run the cost estimation, and then rebuild the user dictionary (as pointed out in https://zenn.dev/zagvym/articles/28056236903369).
However, I believe that fixing the validation described above is the fundamental solution.
