Mining new terms in specific domains for downstream NLP or other algorithmic work.
corpus_prep.py -- Sets up the initial data-preparation step and defines the BMES tagging logic.
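The BMES scheme labels each character of a known term: B (begin), M (middle), E (end), with S for single-character terms. A minimal sketch of that labelling (the function name `bmes_tag` is hypothetical, not necessarily the repo's API):

```python
# Hypothetical BMES tagger sketch: label each character of a term with
# B (begin), M (middle), E (end); a single-character term is tagged S.
def bmes_tag(term):
    if len(term) == 1:
        return [(term, "S")]
    tags = ["B"] + ["M"] * (len(term) - 2) + ["E"]
    return list(zip(term, tags))

print(bmes_tag("深度学习"))  # [('深', 'B'), ('度', 'M'), ('学', 'M'), ('习', 'E')]
```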
Matrix_generator.py -- Generates the transition matrix, emission matrix, and head (initial-state) matrix from the BMES-tagged corpus (the head matrix is not needed yet).
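The three matrices are maximum-likelihood counts over the tagged corpus: tag-to-tag transitions, tag-to-character emissions, and sentence-initial tag frequencies. A sketch under that assumption (function and variable names are illustrative, not the repo's):

```python
from collections import Counter, defaultdict

# Hypothetical sketch: estimate the transition matrix P(tag -> next_tag),
# emission matrix P(char | tag), and head matrix P(initial tag) by counting
# over sentences of (char, tag) pairs in BMES tagging.
def build_matrices(tagged_sentences):
    trans = defaultdict(Counter)   # trans[prev_tag][next_tag]
    emit = defaultdict(Counter)    # emit[tag][char]
    head = Counter()               # head[tag]: sentence-initial counts
    for sent in tagged_sentences:
        head[sent[0][1]] += 1
        for char, tag in sent:
            emit[tag][char] += 1
        for (_, t1), (_, t2) in zip(sent, sent[1:]):
            trans[t1][t2] += 1

    def norm(counter):  # turn raw counts into probabilities
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return ({t: norm(c) for t, c in trans.items()},
            {t: norm(c) for t, c in emit.items()},
            norm(head))
```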
HMM_Model.py -- Mines new terms in the test corpus by running Viterbi decoding over the matrices. Users can mine terms sentence by sentence to fit downstream NLP work.
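Viterbi decoding recovers the most likely BMES tag sequence for a sentence; B…E spans then become candidate terms. A compact log-space sketch (the interface is assumed, not the repo's exact one):

```python
import math

# Minimal Viterbi decoder sketch. Given a character sequence and the head,
# transition, and emission probabilities, return the most likely tag path.
# `floor` smooths unseen transitions/emissions so log() never sees zero.
def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-8):
    V = [{s: math.log(start_p.get(s, floor)) +
             math.log(emit_p[s].get(obs[0], floor)) for s in states}]
    path = {s: [s] for s in states}
    for ch in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best previous state leading into s
            prob, prev = max(
                (V[-2][p] + math.log(trans_p[p].get(s, floor)) +
                 math.log(emit_p[s].get(ch, floor)), p)
                for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]
```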
word_filter.py -- Handles word-filtering work, such as resolving overlaps between short and long terms and removing low-document-frequency candidates.
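One plausible reading of these two filters, sketched below: drop candidates seen in too few documents, and drop a short candidate that is covered by at least `threshold` longer candidates (the exact semantics of the repo's overlap filter are assumed here, and all names are hypothetical):

```python
# Hypothetical post-filter sketch for mined term candidates.
def filter_terms(candidates, doc_freq, threshold=2, min_df=2):
    # 1) low-document-frequency filter
    kept = {t for t in candidates if doc_freq.get(t, 0) >= min_df}
    # 2) short/long overlap filter: a short term subsumed by enough
    #    longer kept terms is discarded as a fragment
    for short in list(kept):
        covers = [long_t for long_t in kept
                  if short != long_t and short in long_t]
        if len(covers) >= threshold:
            kept.discard(short)
    return kept
```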
main.py -- The entry point for mining new terms over a whole corpus document, so the model digs out all the underlying candidate new terms within a large set at once.
odd_handle.py -- Used to inspect odd new terms: where the words occur in all related sentences, and what the attributes of the new terms' components are. Users can then add artificial new terms by hand, making the model's corpus knowledge more complete. (This can be regarded as a remedial step.)
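Such an inspection boils down to a keyword-in-context listing: show every sentence containing the candidate with the candidate highlighted. A small sketch of that idea (the helper `concordance` is hypothetical):

```python
# Hypothetical inspection helper: list every occurrence of a candidate term
# in context, bracketing the term so odd candidates can be reviewed by hand.
def concordance(term, sentences, width=10):
    hits = []
    for sent in sentences:
        i = sent.find(term)
        while i != -1:
            left = sent[max(0, i - width):i]
            right = sent[i + len(term):i + len(term) + width]
            hits.append(f"{left}[{term}]{right}")
            i = sent.find(term, i + 1)
    return hits
```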
config.txt description
data_name:
old: The domain terms we already have, used to sharpen our jieba tokenizer (path: ./data)
new: The new corpus file in which to mine new terms; originally stored in pd.DataFrame format (path: ./data)
corp_col: The column name of the corpus in the pd.DataFrame file
corp_file: The file name of the corpus cleaned in step 1 (corpus_prep.py)
strategy:
keep_stopwords: boolean; whether the model keeps stopwords. If you change this parameter, please rerun the whole workflow from step 1 to step 2
threshold: int; the threshold for the overlapping-words filter
The whole workflow will be upgraded continuously. For now, it is good to go.