Mining new terms in specific domains for downstream NLP or other algorithmic work.
corpus_prep.py -- Sets up the initial data-preparation step and defines the BMES tagging logic.
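The BMES scheme labels each character of a known term: B (begin), M (middle), E (end), with S for single-character terms. A minimal sketch of that labelling (the function name `bmes_tag` is hypothetical, not necessarily the repo's API):

```python
# Hypothetical BMES tagger sketch: label each character of a term with
# B (begin), M (middle), E (end); a single-character term is tagged S.
def bmes_tag(term):
    if len(term) == 1:
        return [(term, "S")]
    tags = ["B"] + ["M"] * (len(term) - 2) + ["E"]
    return list(zip(term, tags))

print(bmes_tag("深度学习"))  # [('深', 'B'), ('度', 'M'), ('学', 'M'), ('习', 'E')]
```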
Matrix_generator.py -- Generates the transition matrix, emission matrix, and head (initial-state) matrix from the BMES-tagged corpus (the head matrix is not needed yet).
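The three matrices are maximum-likelihood counts over the tagged corpus: tag-to-tag transitions, tag-to-character emissions, and sentence-initial tag frequencies. A sketch under that assumption (function and variable names are illustrative, not the repo's):

```python
from collections import Counter, defaultdict

# Hypothetical sketch: estimate the transition matrix P(tag -> next_tag),
# emission matrix P(char | tag), and head matrix P(initial tag) by counting
# over sentences of (char, tag) pairs in BMES tagging.
def build_matrices(tagged_sentences):
    trans = defaultdict(Counter)   # trans[prev_tag][next_tag]
    emit = defaultdict(Counter)    # emit[tag][char]
    head = Counter()               # head[tag]: sentence-initial counts
    for sent in tagged_sentences:
        head[sent[0][1]] += 1
        for char, tag in sent:
            emit[tag][char] += 1
        for (_, t1), (_, t2) in zip(sent, sent[1:]):
            trans[t1][t2] += 1

    def norm(counter):  # turn raw counts into probabilities
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return ({t: norm(c) for t, c in trans.items()},
            {t: norm(c) for t, c in emit.items()},
            norm(head))
```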
HMM_Model.py -- Mines new terms in the test corpus by running Viterbi decoding over the matrices. Users can mine terms sentence by sentence to fit downstream NLP work.
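Viterbi decoding recovers the most likely BMES tag sequence for a sentence; B…E spans then become candidate terms. A compact log-space sketch (the interface is assumed, not the repo's exact one):

```python
import math

# Minimal Viterbi decoder sketch. Given a character sequence and the head,
# transition, and emission probabilities, return the most likely tag path.
# `floor` smooths unseen transitions/emissions so log() never sees zero.
def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-8):
    V = [{s: math.log(start_p.get(s, floor)) +
             math.log(emit_p[s].get(obs[0], floor)) for s in states}]
    path = {s: [s] for s in states}
    for ch in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best previous state leading into s
            prob, prev = max(
                (V[-2][p] + math.log(trans_p[p].get(s, floor)) +
                 math.log(emit_p[s].get(ch, floor)), p)
                for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]
```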
word_filter.py -- Handles word-filtering work, such as resolving overlaps between short and long terms and removing low-document-frequency candidates.
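One plausible reading of these two filters, sketched below: drop candidates seen in too few documents, and drop a short candidate that is covered by at least `threshold` longer candidates (the exact semantics of the repo's overlap filter are assumed here, and all names are hypothetical):

```python
# Hypothetical post-filter sketch for mined term candidates.
def filter_terms(candidates, doc_freq, threshold=2, min_df=2):
    # 1) low-document-frequency filter
    kept = {t for t in candidates if doc_freq.get(t, 0) >= min_df}
    # 2) short/long overlap filter: a short term subsumed by enough
    #    longer kept terms is discarded as a fragment
    for short in list(kept):
        covers = [long_t for long_t in kept
                  if short != long_t and short in long_t]
        if len(covers) >= threshold:
            kept.discard(short)
    return kept
```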
main.py -- The entry point for mining new terms over a whole corpus document, so the model digs out all the underlying candidate new terms within a large set at once.
odd_handle.py -- Used to inspect odd new terms: where the words occur in all related sentences, and what the attributes of the new terms' components are. Users can then add artificial new terms by hand, making the model's corpus knowledge more complete. (This can be regarded as a remedial step.)
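Such an inspection boils down to a keyword-in-context listing: show every sentence containing the candidate with the candidate highlighted. A small sketch of that idea (the helper `concordance` is hypothetical):

```python
# Hypothetical inspection helper: list every occurrence of a candidate term
# in context, bracketing the term so odd candidates can be reviewed by hand.
def concordance(term, sentences, width=10):
    hits = []
    for sent in sentences:
        i = sent.find(term)
        while i != -1:
            left = sent[max(0, i - width):i]
            right = sent[i + len(term):i + len(term) + width]
            hits.append(f"{left}[{term}]{right}")
            i = sent.find(term, i + 1)
    return hits
```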
config.txt description
data_name:
old: The domain terms we already have, used to sharpen our jieba tokenizer (path: ./data)
new: The new corpus file in which to mine new terms; originally stored in pd.DataFrame format (path: ./data)
corp_col: The column name of the corpus in the pd.DataFrame file
corp_file: The file name of the corpus cleaned in step 1 (corpus_prep.py)
strategy:
keep_stopwords: boolean; whether the model keeps stopwords. If you change this parameter, please rerun the whole workflow from step 1 to step 2
threshold: int; the threshold for the overlapping-words filter
The whole workflow will be upgraded continuously. For now, it is good to go.