Fit/Transform does not give top n matches #81

@shaluchiipi

I was using get_matches() to get the top 5 matches. Now that I am moving to production, I would like to use fit/transform instead, but it seems to return only the single best match for each item. Is there any way to get the top 5 matches with fit/transform?

I am matching current text notes (long, non-semantic text) against historical ones. The historical data will be large, running into lakhs (hundreds of thousands) of records. To make the code more efficient, I plan to pass the historical text notes to fit and the current text notes to transform, and to retrain monthly.

Sample Current/Input Data:

| ID | Alloc_No | Text_Notes |
| --- | --- | --- |
| 2354657 | 78 | RHJ…..//32456hjfg//vkcmEGHJJJYMM |
| 4354657 | 35 | TFHGDVASFHC4636587//5748UJKNM |
| 345676 | 889 | WUSERHIFKDJVN//23475//IUOSJDFGKV |
| 34747586 | 57 | YWEIHFDSK//2435467//WEKSFDHLV |
| 465768 | 3777 | 324TYVHBJN//435465//HUJNKHJKN |

Sample Reference/Historical Data:

| ID | Alloc_No | Text_Notes |
| --- | --- | --- |
| 2354657 | 78 | RHJ…..//32456hjfg//vkcmEGHJJJYMM |
| 4354657 | 35 | TFHGDVASFHC4636587//5748UJKNM |
| 345676 | 889 | WUSERHIFKDJVN//23475//IUOSJDFGKV |
| 34747586 | 57 | YWEIHFDSK//2435467//WEKSFDHLV |
| 465768 | 3777 | 324TYVHBJN//435465//HUJNKHJKN |
| 2354657 | 78 | RHJ…..//32456hjfg//vkcmEGHJJJYMM |
| 4354657 | 35 | TFHGDVASFHC4636587//5748UJKNM |
| 345676 | 889 | WUSERHIFKDJVN//23475//IUOSJDFGKV |
| 34747586 | 57 | YWEIHFDSK//2435467//WEKSFDHLV |
| 465768 | 3777 | 324TYVHBJN//435465//HUJNKHJKN |

Sample Code:

Old Code using get_matches():

import pandas as pd
from polyfuzz import PolyFuzz

# Pass the reference (historical) text notes as "to_list"
to_list = Ref.Text_Notes.to_list()

dict2 = {}  # maps each input ID to its top 5 matches
for i in range(Input.shape[0]):
    # Pass the new text notes one at a time, score them against the
    # reference items, and keep the top 5
    from_list = [Input.Text_Notes[i]]
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches = model.get_matches().sort_values(by='Similarity', ascending=False)
    matches1 = pd.merge(matches, Ref, left_index=True, right_index=True)
    dict1 = matches1[['ID', 'Similarity', 'Alloc_No', 'From', 'To']].to_dict('index')
    list1 = list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})

dict2

New Changed Code for Production using Fit/Transform:

from polyfuzz import PolyFuzz

# Fit the reference (historical) text notes
# Frequency: monthly
from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")

# Match the new text notes
# Frequency: daily
model = PolyFuzz.load("TF-IDF")  # load once, outside the loop
dict2 = {}
for i in range(Input.shape[0]):
    to_list = [Input.Text_Notes[i]]
    matches = model.transform(to_list)
    print(matches)
dict2

The issue is that transform does not give me similarity scores against all the reference items. It returns only the top 1 match for each input, whereas I need the top 5.
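For illustration, the behaviour being asked for (fit once on the reference notes, then return the n best reference matches for each new note) can be sketched in plain Python. This is a minimal, library-independent sketch, not PolyFuzz's implementation or API: the `TopNMatcher` class, its `n` and `top_n` parameters, and the smoothed IDF formula are all assumptions made for the example, and the brute-force scoring would be too slow for lakhs of reference rows.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping, lower-cased character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


class TopNMatcher:
    """Hypothetical toy matcher: fit on reference strings, return top-n per query."""

    def __init__(self, n=3, top_n=5):
        self.n = n          # character n-gram size (assumed default)
        self.top_n = top_n  # number of matches to keep per query

    def _vector(self, text):
        # L2-normalised TF-IDF weights; n-grams not seen during fit are skipped.
        counts = Counter(g for g in char_ngrams(text, self.n) if g in self.idf)
        vec = {g: c * self.idf[g] for g, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {g: w / norm for g, w in vec.items()} if norm else {}

    def fit(self, reference):
        # Learn IDF weights from the reference strings and cache their vectors.
        self.reference = list(reference)
        doc_freq = Counter(g for text in self.reference
                           for g in set(char_ngrams(text, self.n)))
        total = len(self.reference)
        self.idf = {g: math.log((1 + total) / (1 + df)) + 1
                    for g, df in doc_freq.items()}
        self.ref_vectors = [self._vector(text) for text in self.reference]
        return self

    def transform(self, queries):
        # Cosine similarity of each query against every reference string,
        # keeping the top_n highest-scoring (similarity, index, text) tuples.
        results = {}
        for query in queries:
            q_vec = self._vector(query)
            scored = []
            for idx, r_vec in enumerate(self.ref_vectors):
                sim = sum(w * r_vec.get(g, 0.0) for g, w in q_vec.items())
                scored.append((round(sim, 3), idx, self.reference[idx]))
            scored.sort(key=lambda item: (-item[0], item[1]))
            results[query] = scored[:self.top_n]
        return results


# Fit on reference notes (monthly), then match new notes (daily).
reference_notes = [
    "TFHGDVASFHC4636587//5748UJKNM",
    "WUSERHIFKDJVN//23475//IUOSJDFGKV",
    "YWEIHFDSK//2435467//WEKSFDHLV",
    "324TYVHBJN//435465//HUJNKHJKN",
]
matcher = TopNMatcher(top_n=3).fit(reference_notes)
top_matches = matcher.transform(["YWEIHFDSK//2435467//WEKSFDHLV"])
```

Each value in `top_matches` is a list of up to `top_n` tuples sorted by descending similarity, and the reference index in each tuple can be used to look up `ID` and `Alloc_No` in the reference table.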