Description
I was using get_matches() to get the top 5 matches. Now that I am moving to production, I thought of using fit/transform instead, but it seems to return only the single top match for each item. Is there any other way to get the top 5 matches with fit/transform?
I am matching current text notes (long, non-semantic text) against historical ones. The historical data will be large, in the lakhs (hundreds of thousands) of rows. To make the code more efficient, I plan to pass the historical text notes to fit and the current text notes to transform, and to refit the model monthly.
Sample Current/Input Data:
| ID | Alloc_No | Text_Notes |
|---|---|---|
| 2354657 | 78 | RHJ…..//32456hjfg//vkcmEGHJJJYMM |
| 4354657 | 35 | TFHGDVASFHC4636587//5748UJKNM |
| 345676 | 889 | WUSERHIFKDJVN//23475//IUOSJDFGKV |
| 34747586 | 57 | YWEIHFDSK//2435467//WEKSFDHLV |
| 465768 | 3777 | 324TYVHBJN//435465//HUJNKHJKN |
Sample Reference/Historical Data:
| ID | Alloc_No | Text_Notes |
|---|---|---|
| 2354657 | 78 | RHJ…..//32456hjfg//vkcmEGHJJJYMM |
| 4354657 | 35 | TFHGDVASFHC4636587//5748UJKNM |
| 345676 | 889 | WUSERHIFKDJVN//23475//IUOSJDFGKV |
| 34747586 | 57 | YWEIHFDSK//2435467//WEKSFDHLV |
| 465768 | 3777 | 324TYVHBJN//435465//HUJNKHJKN |
| 2354657 | 78 | RHJ…..//32456hjfg//vkcmEGHJJJYMM |
| 4354657 | 35 | TFHGDVASFHC4636587//5748UJKNM |
| 345676 | 889 | WUSERHIFKDJVN//23475//IUOSJDFGKV |
| 34747586 | 57 | YWEIHFDSK//2435467//WEKSFDHLV |
| 465768 | 3777 | 324TYVHBJN//435465//HUJNKHJKN |
Sample Code:
Old Code using get_matches():
import pandas as pd
from polyfuzz import PolyFuzz
# Pass the reference (historical) text notes as "to_list"
to_list = Ref.Text_Notes.to_list()
dict2 = {}
for i in range(Input.shape[0]):
    # Pass the new text notes one by one to get similarity scores against the reference items, then take the top 5
    from_list = [Input.Text_Notes[i]]
    model = PolyFuzz("TF-IDF").match(from_list, to_list)
    matches = model.get_matches().sort_values(by='Similarity', ascending=False)
    # Join on the row index to attach ID and Alloc_No from the reference data
    matches1 = pd.merge(matches, Ref, left_index=True, right_index=True)
    dict1 = matches1[['ID', 'Similarity', 'Alloc_No', 'From', 'To']].to_dict('index')
    list1 = list(dict1.items())[:5]
    dict2.update({Input['ID'][i]: list1})
dict2
New Changed Code for Production using Fit/Transform:
# Fit on the reference (historical) text notes
# Frequency: monthly
from polyfuzz import PolyFuzz
from_list = Ref.Text_Notes.to_list()
model = PolyFuzz("TF-IDF")
model.fit(from_list)
model.save("TF-IDF")
# Match the new text notes
# Frequency: daily
model = PolyFuzz.load("TF-IDF")  # load the saved model once, outside the loop
dict2 = {}
for i in range(Input.shape[0]):
    to_list = [Input.Text_Notes[i]]
    matches = model.transform(to_list)
    print(matches)
dict2
The issue now is that transform does not give me similarity scores for all reference items, only the top 1 match, whereas I need the top 5.
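To make the behaviour I am after concrete, here is a minimal sketch of "fit once, then query the top N" written in plain Python without PolyFuzz. It is only an illustration under my own assumptions (character trigrams, smoothed TF-IDF, cosine similarity); the names `fit_index` and `top_matches` are hypothetical and this is not how PolyFuzz is implemented.

```python
import heapq
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lower-cased string."""
    text = text.lower()
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def normalise(vec):
    """Scale a sparse vector (dict) to unit L2 length; empty if the norm is zero."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def fit_index(reference_notes, n=3):
    """'Fit' step: compute IDF weights and unit-length TF-IDF vectors once."""
    grams_per_doc = [Counter(char_ngrams(note, n)) for note in reference_notes]
    doc_freq = Counter(g for grams in grams_per_doc for g in grams)
    n_docs = len(reference_notes)
    # Smoothed IDF, so n-grams present in every note still get a positive weight.
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    vectors = [
        normalise({g: tf * idf[g] for g, tf in grams.items()})
        for grams in grams_per_doc
    ]
    return {"idf": idf, "vectors": vectors, "notes": list(reference_notes), "n": n}


def top_matches(index, query, k=5):
    """'Transform' step: return up to k (position, note, similarity) tuples."""
    counts = Counter(char_ngrams(query, index["n"]))
    # n-grams unseen during fit are ignored, since they have no IDF weight.
    q = normalise(
        {g: tf * index["idf"][g] for g, tf in counts.items() if g in index["idf"]}
    )
    scored = (
        (sum(w * vec.get(g, 0.0) for g, w in q.items()), pos)
        for pos, vec in enumerate(index["vectors"])
    )
    # Highest similarity first; ties go to the earlier reference row.
    best = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [(pos, index["notes"][pos], round(sim, 3)) for sim, pos in best]


# Demo on a few of the sample notes from above.
reference = [
    "TFHGDVASFHC4636587//5748UJKNM",
    "WUSERHIFKDJVN//23475//IUOSJDFGKV",
    "YWEIHFDSK//2435467//WEKSFDHLV",
    "324TYVHBJN//435465//HUJNKHJKN",
]
index = fit_index(reference)  # would be the monthly step
for pos, note, sim in top_matches(index, "YWEIHFDSK//2435467//WEKSFDHLV", k=3):
    print(pos, note, sim)  # would be the daily step
```

The returned position can be used to look up ID and Alloc_No in the reference data. What I would like to know is whether fit/transform in PolyFuzz can give me the same kind of top-N result per input note.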