NLP Classification Labels Have Many Similarities, Replace to Only Have One
I have been trying to use the fuzzywuzzy library in Python to find the percentage similarity between strings in the labels. The problem I am having is that there are still many strings t
Solution 1:
Using the function in this link, you can find a mapping as follows:
from fuzzywuzzy import fuzz

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    for i in range(len(input_list)):
        for j in range(len(input_list)):
            if i < j and fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)
    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]
    return mapping
Let's see how to use it:
# Let's assume items in labels are unique.
# If they are not unique, it will work anyway but will be slower.
labels = [
"Cable replaced",
"Cable replaced.",
"Camera is up and recording",
"Chat closed due to inactivity.",
"Closing as duplicate",
"Closing as duplicate.",
"Closing duplicate ticket.",
"Closing ticket.",
"Completed",
"Connection to IDF restored",
]
mapping = generate_mapping(labels)
# Print to see mapping
print("\n".join(["{:<50}: {}".format(k, v) for k, v in mapping.items()]))
Output:
Cable replaced                                    : Cable replaced
Cable replaced.                                   : Cable replaced
Camera is up and recording                        : Camera is up and recording
Chat closed due to inactivity.                    : Chat closed due to inactivity.
Closing as duplicate                              : Closing as duplicate
Closing as duplicate.                             : Closing as duplicate
Closing duplicate ticket.                         : Closing duplicate ticket.
Closing ticket.                                   : Closing ticket.
Completed                                         : Completed
Connection to IDF restored                        : Connection to IDF restored
So, you can find a mapping for h['resolution'].unique(), then update the h['resolution'] column using this mapping. Since I don't have your dataframe, I can't try it. Based on this, I guess you can use the following:
for k, v in mapping.items():
    if k != v:
        h.loc[h['resolution'] == k, 'resolution'] = v
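If you want to try the whole idea without fuzzywuzzy or a dataframe at hand, here is a minimal, self-contained sketch. It rests on a few assumptions, so treat it as an illustration rather than a drop-in replacement: similarity() is a hypothetical stand-in for fuzz.ratio built on the standard library's difflib.SequenceMatcher (its scores may differ slightly from fuzzywuzzy's), the resolution list is made-up data standing in for h['resolution'], and generate_mapping() here is a simplified variant that compares each label only against the labels already kept, not the exact double loop above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Hypothetical stand-in for fuzz.ratio: a 0-100 score from the standard library.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def generate_mapping(labels, threshold=90):
    # Map each label to the first already-kept label that is at least
    # `threshold` percent similar; otherwise keep the label as its own target.
    kept = []
    mapping = {}
    for label in labels:
        match = next((k for k in kept if similarity(k, label) >= threshold), None)
        if match is None:
            kept.append(label)
            match = label
        mapping[label] = match
    return mapping


# Made-up column values standing in for h['resolution'] (an assumption).
resolution = [
    "Cable replaced.",
    "Completed",
    "Cable replaced",
    "Closing as duplicate.",
    "Closing as duplicate",
    "Completed",
]

# Build the mapping from the unique values only, as with h['resolution'].unique().
unique_labels = sorted(set(resolution))
mapping = generate_mapping(unique_labels)

# Apply the mapping to every row.
cleaned = [mapping[value] for value in resolution]
print(cleaned)
```

Sorting the unique labels first makes the result independent of row order: the shorter variant without the trailing period comes first and becomes the label that is kept. With a real dataframe, I believe h['resolution'] = h['resolution'].map(mapping) would perform the same update in one step, since every unique value is a key in the mapping, but I have not tried it on your data.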