
NLP Classification Labels Have Many Similarities, Replace to Only Have One

I have been trying to use the fuzzywuzzy library in Python to find the percentage similarity between strings in the labels. The problem I am having is that there are still many strings t

Solution 1:

Using the function in this link, you can find a mapping as follows:

from fuzzywuzzy import fuzz


def replace_similars(input_list):
    # Replace strings that are 90% or more similar with the earlier one (in place)
    for i in range(len(input_list)):
        for j in range(len(input_list)):
            if i < j and fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[j] = input_list[i]


def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Let's see how to use it:

# Let's assume items in labels are unique.
# If they are not unique, it will work anyway but will be slower.
labels = [
    "Cable replaced",
    "Cable replaced.",
    "Camera is up and recording",
    "Chat closed due to inactivity.",
    "Closing as duplicate",
    "Closing as duplicate.",
    "Closing duplicate ticket.",
    "Closing ticket.",
    "Completed",
    "Connection to IDF restored",
]

mapping = generate_mapping(labels)


# Print to see mapping
print("\n".join(["{:<50}: {}".format(k, v) for k, v in mapping.items()]))

Output:

Cable replaced                                    : Cable replaced
Cable replaced.                                   : Cable replaced
Camera is up and recording                        : Camera is up and recording
Chat closed due to inactivity.                    : Chat closed due to inactivity.
Closing as duplicate                              : Closing as duplicate
Closing as duplicate.                             : Closing as duplicate
Closing duplicate ticket.                         : Closing duplicate ticket.
Closing ticket.                                   : Closing ticket.
Completed                                         : Completed
Connection to IDF restored                        : Connection to IDF restored
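
To see why only the near-identical labels collapse, here is a minimal sketch of the same idea using only the standard library. It assumes difflib.SequenceMatcher as a stand-in for fuzz.ratio, so the scores may differ slightly from what fuzzywuzzy reports. The names similarity and generate_mapping_stdlib are hypothetical and not part of any library. Instead of rewriting the list in place, it compares each label only against the representatives kept so far, which is a variation on the loop above rather than an exact copy of it.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stdlib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def generate_mapping_stdlib(labels, threshold=90):
    """Map every label to the first earlier label it closely resembles."""
    representatives = []  # labels kept as-is, in order of first appearance
    mapping = {}
    for label in labels:
        # Look for an already-kept label that is similar enough.
        match = next(
            (rep for rep in representatives if similarity(rep, label) >= threshold),
            None,
        )
        if match is None:
            # Nothing close enough: this label becomes its own representative.
            representatives.append(label)
            match = label
        mapping[label] = match
    return mapping


sample = [
    "Cable replaced",
    "Cable replaced.",
    "Closing duplicate ticket.",
    "Closing ticket.",
    "Completed",
]
for original, replacement in generate_mapping_stdlib(sample).items():
    print("{:<30}: {}".format(original, replacement))
```

With this scoring, "Cable replaced." differs from "Cable replaced" by a single character and clears the threshold, while "Closing duplicate ticket." and "Closing ticket." share too little text to be merged. Lowering the threshold merges more labels, at the risk of merging ones that mean different things.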

So, you can build a mapping from h['resolution'].unique() and then update the h['resolution'] column using this mapping. Since I don't have your dataframe, I can't test it, but I think something like the following should work:

for k, v in mapping.items():
    if k != v:
        h.loc[h['resolution'] == k, 'resolution'] = v
