Outlier Prediction With Categorical Data In Pythons Scikit-Learn Lib
Solution 1:
Here are some observations that you may find useful:
- LabelEncoder transforms non-numerical values into integer labels and is meant for preprocessing the "labels" (the classes of a supervised learning problem). OneHotEncoder takes numerical or non-numerical categorical data and converts it into one-hot encodings; it is designed for categorical input features rather than for the target.
- As I understand it, you are trying to predict outliers (anomaly detection). It is not clear to me whether the connection between the utterances and the integers is hardcoded or whether you want to generate it somehow. If you want to generate it, the encoders mentioned above cannot do that: you fit them on some data (for LabelEncoder, that should be the labels) and then try to transform new, unrelated data, which fails (ValueError: y contains previously unseen labels). You can avoid the error by setting the handle_unknown parameter of OneHotEncoder to 'ignore' (from the documentation: "Whether to raise an error or ignore if an unknown categorical feature is present during transform"). Even if you can get what you want with one of these encoders, keep in mind that this is not their main purpose.
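The difference between the two encoders on unseen values can be seen in a small sketch. The training and test strings below are illustrative picks from the question's data, not the full dataset:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train = ['i love you', 'thats good', 'amazing']  # illustrative subset
new = ['you are wrong']                          # never seen during fit

# LabelEncoder raises a ValueError for values it did not see during fit.
label_encoder = LabelEncoder().fit(train)
try:
    label_encoder.transform(new)
    label_error = None
except ValueError as exc:
    label_error = str(exc)
print(label_error)

# OneHotEncoder with handle_unknown='ignore' does not raise;
# an unseen category becomes a row of zeros.
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
onehot_encoder.fit(np.array(train).reshape(-1, 1))
unseen_row = onehot_encoder.transform(np.array(new).reshape(-1, 1)).toarray()
print(unseen_row)
```

Note that the all-zero row carries no information about the new utterance, which is one reason this is not what the encoders are for.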
I assume you are giving a high value to "negative" utterances (even if "wrong" doesn't correspond to 65 in your train data) and a small value to "positive" ones. If you assume you already know the integer for every utterance, you can train the model only on what are considered "positive" examples and supply "negative" examples (outliers) only at test time. You don't train an IsolationForest on both "positive" and "negative" examples: that would just be basic binary classification, which can be modelled with a decision tree, for example. An intuitive example of IsolationForest can be seen here. Below is the code for your problem:
import numpy as np
from sklearn.ensemble import IsolationForest

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', ...]
integer_connection = [1, 1, 2, 3, 2, 2, 3, 1, 3, 4, 1, 2, 1, 2, 1, 2, 1, 1]
# IsolationForest expects a 2D array of shape (n_samples, n_features)
integer_connection = np.array([[n] for n in integer_connection])

# behaviour='new' was only needed on older scikit-learn releases;
# the parameter has since been removed, so it is omitted here
isolation_forest = IsolationForest(contamination='auto')
isolation_forest.fit(integer_connection)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
text_vals = [d[0] for d in list_of_val]
numeric_vals = np.array([[d[1]] for d in list_of_val])
print(integer_connection, numeric_vals)

# predict returns 1 for inliers and -1 for outliers
outliers = isolation_forest.predict(numeric_vals)
print(outliers)
In general, I don't think this approach is right for predicting outliers among natural-language utterances. For what you are trying to do in this specific example, I would recommend word-vector similarity (from spaCy, for example) or a simple bag-of-words approach.
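As a rough illustration of the bag-of-words idea, the utterances can be vectorized by word counts and passed to IsolationForest directly. This is only a sketch: the use of CountVectorizer, the random_state value, and the tiny training set are assumptions for illustration, and with this little data the predictions are not meaningful.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing']
test_texts = ['good work', 'you are wrong', 'this was amazing']

# Turn each utterance into a vector of word counts.
# Words not seen during fit are silently ignored at transform time.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

# Fit on the "normal" utterances only, then score the new ones.
forest = IsolationForest(contamination='auto', random_state=0)
forest.fit(X_train)
predictions = forest.predict(X_test)  # 1 = inlier, -1 = outlier

for text, label in zip(test_texts, predictions):
    print(text, '->', 'outlier' if label == -1 else 'inlier')
```

Unlike one-hot encoding whole sentences, this representation lets a new utterance share features with the training data whenever it shares words with it.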
If none of these points matter to you and you only want working code, here is my version of what you are trying to do:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
encodings = {}
num_data = [4, 1, 3, 2, 65, 3, 3]

# Putting text and numbers in one array casts the numbers to strings,
# so both columns are one-hot encoded as categories
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
onehots = onehot_encoder.fit_transform(np.array([[utt, no] for utt, no in zip(textual_data, num_data)]))

# Keep the encoding of every (utterance, number) pair for inspection
for i, l in enumerate(onehots):
    original_label = (textual_data[i], num_data[i])
    encodings[original_label] = l
print(encodings)

# behaviour='new' has been removed from scikit-learn, so it is omitted here
isolation_forest = IsolationForest(contamination='auto')
model = isolation_forest.fit(onehots)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
# Unknown categories are encoded as zeros because of handle_unknown='ignore'
test_encoded = onehot_encoder.transform(np.array(list_of_val))
print(test_encoded)

outliers = isolation_forest.predict(test_encoded)
print(outliers)

for i, outlier in enumerate(outliers):
    if outlier == -1:
        print('Values', list_of_val[i], 'are outliers')
    else:
        print('Values', list_of_val[i], 'are not outliers')
Solution 2:
Are you sure that what you are doing makes sense? Your OneHotEncoder() encodes your categorical variable ('my text') using a one-hot (aka 'one-of-K' or 'dummy') encoding scheme. Think of it as a mapping between your labels and a numeric representation.
In your textual_data you have 7 different labels: ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']. Each of these will be encoded. This happens during your:
>>> x = encoder.fit_transform(x)
>>> print(x)
<7x8 sparse matrix of type '<class 'numpy.float64'>'
with 14 stored elements in Compressed Sparse Row format>
Here your encoder creates a mapping for all 7 labels.
When you continue with your script and try to use that same encoder to transform a new label, it fails:
>>> to_predict = pd.DataFrame({'my text': ['good work', 'you are wrong', 'this was amazing'],
'num data': [2, 54, 1]})
>>> encoder.transform(to_predict)
ValueError: Found unknown categories ['this was amazing', 'good work', 'you are wrong'] in column 0 during transform
It can't find those labels in its mapping. However, if you have new observations whose labels are already part of the mapping, it is able to transform them:
>>> to_predict = pd.DataFrame({'my text': ['i like that', 'i love you', 'i love you'],
'num data': [2, 54, 1]})
>>> encoder.transform(to_predict)
<3x8 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
What you could do instead is add those new observations with new labels to your original df and run them again through your pipeline so they become part of your mapping.
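The suggestion above can be sketched as follows. The encoder from the question is not shown in this answer, so the ColumnTransformer setup here (one-hot on 'my text', 'num data' passed through) is an assumption, chosen because it reproduces the 7x8 shape printed earlier; make_encoder is a hypothetical helper.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'my text': ['i love you', 'I love your dress', 'i like that', 'thats good',
                'amazing', 'wrong', 'hi, how are you, are you doing good'],
    'num data': [4, 1, 3, 2, 65, 3, 3],
})
to_predict = pd.DataFrame({
    'my text': ['good work', 'you are wrong', 'this was amazing'],
    'num data': [2, 54, 1],
})


def make_encoder():
    # Assumed setup: one-hot encode the text column, keep the numeric column as is
    return ColumnTransformer([('onehot', OneHotEncoder(), ['my text'])],
                             remainder='passthrough')


encoder = make_encoder()
encoded = encoder.fit_transform(df)
print(encoded.shape)  # 7 one-hot columns + 1 numeric column

# The fitted encoder rejects labels it has never seen
try:
    encoder.transform(to_predict)
    failed = False
except ValueError:
    failed = True
print(failed)

# Refit on the old and new observations together so every label is in the mapping
combined = pd.concat([df, to_predict], ignore_index=True)
encoder = make_encoder()
combined_encoded = encoder.fit_transform(combined)
new_rows = encoder.transform(to_predict)
print(combined_encoded.shape, new_rows.shape)
```

Refitting changes the number of output columns, so any model trained on the old encoding would have to be retrained on the new one as well.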
I must admit that I'm not experienced at all with this so please correct me if I'm wrong, but that's the way it looks to me. Good luck with your project.
Solution 3:
You have a very similar problem to "AttributeError when using ColumnTransformer into a pipeline". As described there, it is recommended to use pandas for your encoding (there is also an example for one-hot encoding). I hope that helps!
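One common way to one-hot encode in pandas is pd.get_dummies. The linked answer's exact code is not reproduced here, so the following is only a sketch of the general idea, with small illustrative frames; the reindex step is an assumption about how to keep train and test columns aligned.

```python
import pandas as pd

train = pd.DataFrame({'my text': ['i love you', 'thats good', 'amazing'],
                      'num data': [4, 2, 65]})
test = pd.DataFrame({'my text': ['amazing', 'you are wrong'],
                     'num data': [1, 54]})

# One-hot encode the text column; the numeric column is left untouched
train_encoded = pd.get_dummies(train, columns=['my text'], dtype=int)
test_encoded = pd.get_dummies(test, columns=['my text'], dtype=int)

# Align the test columns with the training columns: categories unseen in
# training are dropped, and training categories missing in test are filled with 0
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded)
print(test_encoded)
```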
Solution 4:
Try converting your list list_of_val into a NumPy array by running:
import numpy as np
list_of_val = np.asarray(list_of_val)
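A quick check of what the conversion produces may help. Because list_of_val mixes strings and integers, NumPy stores everything as strings; the numeric_vals extraction at the end is an illustrative extra step, not part of the suggestion above.

```python
import numpy as np

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
arr = np.asarray(list_of_val)

print(arr.shape)       # 2D: one row per observation
print(arr.dtype.kind)  # 'U' means the whole array holds unicode strings
print(arr[:, 1])       # the numbers are now strings as well

# To use the numeric column on its own, cast it back and keep it 2D
numeric_vals = arr[:, 1].astype(int).reshape(-1, 1)
print(numeric_vals)
```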