Computing Separate Tfidf Scores For Two Different Columns Using Sklearn
I'm trying to compute the similarity between a set of queries and a set of results for each query. I would like to do this using tfidf scores and cosine similarity. The issue that
Solution 1:
You've made a good start by just putting all the words together; often a simple pipeline such as this will be enough to produce good results. You can build more complex feature processing pipelines using the pipeline and preprocessing modules. Here's how it would work for your data:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

# One pipeline per column: select the column, then fit its own TF-IDF model.
transformer = FeatureUnion([
    ('search_term_tfidf',
     Pipeline([('extract_field',
                FunctionTransformer(lambda x: x['search_term'],
                                    validate=False)),
               ('tfidf',
                TfidfVectorizer())])),
    ('product_title_tfidf',
     Pipeline([('extract_field',
                FunctionTransformer(lambda x: x['product_title'],
                                    validate=False)),
               ('tfidf',
                TfidfVectorizer())]))])

transformer.fit(df_all)

# Each column keeps its own vocabulary; join them in FeatureUnion order.
# get_feature_names() was removed in scikit-learn 1.2, so this uses
# get_feature_names_out(); on versions before 1.0, call get_feature_names().
search_vocab = transformer.transformer_list[0][1].steps[1][1].get_feature_names_out().tolist()
product_vocab = transformer.transformer_list[1][1].steps[1][1].get_feature_names_out().tolist()
vocab = search_vocab + product_vocab

print(vocab)
print(transformer.transform(df_all).toarray())
['cat', 'hat', 'cat', 'hat', 'in', 'stand']
[[ 0. 1. 0. 0.57973867 0. 0.81480247]
[ 1. 0. 0.6316672 0.44943642 0.6316672 0. ]]
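Note that in the output above each column has its own vocabulary, so the 'hat' column from search_term and the 'hat' column from product_title sit at different indices and the two blocks can't be compared directly. If the goal is a cosine similarity between each query and its own result, one option is to fit a single vocabulary over both columns and then take the row-wise dot product. The following is only a minimal sketch under that assumption; the names vectorizer, row_similarity and the tfidf_cosine column are made up for illustration and are not part of the answer above:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

# Assumption: one shared vocabulary, fitted on the text of both columns,
# so that both matrices use the same column index for the same word.
vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([df_all['search_term'], df_all['product_title']]))

search_tfidf = vectorizer.transform(df_all['search_term'])
product_tfidf = vectorizer.transform(df_all['product_title'])

# TfidfVectorizer L2-normalises each row by default (norm='l2'), so the
# element-wise product summed along each row is the cosine similarity
# between row i of one matrix and row i of the other.
row_similarity = np.asarray(
    search_tfidf.multiply(product_tfidf).sum(axis=1)).ravel()

df_all['tfidf_cosine'] = row_similarity
print(df_all)
```

This gives one score per row (query vs. its own product title). If you instead need every query against every title, sklearn.metrics.pairwise.cosine_similarity(search_tfidf, product_tfidf) returns the full matrix.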