Skip to content Skip to sidebar Skip to footer

Computing Separate Tfidf Scores For Two Different Columns Using Sklearn

I'm trying to compute the similarity between a set of queries and a set a result for each query. I would like to do this using tfidf scores and cosine similarity. The issue that

Solution 1:

You've made a good start by just putting all the words together; often a simple pipeline such as this will be enough to produce good results. You can build more complex feature processing pipelines using pipeline and preprocessing. Here's how it would work for your data:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

df_all = pd.DataFrame({'search_term':['hat','cat'], 
                       'product_title':['hat stand','cat in hat']})

transformer = FeatureUnion([
                ('search_term_tfidf', 
                  Pipeline([('extract_field',
                              FunctionTransformer(lambda x: x['search_term'], 
                                                  validate=False)),
                            ('tfidf', 
                              TfidfVectorizer())])),
                ('product_title_tfidf', 
                  Pipeline([('extract_field', 
                              FunctionTransformer(lambda x: x['product_title'], 
                                                  validate=False)),
                            ('tfidf', 
                              TfidfVectorizer())]))]) 

transformer.fit(df_all)

search_vocab = transformer.transformer_list[0][1].steps[1][1].get_feature_names() 
product_vocab = transformer.transformer_list[1][1].steps[1][1].get_feature_names()
vocab = search_vocab + product_vocab

print(vocab)
print(transformer.transform(df_all).toarray())

['cat', 'hat', 'cat', 'hat', 'in', 'stand']

[[ 0.          1.          0.          0.57973867  0.          0.81480247]
 [ 1.          0.          0.6316672   0.44943642  0.6316672   0.        ]]

Post a Comment for "Computing Separate Tfidf Scores For Two Different Columns Using Sklearn"