Computing Separate Tfidf Scores For Two Different Columns Using Sklearn

July 07, 2022

I'm trying to compute the similarity between a set of queries and a set of results for each query. I would like to do this using tf-idf scores and cosine similarity. The issue that

Solution 1:

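You've made a good start by just putting all the words together; often a simple pipeline such as this will be enough to produce good results. You can build more complex feature-processing pipelines using sklearn.pipeline and sklearn.preprocessing. In the example below, a FeatureUnion runs two small pipelines side by side: each one extracts a single column and fits its own TfidfVectorizer, and the two resulting matrices are concatenated column-wise. Here's how it would work for your data: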

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

df_all = pd.DataFrame({'search_term':['hat','cat'], 
                       'product_title':['hat stand','cat in hat']})

transformer = FeatureUnion([
                ('search_term_tfidf', 
                  Pipeline([('extract_field',
                              FunctionTransformer(lambda x: x['search_term'], 
                                                  validate=False)),
                            ('tfidf', 
                              TfidfVectorizer())])),
                ('product_title_tfidf', 
                  Pipeline([('extract_field', 
                              FunctionTransformer(lambda x: x['product_title'], 
                                                  validate=False)),
                            ('tfidf', 
                              TfidfVectorizer())]))]) 

transformer.fit(df_all)

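# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# returns NumPy arrays, so convert to lists before concatenating them.
search_vocab = list(transformer.transformer_list[0][1].steps[1][1].get_feature_names_out())
product_vocab = list(transformer.transformer_list[1][1].steps[1][1].get_feature_names_out())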
vocab = search_vocab + product_vocab

print(vocab)
print(transformer.transform(df_all).toarray())

['cat', 'hat', 'cat', 'hat', 'in', 'stand']

[[ 0.          1.          0.          0.57973867  0.          0.81480247]
 [ 1.          0.          0.6316672   0.44943642  0.6316672   0.        ]]
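The question also asks for a cosine similarity between each query and its result. The two blocks above use separate vocabularies, so their columns are not directly comparable. One possible way around that, which is an assumption on my part and not part of the answer above, is to fit a single TfidfVectorizer on both columns and then compare the rows pairwise. The names shared_tfidf, row_similarity and tfidf_cosine are made up for this sketch:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

# Assumption: one vocabulary fitted on both columns, so that a given
# column index refers to the same word in both matrices.
shared_tfidf = TfidfVectorizer()
shared_tfidf.fit(pd.concat([df_all['search_term'], df_all['product_title']]))

search_vecs = shared_tfidf.transform(df_all['search_term'])
title_vecs = shared_tfidf.transform(df_all['product_title'])

# TfidfVectorizer L2-normalises each row by default (norm='l2'), so the
# row-wise dot product equals the cosine similarity of each pair.
row_similarity = np.asarray(search_vecs.multiply(title_vecs).sum(axis=1)).ravel()

df_all['tfidf_cosine'] = row_similarity
print(df_all)
```

This gives one score per row (query against its own product title). If you need every query compared against every title, sklearn.metrics.pairwise.cosine_similarity(search_vecs, title_vecs) returns the full matrix instead.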
