Count Words In A Column Of Strings In Pandas

I have a pandas dataframe that contains queries and counts for a given time period and I'm hoping to turn this dataframe into a count of unique words. For example, if the dataframe

Solution 1:

Option 1

df['query'].str.get_dummies(sep=' ').T.dot(df['count'])

bar      12
foo      16
super    10
dtype: int64
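
To see why Option 1 works, here is a minimal sketch on a small made-up frame. The sample rows are an assumption, since the question's own example data is cut off above. They were chosen only so that the totals agree with the output shown.

```python
import pandas as pd

# Hypothetical sample data (assumption): the question's frame is not shown.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# One 0/1 indicator column per distinct word, one row per query.
dummies = df["query"].str.get_dummies(sep=" ")

# After transposing, each row is a word and each column a query.
# The dot product with the counts adds up the count of every
# query in which that word occurs.
word_counts = dummies.T.dot(df["count"])
print(word_counts)
```

Because get_dummies records only whether a word is present, a word repeated inside a single query contributes that row's count once, not once per occurrence.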

Option 2

df['query'].str.get_dummies(sep=' ').mul(df['count'], axis=0).sum()

bar      12
foo      16
super    10
dtype: int64

Option 3

numpy.bincount + pd.factorize, also highlighting the use of cytoolz.mapcat, which maps a function over a sequence and concatenates the results into a single iterator. That's cool!

import pandas as pd, numpy as np, cytoolz

q = df['query'].values
c = df['count'].values

f, u = pd.factorize(list(cytoolz.mapcat(str.split, q.tolist())))
# words per query = spaces + 1; np.char is the public alias of np.core.defchararray
l = np.char.count(q.astype(str), ' ') + 1

pd.Series(np.bincount(f, c.repeat(l)).astype(int), u)

foo      16
bar      12
super    10
dtype: int64
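
If cytoolz is not installed, the same idea can be sketched with the standard library: itertools.chain.from_iterable(map(f, seq)) plays the role of mapcat(f, seq). The sample frame and the variable names below are assumptions made for illustration, and this is a stand-in for the approach above, not a replacement for it.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical sample data (assumption): the question's frame is not shown.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

queries = df["query"].tolist()
counts = df["count"].to_numpy()

# Standard-library stand-in for cytoolz.mapcat(str.split, queries):
# split every query, then flatten the pieces into one list of words.
words = list(chain.from_iterable(map(str.split, queries)))

# Number of words in each query, used to repeat each row's count per word.
lengths = np.array([len(q.split()) for q in queries])

# Integer code per word plus the distinct words in order of first appearance.
codes, uniques = pd.factorize(np.array(words, dtype=object))

# bincount with weights sums the repeated counts per word code.
totals = np.bincount(codes, weights=np.repeat(counts, lengths)).astype(int)
result = pd.Series(totals, index=uniques)
print(result)
```

The result is ordered by first appearance of each word, not alphabetically, which is why foo comes before bar here.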

Option 4

Absurd use of stuff... just use Option 1.

pd.DataFrame(dict(
    query=' '.join(df['query']).split(),
    count=df['count'].repeat(df['query'].str.count(' ') + 1)
)).groupby('query')['count'].sum()

query
bar      12
foo      16
super    10
Name: count, dtype: int64

Solution 2:

Just another alternative with melt + groupby + sum:

df['query'].str.split(expand=True).assign(count=df['count'])\
                          .melt('count').groupby('value')['count'].sum()

value
bar      12
foo      16
super    10
Name: count, dtype: int64
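
On pandas 0.25 or newer, a similar reshape can be sketched with str.split plus DataFrame.explode in place of melt. This is a different technique from the one in this answer, shown only for comparison. The sample frame and the column name word are assumptions.

```python
import pandas as pd

# Hypothetical sample data (assumption): the question's frame is not shown.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# Split each query into a list of words, then give every word its own row.
# explode repeats the other columns, so each word keeps its row's count.
words = df.assign(word=df["query"].str.split()).explode("word")

# Sum the counts per word.
totals = words.groupby("word")["count"].sum()
print(totals)
```

Unlike the get_dummies options, this version counts a word once per occurrence, so a word repeated within one query adds that row's count each time it appears.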
