
Finding Top 5 Values Based On Another Column Pandas

How can I find the top 5 values of the category column for each customer_id in a pandas DataFrame? The DataFrame includes the columns customer_id, email, address_id, name and category.

Solution 1:

Try value_counts + groupby nlargest to get the highest-frequency categories for each customer, then groupby aggregate to combine them into a single string, then join to merge the result back onto the original DataFrame:

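n = 2  # number of top categories to keep per customer
df = df.join(
    # count rows per (customer_id, category) pair
    df.value_counts(['customer_id', 'category'])
        # keep the n largest counts within each customer_id
        .groupby(level=0).nlargest(n)
        # move category out of the index so it can be aggregated
        .reset_index('category')
        # combine each customer's categories into one string
        .groupby(level=0)['category'].agg(', '.join)
        .rename('preferred_film_category'),
    on='customer_id'
)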

df:

    customer_id                              email  address_id            name     category preferred_film_category
0           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi           Action, Scifi
1           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action           Action, Scifi
2           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Sports           Action, Scifi
3           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi           Action, Scifi
4           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Family           Action, Scifi
5           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action           Action, Scifi
6           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary     Documentary, Sports
7           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Action     Documentary, Sports
8           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports     Documentary, Sports
9           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN        Scifi     Documentary, Sports
10          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary     Documentary, Sports
11          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports     Documentary, Sports

*Note: n is set to 2 because each customer has only 4 unique values in category, so 5 would not demonstrate how the code works. Change n to the desired number of values to keep (5).
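
To illustrate the note above, here is a minimal sketch of what happens when n is larger than the number of distinct categories a customer has. It does not use the chain from the answer; it swaps in groupby size, an explicit sort, and head, so that ties are broken alphabetically (an assumption made here to keep the result deterministic). The names top_categories and sample are hypothetical.

```python
import pandas as pd

# Hypothetical sample: only the customer_id and category columns of the example data.
sample = pd.DataFrame({
    'customer_id': [411] * 6 + [527] * 6,
    'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
                 'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
                 'Sports'],
})


def top_categories(frame, n):
    """Return one comma-separated string of the n most frequent categories per customer."""
    counts = (
        frame.groupby(['customer_id', 'category'])
        .size()
        .reset_index(name='count')
        # Assumption: highest count first, ties broken alphabetically by category.
        .sort_values(['customer_id', 'count', 'category'],
                     ascending=[True, False, True])
    )
    # head(n) keeps at most n rows per customer; smaller groups are returned whole.
    top = counts.groupby('customer_id').head(n)
    return top.groupby('customer_id')['category'].agg(', '.join)


top_2 = top_categories(sample, 2)
top_5 = top_categories(sample, 5)  # each customer has only 4 distinct categories
```

With n = 5, each customer simply gets all four of their categories and no error is raised. The result can be attached to the original frame with df.join(top_5.rename('preferred_film_category'), on='customer_id').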


DataFrame Used:

import pandas as pd

df = pd.DataFrame({
    'customer_id': [411, 411, 411, 411, 411, 411, 527, 527, 527, 527, 527, 527],
    'email': ['NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org'],
    'address_id': [416, 416, 416, 416, 416, 416, 533, 533, 533, 533, 533, 533],
    'name': ['NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
             'NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
             'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN',
             'CORY MEEHAN', 'CORY MEEHAN'],
    'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
                 'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
                 'Sports']
})

df:

    customer_id                              email  address_id            name     category
0           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi
1           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action
2           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Sports
3           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi
4           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Family
5           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action
6           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary
7           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Action
8           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports
9           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN        Scifi
10          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary
11          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports
