Finding Top 5 Values Based On Another Column Pandas
How do I find the top 5 values of the category column for each customer_id when grouping a pandas DataFrame? The DataFrame has the columns customer_id, email, address_id, name and category.
Solution 1:
Try value_counts + groupby nlargest to get the highest-frequency categories, then groupby aggregate to convert them to a single string per customer, then join to merge the result back with the original DataFrame:
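n = 2  # number of top categories to keep per customer
df = df.join(
    # Count rows per (customer_id, category) pair
    df.value_counts(['customer_id', 'category'])
    # Keep the n most frequent categories for each customer
    .groupby(level=0).nlargest(n)
    .reset_index('category')
    # Combine each customer's categories into a single string
    .groupby(level=0)['category'].agg(', '.join)
    .rename('preferred_film_category'),
    on='customer_id'
)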
df:
    customer_id  email                              address_id  name            category     preferred_film_category
0   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi        Action, Scifi
1   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action       Action, Scifi
2   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Sports       Action, Scifi
3   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi        Action, Scifi
4   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Family       Action, Scifi
5   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action       Action, Scifi
6   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary  Documentary, Sports
7   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Action       Documentary, Sports
8   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports       Documentary, Sports
9   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Scifi        Documentary, Sports
10  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary  Documentary, Sports
11  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports       Documentary, Sports
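The chained expression above can be unpacked into named intermediate steps, which may make it easier to inspect what each call contributes. This is only a sketch: the variable names counts, top and preferred are illustrative, and the DataFrame is cut down to the two columns the pipeline actually reads. No output is shown for the joined strings because the order of categories with equal counts is not something this sketch controls.

```python
import pandas as pd

# Reduced copy of the example data: only the columns the pipeline reads.
df = pd.DataFrame({
    'customer_id': [411] * 6 + [527] * 6,
    'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
                 'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
                 'Sports'],
})
n = 2

# Step 1: count rows per (customer_id, category) pair.
counts = df.value_counts(['customer_id', 'category'])

# Step 2: within each customer, keep the n largest counts.
top = counts.groupby(level=0).nlargest(n)

# Step 3: move 'category' out of the index, then join the category
# names for each customer into one comma-separated string.
preferred = (
    top.reset_index('category')
       .groupby(level=0)['category']
       .agg(', '.join)
       .rename('preferred_film_category')
)
print(preferred)
```

Joining preferred back with df.join(preferred, on='customer_id') gives the same kind of result as the one-expression version.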
Note: n is set to 2 here because each customer has only 4 unique values in category, so n = 5 would not demonstrate how the code works. Change n to the number of values you want to keep (5).
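To illustrate the point about n, the following sketch runs the same pipeline with n = 5 on a cut-down copy of the example data (only customer_id and category, which is an assumption made for brevity). Since nlargest returns at most as many rows as a group contains, each customer simply ends up with all 4 of their categories.

```python
import pandas as pd

# Reduced copy of the example data: only the columns the pipeline reads.
df = pd.DataFrame({
    'customer_id': [411] * 6 + [527] * 6,
    'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
                 'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
                 'Sports'],
})

n = 5  # more than the 4 unique categories each customer has
preferred = (
    df.value_counts(['customer_id', 'category'])
      .groupby(level=0).nlargest(n)
      .reset_index('category')
      .groupby(level=0)['category'].agg(', '.join)
)

# Number of categories kept per customer: 4 each, not 5.
kept = preferred.str.split(', ').str.len()
print(kept)
```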
DataFrame Used:
import pandas as pd

df = pd.DataFrame({
    'customer_id': [411, 411, 411, 411, 411, 411, 527, 527, 527, 527, 527, 527],
    'email': ['NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org'],
    'address_id': [416, 416, 416, 416, 416, 416, 533, 533, 533, 533, 533, 533],
    'name': ['NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
             'NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
             'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN',
             'CORY MEEHAN', 'CORY MEEHAN'],
    'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
                 'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
                 'Sports']
})
df:
    customer_id  email                              address_id  name            category
0   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi
1   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action
2   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Sports
3   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi
4   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Family
5   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action
6   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary
7   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Action
8   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports
9   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Scifi
10  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary
11  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports