Slicing Pandas Dataframe According To Number Of Lines
Solution 1:
I would use .groupby()
+ .filter()
methods:
In [66]: df.groupby('Group').filter(lambda x: len(x) >= 4 and x['SomeQuantity'].max() >= 20)
Out[66]:
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
Solution 2:
A slightly different approach using map
, value_counts
, groupby
, filter
:
(df[df.Group.map(df.Group.value_counts().ge(4))]
.groupby('Group')
.filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))
Breakdown of steps:
Perform value_counts
to compute the total counts of distinct elements present in Group column.
>>> df.Group.value_counts()
3 5
4 4
1 2
5 1
2 1
Name: Group, dtype: int64
Use map
which functions like a dictionary (wherein the index becomes the keys and the series elements become the values) to map these results back to the original DF
>>> df.Group.map(df.Group.value_counts())
0 2
1 2
2 1
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
12 1
Name: Group, dtype: int64
Then, we check for the elements having a value of 4 or more which is our threshold limit and take only those subset from the entire DF
.
>>> df[df.Group.map(df.Group.value_counts().ge(4))]
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
8 4 1 1
9 4 2 28
10 4 3 14
11 4 4 7
Inorder to use groupby.filter
operation on this, we must make sure that we return a single boolean value corresponding to each grouped key when we perform the sorting process and compare the fourth element to the threshold which is 20.
np.any
returns all such possiblities matching our filter.
>>> df[df.Group.map(df.Group.value_counts().ge(4))] \
.groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])
Group
3 22
4 18
dtype: int64
From these, we compare the fourth element .iloc[3]
as it is 0-based indexed and return all such favourable matches.
Solution 3:
This is how I have worked through your question, warts and all. Im sure there are much nicer ways to do this.
Find groups with "4 objects in the group"
import collections
groups = list({k for k, v in collections.Counter(df.Group).items() if v > 3} );groups
Out:[3, 4]
Use these groups to filter to a new df containing these groups:
df2 = df[df.Group.isin(groups)]
"4th SomeQuantity (after sorting) is 22 (>=20)"
df3 = df2.sort_values(by='SomeQuantity',ascending=False)
(Updated as per comment below...)
df3.groupby('Group').filter(lambda grp: any(grp.sort_values('SomeQuantity').iloc[3] >= 20)).sort_index()
Group Id_In_Group SomeQuantity
3 3 1 16
4 3 2 22
5 3 3 5
6 3 4 12
7 3 5 28
Post a Comment for "Slicing Pandas Dataframe According To Number Of Lines"