Skip to content Skip to sidebar Skip to footer

Slicing Pandas Dataframe According To Number Of Lines

I suppose this is something rather simple, but I can't find how to make this. I've been searching tutorials and stackoverflow. Suppose I have a dataframe df loking like this : Grou

Solution 1:

I would use .groupby() + .filter() methods:

In [66]: df.groupby('Group').filter(lambda x: len(x) >= 4 and x['SomeQuantity'].max() >= 20)
Out[66]:
   Group  Id_In_Group  SomeQuantity
3      3            1            16
4      3            2            22
5      3            3             5
6      3            4            12
7      3            5            28

Solution 2:

A slightly different approach using map, value_counts, groupby , filter:

(df[df.Group.map(df.Group.value_counts().ge(4))]
   .groupby('Group')
   .filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))

enter image description here


Breakdown of steps:

Perform value_counts to compute the total counts of distinct elements present in Group column.

>>> df.Group.value_counts()

3    5
4    4
1    2
5    1
2    1
Name: Group, dtype: int64

Use map which functions like a dictionary (wherein the index becomes the keys and the series elements become the values) to map these results back to the original DF

>>> df.Group.map(df.Group.value_counts())

0     2
1     2
2     1
3     5
4     5
5     5
6     5
7     5
8     4
9     4
10    4
11    4
12    1
Name: Group, dtype: int64

Then, we check for the elements having a value of 4 or more which is our threshold limit and take only those subset from the entire DF.

>>> df[df.Group.map(df.Group.value_counts().ge(4))]   

    Group  Id_In_Group  SomeQuantity
3       3            1            16
4       3            2            22
5       3            3             5
6       3            4            12
7       3            5            28
8       4            1             1
9       4            2            28
10      4            3            14
11      4            4             7

Inorder to use groupby.filter operation on this, we must make sure that we return a single boolean value corresponding to each grouped key when we perform the sorting process and compare the fourth element to the threshold which is 20. np.any returns all such possiblities matching our filter.

>>> df[df.Group.map(df.Group.value_counts().ge(4))]         \
      .groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])

 Group
3    22
4    18
dtype: int64

From these, we compare the fourth element .iloc[3] as it is 0-based indexed and return all such favourable matches.


Solution 3:

This is how I have worked through your question, warts and all. Im sure there are much nicer ways to do this.

Find groups with "4 objects in the group"

import collections

groups = list({k for k, v in collections.Counter(df.Group).items() if v > 3} );groups

Out:[3, 4]

Use these groups to filter to a new df containing these groups:

df2 = df[df.Group.isin(groups)]

"4th SomeQuantity (after sorting) is 22 (>=20)"

 df3 = df2.sort_values(by='SomeQuantity',ascending=False)

(Updated as per comment below...)

df3.groupby('Group').filter(lambda grp: any(grp.sort_values('SomeQuantity').iloc[3] >= 20)).sort_index()

    Group   Id_In_Group SomeQuantity
  3    3        1       16
  4    3        2       22
  5    3        3       5
  6    3        4       12
  7    3        5       28

Post a Comment for "Slicing Pandas Dataframe According To Number Of Lines"