Skip to content Skip to sidebar Skip to footer

Split The Data Frame Based On Consecutive Row Values Differences

I have a data frame like this, df col1 col2 col3 1 2 3 2 5 6 7 8 9 10 11 12 11 12 13 13 14 15 14 15

Solution 1:

You could define a custom grouper by taking the diff, checking when it is greater than 1, and take the cumsum of the boolean series. Then group by the result and build a dictionary from the groupby object:

d = dict(tuple(df.groupby(df.col1.diff().gt(1).cumsum())))

print(d[0])
   col1  col2  col3
0     1     2     3
1     2     5     6

print(d[1])
   col1  col2  col3
2     7     8     9

A more detailed break-down:

df.assign(difference=(diff:=df.col1.diff()), 
          condition=(gt1:=diff.gt(1)), 
          grouper=gt1.cumsum())

   col1  col2  col3  difference  condition  grouper
0     1     2     3         NaN      False        0
1     2     5     6         1.0      False        0
2     7     8     9         5.0       True        1
3    10    11    12         3.0       True        2
4    11    12    13         1.0      False        2
5    13    14    15         2.0       True        3
6    14    15    16         1.0      False        3

Solution 2:

You can also peel off the target column and work with it as a series, rather than the above answer. That keeps everything smaller. It runs faster on the example, but I don't know how they'll scale up, depending how many times you're splitting.

row_bool = df['col1'].diff()>1
split_inds, = np.where(row_bool)
split_inds = np.insert(arr=split_inds, obj=[0,len(split_inds)], values=[0,len(df)])

df_tup = ()
for n in range(0,len(split_inds)-1):
    tempdf = df.iloc[split_inds[n]:split_inds[n+1],:]
    df_tup.append(tempdf)

(Just throwing it in a tuple of dataframes afterward, but the dictionary approach might be better?)


Post a Comment for "Split The Data Frame Based On Consecutive Row Values Differences"