Python, Pandas: How To Club Rows Rows Of Dataframe Based On Start Time And End Time

February 12, 2023 Post a Comment

So I have a dataframe like this: Start Time End Time IDs 15:02:13 15:10:24 BAMB30 19:46:19 19:46:29 BHI110 19:47:01 19:57:04 BHI110 19:47:01

Solution 1:

To present a more instructive example, I took the following source DataFrame:

   Start Time End Time     IDs  C1  C2
0    15:02:13 15:10:24  BAMB30  X9  Y9
1    19:46:19 19:46:29  BHI110  X9  Y9
2    19:47:01 19:57:04  BHI110  D2  F2
3    19:47:01 19:56:58  BHI110  D2  E2
4    19:47:01 19:56:59  BHI110  D2  E2
5    20:00:02 20:20:00  BHI110  G3  H3
6    20:01:03 20:21:16  BHI110  G3  H3
7    20:15:00 20:23:20  BHI110  X9  Y9
8    12:01:46 12:06:30  AKB286  A1  B1
9    12:02:48 12:06:50  AKB286  A1  B1
10   12:02:50 12:06:55  AKB286  A1  C1

I added C1 and C2 columns (to be compared on equality within the current group), according to your comment.

Since both Start Time and End Time columns are of string type, the first step is to convert them to Timedelta:

df['Start Time'] = df['Start Time'].apply(pd.Timedelta)
df['End Time'] = df['End Time'].apply(pd.Timedelta)

Then I defined a size limit for a group to be considered as an Event. You wrote about this limit == 5, but since both your and my data sample contain only smaller groups, I set this limit at 2:

sizeLimit = 2

Of course, running my code on your real data, change this limit to whatever you need.

Then define a function to check the "time delta" between the current row and the "starting row" and generate "event numbers":

def tDlt(row):
    st, et, c1, c2 = row[['Start Time', 'End Time', 'C1', 'C2']]
    if tDlt.start is None:
        tDlt.start, tDlt.end, tDlt.ev, tDlt.c1, tDlt.c2 = st, et, 0, c1, c2
    else:
        if ((st - tDlt.start).total_seconds() > 120)\
                or ((et - tDlt.end).total_seconds() > 120)\
                or (c1 != tDlt.c1) or (c2 != tDlt.c2):
            tDlt.start, tDlt.end, tDlt.c1, tDlt.c2 = st, et, c1, c2
            tDlt.ev += 1
    return tDlt.ev

Due to usage of its internal attributes, it is a "function with memory", keeping in start and end attributes respective values from the previous row and ev attribute - the event number.

This function will be applied to each row of the DataFrame, but before it, its start attribute will be set to None, to provide proper dealing with the first row.

Note that the starting row is set:

on the first row (when tDlt.start is None),
on each row "too distant in time" from the "starting" row or with C1 or C2 different from th e"starting" row.

This function generates consecutive "event numbers":

starting from 0,
increased whenever the any condition to continue the current group has not been met,
for all groups, even those below the size limit.

The main processing runs as follows:

Set the "initial value" in start attribute of tDlt function:
```
 tDlt.start = None
```
Sort df and apply tDlt to each row:
```
 ev = df.sort_values(['Start Time', 'End Time']).apply(tDlt, axis=1)
```
The result (for my data sample) is:
```
 8     0
 9     0
 10    1
 0     2
 1     3
 3     4
 4     4
 2     5
 5     6
 6     6
 7     7
 dtype: int64
```
Of course, the row order is different, due to the sort before the application.

Check e.g. rows with indices 3, 4 and 2. Row 3 is the earliest from this group. Row 4 is within the same group (all conditions met). But row 2 has different value in C2 column, so it starts a new group.

The next step is to cancel group numbers for "too small" groups:

 ev = ev.groupby(ev).transform(lambda grp: str(grp.iloc[0])
     if grp.size >= sizeLimit else '')

Steps:

take each group (by value) and check its size,
if it has at least sizeLimit rows, return the original group number, but as a string (for each row),
otherwise return an empty string (also for each row) - the actual cancellation.

The result is:

 8     0
 9     0
 10     
 0      
 1      
 3     4
 4     4
 2      
 5     6
 6     6
 7      
 dtype: object

Now perform "initial filling" of the new column:
```
 df['Event'] = ev[ev != ''].groupby(ev, sort=False).ngroup() + 1
```
Steps:
- Take non-empty elements from ev.
- Group them (by their value).
- Return the "global" group number, starting from 1. Note that the "initial group numbers" (computed so far) are here changed into consecutive numbers.
But this is not the final content yet (print df at this stage), because:
- cells for "too short" groups contain NaN,
- elements are of float type.

To get rid of the above deficiencies, run:

 df.Event.replace(np.nan, '', inplace=True)

The final result, sorted by both times, is:

Baca Juga

   Start Time End Time     IDs  C1  C2 Event
8    12:01:46 12:06:30  AKB286  A1  B1     1
9    12:02:48 12:06:50  AKB286  A1  B1     1
10   12:02:50 12:06:55  AKB286  A1  C1      
0    15:02:13 15:10:24  BAMB30  X9  Y9      
1    19:46:19 19:46:29  BHI110  X9  Y9      
3    19:47:01 19:56:58  BHI110  D2  E2     2
4    19:47:01 19:56:59  BHI110  D2  E2     2
2    19:47:01 19:57:04  BHI110  D2  F2      
5    20:00:02 20:20:00  BHI110  G3  H3     3
6    20:01:03 20:21:16  BHI110  G3  H3     3
7    20:15:00 20:23:20  BHI110  X9  Y9

As you can see:

First 2 rows are close enough in time, so the are event 1.
The third row has Start Time too distant from the previous row, so it has not been included in the above group.
It is also too distant from the next row, so they can not be grouped in an event.
Rows 3 and 4 form the next group.
Row 2 is excluded, due to different value in C2.
And so on.

Most likely, your both columns to be checked for equality can have other names, so put their actual names in tDlt instead of C1 and C2.

Edit following comment about grouping by IDs

Change the function to:

def tDlt(row):
    id, st, et, c1, c2 = row[['IDs', 'Start Time', 'End Time', 'C1', 'C2']]
    if tDlt.start is None:
        tDlt.id, tDlt.start, tDlt.end, tDlt.ev, tDlt.c1, tDlt.c2 = id, st, et, 0, c1, c2
    else:
        if id != tDlt.id\
                or ((st - tDlt.start).total_seconds() > 120)\
                or ((et - tDlt.end).total_seconds() > 120)\
                or (c1 != tDlt.c1) or (c2 != tDlt.c2):
            tDlt.id, tDlt.start, tDlt.end, tDlt.c1, tDlt.c2 = id, st, et, c1, c2
            tDlt.ev += 1
    return tDlt.ev

After tDlt.start = None change the next instruction to:

ev = df.sort_values(['IDs', 'Start Time', 'End Time']).apply(tDlt, axis=1)

I suppose that you added an internal attribute for IDs, but forgot to sort the source DataFrame on IDs.

To test this code, I added 2 rows:

11   20:00:02   20:20:00  XXX110  G3  H3
12   20:01:03   20:21:16  XXX110  G3  H3

Note that they are just like rows with indices 5 and 6, but there is different IDs.

The result, for such extended data, sorted on IDs, Start Time and End Time is:

   Start Time End Time     IDs  C1  C2 Event
8    12:01:46 12:06:30  AKB286  A1  B1     1
9    12:02:48 12:06:50  AKB286  A1  B1     1
10   12:02:50 12:06:55  AKB286  A1  C1      
0    15:02:13 15:10:24  BAMB30  X9  Y9      
1    19:46:19 19:46:29  BHI110  X9  Y9      
3    19:47:01 19:56:58  BHI110  D2  E2     2
4    19:47:01 19:56:59  BHI110  D2  E2     2
2    19:47:01 19:57:04  BHI110  D2  F2      
5    20:00:02 20:20:00  BHI110  G3  H3     3
6    20:01:03 20:21:16  BHI110  G3  H3     3
7    20:15:00 20:23:20  BHI110  X9  Y9      
11   20:00:02 20:20:00  XXX110  G3  H3     4
12   20:01:03 20:21:16  XXX110  G3  H3     4

so rows from XXX110 are members of a separate event.

Python Guru

Python, Pandas: How To Club Rows Rows Of Dataframe Based On Start Time And End Time

Solution 1:

Edit following comment about grouping by IDs

Post a Comment for "Python, Pandas: How To Club Rows Rows Of Dataframe Based On Start Time And End Time"