Python, Pandas: How To Club Rows Rows Of Dataframe Based On Start Time And End Time
Solution 1:
To present a more instructive example, I took the following source DataFrame:
Start Time End Time IDs C1 C2
0 15:02:13 15:10:24 BAMB30 X9 Y9
1 19:46:19 19:46:29 BHI110 X9 Y9
2 19:47:01 19:57:04 BHI110 D2 F2
3 19:47:01 19:56:58 BHI110 D2 E2
4 19:47:01 19:56:59 BHI110 D2 E2
5 20:00:02 20:20:00 BHI110 G3 H3
6 20:01:03 20:21:16 BHI110 G3 H3
7 20:15:00 20:23:20 BHI110 X9 Y9
8 12:01:46 12:06:30 AKB286 A1 B1
9 12:02:48 12:06:50 AKB286 A1 B1
10 12:02:50 12:06:55 AKB286 A1 C1
I added C1 and C2 columns (to be compared on equality within the current group), according to your comment.
Since both Start Time and End Time columns are of string type, the first step is to convert them to Timedelta:
df['Start Time'] = df['Start Time'].apply(pd.Timedelta)
df['End Time'] = df['End Time'].apply(pd.Timedelta)
Then I defined a size limit for a group to be considered as an Event. You wrote about this limit == 5, but since both your and my data sample contain only smaller groups, I set this limit at 2:
sizeLimit = 2
Of course, running my code on your real data, change this limit to whatever you need.
Then define a function to check the "time delta" between the current row and the "starting row" and generate "event numbers":
def tDlt(row):
st, et, c1, c2 = row[['Start Time', 'End Time', 'C1', 'C2']]
if tDlt.start is None:
tDlt.start, tDlt.end, tDlt.ev, tDlt.c1, tDlt.c2 = st, et, 0, c1, c2
else:
if ((st - tDlt.start).total_seconds() > 120)\
or ((et - tDlt.end).total_seconds() > 120)\
or (c1 != tDlt.c1) or (c2 != tDlt.c2):
tDlt.start, tDlt.end, tDlt.c1, tDlt.c2 = st, et, c1, c2
tDlt.ev += 1
return tDlt.ev
Due to usage of its internal attributes, it is a "function with memory", keeping in start and end attributes respective values from the previous row and ev attribute - the event number.
This function will be applied to each row of the DataFrame, but before it, its start attribute will be set to None, to provide proper dealing with the first row.
Note that the starting row is set:
- on the first row (when tDlt.start is None),
- on each row "too distant in time" from the "starting" row or with C1 or C2 different from th e"starting" row.
This function generates consecutive "event numbers":
- starting from 0,
- increased whenever the any condition to continue the current group has not been met,
- for all groups, even those below the size limit.
The main processing runs as follows:
Set the "initial value" in start attribute of tDlt function:
tDlt.start = None
Sort df and apply tDlt to each row:
ev = df.sort_values(['Start Time', 'End Time']).apply(tDlt, axis=1)
The result (for my data sample) is:
8 0 9 0 10 1 0 2 1 3 3 4 4 4 2 5 5 6 6 6 7 7 dtype: int64
Of course, the row order is different, due to the sort before the application.
Check e.g. rows with indices 3, 4 and 2. Row 3 is the earliest from this group. Row 4 is within the same group (all conditions met). But row 2 has different value in C2 column, so it starts a new group.
The next step is to cancel group numbers for "too small" groups:
ev = ev.groupby(ev).transform(lambda grp: str(grp.iloc[0]) if grp.size >= sizeLimit else '')
Steps:
- take each group (by value) and check its size,
- if it has at least sizeLimit rows, return the original group number, but as a string (for each row),
- otherwise return an empty string (also for each row) - the actual cancellation.
The result is:
8 0 9 0 10 0 1 3 4 4 4 2 5 6 6 6 7 dtype: object
Now perform "initial filling" of the new column:
df['Event'] = ev[ev != ''].groupby(ev, sort=False).ngroup() + 1
Steps:
- Take non-empty elements from ev.
- Group them (by their value).
- Return the "global" group number, starting from 1. Note that the "initial group numbers" (computed so far) are here changed into consecutive numbers.
But this is not the final content yet (print df at this stage), because:
- cells for "too short" groups contain NaN,
- elements are of float type.
To get rid of the above deficiencies, run:
df.Event.replace(np.nan, '', inplace=True)
The final result, sorted by both times, is:
Start Time End Time IDs C1 C2 Event
8 12:01:46 12:06:30 AKB286 A1 B1 1
9 12:02:48 12:06:50 AKB286 A1 B1 1
10 12:02:50 12:06:55 AKB286 A1 C1
0 15:02:13 15:10:24 BAMB30 X9 Y9
1 19:46:19 19:46:29 BHI110 X9 Y9
3 19:47:01 19:56:58 BHI110 D2 E2 2
4 19:47:01 19:56:59 BHI110 D2 E2 2
2 19:47:01 19:57:04 BHI110 D2 F2
5 20:00:02 20:20:00 BHI110 G3 H3 3
6 20:01:03 20:21:16 BHI110 G3 H3 3
7 20:15:00 20:23:20 BHI110 X9 Y9
As you can see:
- First 2 rows are close enough in time, so the are event 1.
- The third row has Start Time too distant from the previous row, so it has not been included in the above group.
- It is also too distant from the next row, so they can not be grouped in an event.
- Rows 3 and 4 form the next group.
- Row 2 is excluded, due to different value in C2.
- And so on.
Most likely, your both columns to be checked for equality can have other names, so put their actual names in tDlt instead of C1 and C2.
Edit following comment about grouping by IDs
Change the function to:
def tDlt(row):
id, st, et, c1, c2 = row[['IDs', 'Start Time', 'End Time', 'C1', 'C2']]
if tDlt.start is None:
tDlt.id, tDlt.start, tDlt.end, tDlt.ev, tDlt.c1, tDlt.c2 = id, st, et, 0, c1, c2
else:
if id != tDlt.id\
or ((st - tDlt.start).total_seconds() > 120)\
or ((et - tDlt.end).total_seconds() > 120)\
or (c1 != tDlt.c1) or (c2 != tDlt.c2):
tDlt.id, tDlt.start, tDlt.end, tDlt.c1, tDlt.c2 = id, st, et, c1, c2
tDlt.ev += 1
return tDlt.ev
After tDlt.start = None
change the next instruction to:
ev = df.sort_values(['IDs', 'Start Time', 'End Time']).apply(tDlt, axis=1)
I suppose that you added an internal attribute for IDs, but forgot to sort the source DataFrame on IDs.
To test this code, I added 2 rows:
11 20:00:02 20:20:00 XXX110 G3 H3
12 20:01:03 20:21:16 XXX110 G3 H3
Note that they are just like rows with indices 5 and 6, but there is different IDs.
The result, for such extended data, sorted on IDs, Start Time and End Time is:
Start Time End Time IDs C1 C2 Event
8 12:01:46 12:06:30 AKB286 A1 B1 1
9 12:02:48 12:06:50 AKB286 A1 B1 1
10 12:02:50 12:06:55 AKB286 A1 C1
0 15:02:13 15:10:24 BAMB30 X9 Y9
1 19:46:19 19:46:29 BHI110 X9 Y9
3 19:47:01 19:56:58 BHI110 D2 E2 2
4 19:47:01 19:56:59 BHI110 D2 E2 2
2 19:47:01 19:57:04 BHI110 D2 F2
5 20:00:02 20:20:00 BHI110 G3 H3 3
6 20:01:03 20:21:16 BHI110 G3 H3 3
7 20:15:00 20:23:20 BHI110 X9 Y9
11 20:00:02 20:20:00 XXX110 G3 H3 4
12 20:01:03 20:21:16 XXX110 G3 H3 4
so rows from XXX110 are members of a separate event.
Post a Comment for "Python, Pandas: How To Club Rows Rows Of Dataframe Based On Start Time And End Time"