Groupby of Split Data (pandas)
Imagine you have a large CSV file with several million rows that you process in chunks. The file is too large to be loaded into memory. What would be the best way to do a groupby and apply a function such as a forward fill (ffill), given that a group's rows can be split across chunk boundaries?
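For concreteness, here is a minimal sketch of the pattern the question assumes (the file name, chunk size, and the ID/value column names are placeholders). A naive per-chunk groupby forward fill loses all state at chunk boundaries, which is exactly the problem both solutions below address:

import pandas as pd

# naive version: each chunk is an ordinary DataFrame, but any group that
# straddles a chunk boundary is filled as if it started fresh
for chunk in pd.read_csv('huge.csv', chunksize=1_000_000):
    chunk['value'] = chunk.groupby('ID')['value'].ffill()  # per-chunk state only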
Solution 1:
You need to set up a way to remember the last fill value for each group. I use the dictionary memory below.
memory = {}

def fill(df):
    name = df.name
    df = df.copy()
    # fill from memory
    if name in memory.keys():
        df.iloc[0, :] = df.iloc[0, :].fillna(memory[name])
    # normal ffill
    df = df.fillna(method='ffill')
    # update memory
    memory.update({name: df.iloc[-1]})
    return df
memory
{}
import numpy as np
import pandas as pd

A = pd.DataFrame({"ID": ["A", "A", "C", "B", "A"], "value": [3, np.nan, 4, 5, np.nan]})
A
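For reference, this frame should print as:

  ID  value
0  A    3.0
1  A    NaN
2  C    4.0
3  B    5.0
4  A    NaN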
Now I'll update A for only the first 4 rows:
A.update(A.iloc[:4].groupby('ID', group_keys=False).apply(fill))
A
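The result should now look like:

  ID  value
0  A    3.0
1  A    3.0
2  C    4.0
3  B    5.0
4  A    NaN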
Notice that only the value in row 1 was filled. Row 4 was left alone. However, let's look at memory
memory
{'A': ID       A
      value    3
      Name: 1, dtype: object,
 'B': ID       B
      value    5
      Name: 3, dtype: object,
 'C': ID       C
      value    4
      Name: 2, dtype: object}
Or more specifically memory['A']
ID       A
value    3
Name: 1, dtype: object
So let's now update A for only row 4:
A.update(A.iloc[4:].groupby('ID', group_keys=False).apply(fill))
A
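The last row should now be filled from memory['A']:

  ID  value
0  A    3.0
1  A    3.0
2  C    4.0
3  B    5.0
4  A    3.0

In a real chunked run, the same pattern applies per chunk; a hypothetical driver loop (file names and chunk size made up, mirroring the header trick used in Solution 2 below) could look like this:

with open('output.csv', 'w') as out:
    pd.read_csv('large.csv', nrows=0).to_csv(out, index=False)  # header only
    for chunk in pd.read_csv('large.csv', chunksize=100_000):
        # fill() keeps state across chunks because memory persists between calls
        filled = chunk.groupby('ID', group_keys=False).apply(fill).sort_index()
        filled.to_csv(out, index=False, header=False)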
Solution 2:
I guess you want to read chunk-by-chunk and then write to disk after processing. I think @piRSquared's idea of "keeping memory of previously seen values" should work if the function you want to apply is ffill, though I'm sure @Jeff is right about Dask (which I'm not familiar with).
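For what it's worth, here is a rough, untested sketch of that Dask alternative (meta just describes the output schema Dask expects from the grouped apply):

import dask.dataframe as dd

ddf = dd.read_csv('test.csv')
filled = ddf.groupby('ID').apply(lambda g: g.ffill(),
                                 meta={'ID': 'object', 'value': 'f8'})
result = filled.compute()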
I have made up a slightly longer file for testing. See below.
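If you want to reproduce it, this helper snippet recreates the test file from the input printed at the end of this answer:

import numpy as np
import pandas as pd

# same 20 rows as the 'print(pd.read_csv(inputcsv))' output below
ids = list('AACBAFDACBEDABBCEFFE')
values = [3, np.nan, 4, 5, np.nan, 2, 2, 1, np.nan, 3,
          np.nan, 4, np.nan, np.nan, 5, np.nan, 4, np.nan, 1, 0]
pd.DataFrame({'ID': ids, 'value': values}).to_csv('test.csv', index=False)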
inputcsv = 'test.csv'
outputcsv = 'test.output.csv'
chunksize = 4

outfh = open(outputcsv, 'w')  # text mode for CSV output (Python 3)
memory = None
len_memory = 0

# write the file header to the output file
pd.read_csv(inputcsv, nrows=0).to_csv(outfh, index=False)

for chunk in pd.read_csv(inputcsv, chunksize=chunksize):
    if memory is not None:
        len_memory = len(memory)
        # put memory in front of the chunk
        chunk = pd.concat([memory.reset_index(), chunk], ignore_index=True)
    # ffill within each group
    chunk['value'] = chunk.groupby('ID')['value'].fillna(method='ffill')
    # update memory: last non-NaN row seen for each ID
    memory = chunk.groupby('ID').last().dropna()
    # the first len_memory rows came from memory, not the input file; drop them
    chunk = chunk.iloc[len_memory:, :]
    chunk.to_csv(outfh, index=False, header=False)

outfh.close()
print(pd.read_csv(inputcsv))
   ID  value
0   A    3.0
1   A    NaN
2   C    4.0
3   B    5.0
4   A    NaN
5   F    2.0
6   D    2.0
7   A    1.0
8   C    NaN
9   B    3.0
10  E    NaN
11  D    4.0
12  A    NaN
13  B    NaN
14  B    5.0
15  C    NaN
16  E    4.0
17  F    NaN
18  F    1.0
19  E    0.0

print(pd.read_csv(outputcsv))
   ID  value
0   A    3.0
1   A    3.0
2   C    4.0
3   B    5.0
4   A    3.0
5   F    2.0
6   D    2.0
7   A    1.0
8   C    4.0
9   B    3.0
10  E    NaN
11  D    4.0
12  A    1.0
13  B    3.0
14  B    5.0
15  C    4.0
16  E    4.0
17  F    2.0
18  F    1.0
19  E    0.0
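As a quick sanity check (only viable here because the test file does fit in memory), the chunked output should match a whole-file group ffill:

full = pd.read_csv(inputcsv)
full['value'] = full.groupby('ID')['value'].ffill()
assert full.equals(pd.read_csv(outputcsv))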