
Groupby Of Split Data (pandas)

Imagine you have a large CSV file with several million rows that you process in chunks. The file is too large to load into memory. What would be the best way to do a groupby and apply a function (for example ffill) to each group across chunks?
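For context, the naive approach of calling groupby + ffill on each chunk independently does not work: a NaN at the top of a chunk cannot see the last value of its group from the previous chunk, which is exactly the problem both answers below address. A minimal sketch of the broken approach (the file name, column names, and chunk size here are placeholders):

import pandas as pd

# Broken: each chunk is filled in isolation, so groups that span a
# chunk boundary keep their leading NaNs.
for chunk in pd.read_csv('large.csv', chunksize=100_000):
    chunk['value'] = chunk.groupby('ID')['value'].ffill()
    # ... write chunk somewhere ...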

Solution 1:

You need a way to remember the last fill value for each group. I use the dictionary memory below:

import numpy as np
import pandas as pd

memory = {}

def fill(df):
    name = df.name
    df = df.copy()

    # fill the first row from memory if we have seen this group before
    if name in memory:
        df.iloc[0, :] = df.iloc[0, :].fillna(memory[name])

    # normal ffill
    df = df.ffill()

    # update memory with the last row of this group
    memory.update({name: df.iloc[-1]})

    return df

memory

{}

A = pd.DataFrame({"ID": ["A", "A", "C", "B", "A"],
                  "value": [3, np.nan, 4, 5, np.nan]})
A

  ID  value
0  A    3.0
1  A    NaN
2  C    4.0
3  B    5.0
4  A    NaN

Now I'll update A for only the first 4 rows:

A.update(A.iloc[:4].groupby('ID', group_keys=False).apply(fill))
A

  ID  value
0  A    3.0
1  A    3.0
2  C    4.0
3  B    5.0
4  A    NaN

Notice that only the value in row 1 was filled; row 4 was left alone. However, let's look at memory:

memory

{'A': ID       A
      value    3
      Name: 1, dtype: object,
 'B': ID       B
      value    5
      Name: 3, dtype: object,
 'C': ID       C
      value    4
      Name: 2, dtype: object}

Or more specifically memory['A']

ID       A
value    3
Name: 1, dtype: object

So let's now update A for only row 4

A.update(A.iloc[4:].groupby('ID', group_keys=False).apply(fill))
A

  ID  value
0  A    3.0
1  A    3.0
2  C    4.0
3  B    5.0
4  A    3.0
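Row 4 was filled from the remembered value for group 'A'. To tie this back to the original chunked-file question, the same fill function can be applied chunk by chunk; a minimal sketch (the file names and chunk size are placeholders):

with open('output.csv', 'w') as outfh:
    # write the header once
    pd.read_csv('large.csv', nrows=0).to_csv(outfh, index=False)

    memory = {}  # reset the cache before a fresh run
    for chunk in pd.read_csv('large.csv', chunksize=100_000):
        # fill each group, carrying last-seen values across chunks
        filled = chunk.groupby('ID', group_keys=False).apply(fill)
        # groupby.apply returns groups in key order; restore row order
        filled.sort_index().to_csv(outfh, index=False, header=False)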

Solution 2:

I guess you want to read the file chunk by chunk and write to disk after processing. I think @piRSquared's idea of "keeping memory of previously seen values" should work if the function you want to apply is ffill, though I'm sure @Jeff is right about Dask (which I'm not familiar with).

I have made up a slightly longer file for testing. See below.
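If you want to reproduce the test, here is one way to generate an equivalent test.csv (its contents match the printout of pd.read_csv(inputcsv) at the end of this answer):

import numpy as np
import pandas as pd

# the same 20 rows as printed from the input file below
pd.DataFrame({
    'ID': list('AACBAFDACBEDABBCEFFE'),
    'value': [3, np.nan, 4, 5, np.nan, 2, 2, 1, np.nan, 3,
              np.nan, 4, np.nan, np.nan, 5, np.nan, 4, np.nan, 1, 0],
}).to_csv('test.csv', index=False)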

import pandas as pd

inputcsv = 'test.csv'
outputcsv = 'test.output.csv'
chunksize = 4

outfh = open(outputcsv, 'w')
memory = None
len_memory = 0

# write the file header to the output file
pd.read_csv(inputcsv, nrows=0).to_csv(outfh, index=False)

for chunk in pd.read_csv(inputcsv, chunksize=chunksize):
    if memory is not None:
        len_memory = len(memory)
        # put memory (last seen value per ID) in front of the chunk
        chunk = pd.concat([memory.reset_index(), chunk], ignore_index=True)
    # ffill within each ID group
    chunk['value'] = chunk.groupby('ID')['value'].ffill()
    # update memory
    memory = chunk.groupby('ID').last().dropna()
    # the first len_memory rows came from memory, not the input file; drop them
    chunk = chunk.iloc[len_memory:, :]
    chunk.to_csv(outfh, index=False, header=False)
outfh.close()
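One detail worth noting: GroupBy.last() returns the last non-null entry per group, so memory keeps a usable fill value even when a group's latest row is NaN, and .dropna() discards groups with no value seen yet. A small illustration of that behavior:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['E', 'E'], 'value': [4.0, np.nan]})
print(df.groupby('ID').last())
#     value
# ID
# E     4.0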

print(pd.read_csv(inputcsv))

   ID  value
0   A    3.0
1   A    NaN
2   C    4.0
3   B    5.0
4   A    NaN
5   F    2.0
6   D    2.0
7   A    1.0
8   C    NaN
9   B    3.0
10  E    NaN
11  D    4.0
12  A    NaN
13  B    NaN
14  B    5.0
15  C    NaN
16  E    4.0
17  F    NaN
18  F    1.0
19  E    0.0

print(pd.read_csv(outputcsv))

   ID  value
0   A    3.0
1   A    3.0
2   C    4.0
3   B    5.0
4   A    3.0
5   F    2.0
6   D    2.0
7   A    1.0
8   C    4.0
9   B    3.0
10  E    NaN
11  D    4.0
12  A    1.0
13  B    3.0
14  B    5.0
15  C    4.0
16  E    4.0
17  F    2.0
18  F    1.0
19  E    0.0
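As a sanity check, when the file is small enough to load at once, the chunked output can be compared against a full in-memory groupby ffill (a sketch reusing the file names above):

import pandas as pd

expected = pd.read_csv('test.csv')
expected['value'] = expected.groupby('ID')['value'].ffill()

chunked = pd.read_csv('test.output.csv')
pd.testing.assert_frame_equal(chunked, expected)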
