
Summary Calculations On A Pandas Dataframe

I have a DF that looks like the one at the bottom (an excerpt; there are 4 regions and the dates expand each quarter). I want to create a df (by region) with just the difference between the newe

Solution 1:

See here for the resolution and discussion: Selecting a new dataframe via a multi-indexed frame in Pandas using index names

Basically, all you need for a diff from the prior period is

df.groupby(level='region').apply(lambda x: x.diff().iloc[-1])

and for a diff from a year ago (4 quarters)

df.groupby(level='region').apply(lambda x: x.diff(4).iloc[-1])
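
To see what these two one-liners return, here is a small self-contained sketch. The frame is made up for illustration (two hypothetical regions, six quarter-end dates, index levels named region and date); only the two groupby calls come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2 regions x 6 quarter-end dates, indexed by (region, date)
dates = pd.to_datetime(['2012-03-31', '2012-06-30', '2012-09-30',
                        '2012-12-31', '2013-03-31', '2013-06-30'])
index = pd.MultiIndex.from_product([['East', 'West'], dates],
                                   names=['region', 'date'])
df = pd.DataFrame({'Score1': np.arange(12, dtype=float),
                   'Score2': np.arange(12, dtype=float) ** 2},
                  index=index)

# Newest row minus the previous quarter, one row per region
quarter_diff = df.groupby(level='region').apply(lambda x: x.diff().iloc[-1])

# Newest row minus the row 4 quarters earlier, one row per region
year_diff = df.groupby(level='region').apply(lambda x: x.diff(4).iloc[-1])

print(quarter_diff)
print(year_diff)
```

Both results are indexed by region with one column per score. Note that this relies on the rows within each region already being in date order.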

Solution 2:

I think you are somewhat on the right track. In my mind I would make a function that calculates the two values you are looking for and returns a data frame. Something like the following:

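import pandas as pd

def find_diffs(region):
    score_cols = ['Score1', 'Score2']
    region = region.sort_values('Quradate')

    most_recent_date = region.Quradate.max()
    # Assumes Quradate holds quarter-end timestamps (e.g. 2013-06-30)
    last_quarter = most_recent_date - pd.offsets.QuarterEnd(1)  # shift back one quarter (3 months)
    last_year = most_recent_date - pd.offsets.QuarterEnd(4)  # shift back one year

    # Keep only the two rows being compared; the last row of diff() is newest minus older
    quarter_rows = region[region.Quradate.isin([most_recent_date, last_quarter])]
    quarter_score_diff = quarter_rows[score_cols].diff().tail(1).assign(id='quarter_diff')

    year_rows = region[region.Quradate.isin([most_recent_date, last_year])]
    year_score_diff = year_rows[score_cols].diff().tail(1).assign(id='year_diff')

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df_temp = pd.concat([quarter_score_diff, year_score_diff])
    return df_temp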

Then you can:

DF.groupby(['region']).apply(find_diffs)

The result will be a DF indexed by region with columns for each score difference and an additional column that identifies each row as a quarter or yearly difference.
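
One thing to watch with this kind of date lookup: shifting by a fixed number of days (365/4 or 365) does not always land exactly on an earlier quarter-end, so an equality test against the date column can silently match nothing. The sketch below compares a fixed-length shift with pandas' calendar-aware QuarterEnd offset. The two dates are chosen purely for illustration, and it assumes your dates are quarter-end timestamps.

```python
import datetime
import pandas as pd

most_recent = pd.Timestamp('2013-06-30')

# Fixed-length shift: 365/4 days is 91.25 days, so the result carries a time of day
approx_quarter = most_recent - datetime.timedelta(days=365 / 4)

# Calendar-aware shift: roll back one quarter-end (quarters ending Mar/Jun/Sep/Dec)
exact_quarter = most_recent - pd.offsets.QuarterEnd(1)

# A 365-day shift also drifts when the window contains 29 February
leap_case = pd.Timestamp('2012-03-31')
approx_year = leap_case - datetime.timedelta(days=365)
exact_year = leap_case - pd.offsets.QuarterEnd(4)

print(approx_quarter, exact_quarter)
print(approx_year, exact_year)
```

Here only the QuarterEnd results fall on actual quarter-end dates, which is what an exact match on the date column needs.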

Solution 3:

Writing a function to use with groupby is definitely an option. Another easy approach is to make lists of the data in each group and use the indices to do your calculations. This is possible because your data are regularly spaced (bear in mind that it only works if the data are regularly spaced), and it avoids having to work with the dates at all. First I would reindex so that region appears in the dataframe as a column, then I would do the following:

# First I create some data
import numpy as np
import pandas as pd

# On pandas 2.2+, the 'Q' alias is deprecated in favor of 'QE'
Dates = pd.date_range('2010-1-1', periods=14, freq='Q')
Regions = ['Western', 'Eastern', 'Southern', 'Northern']
df = pd.DataFrame({'Regions': [elem for elem in Regions for x in range(14)],
                   'Score1': np.random.rand(56), 'Score2': np.random.rand(56), 'Score3': np.random.rand(56),
                   'Score4': np.random.rand(56), 'Score5': np.random.rand(56)}, index=list(Dates) * 4)

# Create a dictionary to hold your data
SCORES = ['Score1', 'Score2', 'Score3', 'Score4', 'Score5']
ValuesDict = {region : {score : [int(), int()] for score in SCORES} for region in df.Regions.unique()}

# This dictionary has your regions as keys. Each region maps to a dictionary keyed by score,
# and each score maps to a list whose first element is the most recent - last quarter
# calculation and whose second element is the most recent - last year calculation.

# Now group the data
dfGrouped = df.groupby('Regions')

# Now iterate through the groups, creating lists of the underlying data. The value at the
# last index of each list is the newest (groupby keeps the row order within each group, and
# the rows are already in date order), and the observation one year before it is 4 index
# points earlier.
for group in dfGrouped:
    Score1List = list(group[1].Score1)
    Score2List = list(group[1].Score2)
    Score3List = list(group[1].Score3)
    Score4List = list(group[1].Score4)
    Score5List = list(group[1].Score5)
    MasterList = [Score1List, Score2List, Score3List, Score4List, Score5List]
    for x in range(1, 6):
        ValuesDict[group[0]]['Score' + str(x)][0] = MasterList[x-1][-1] - MasterList[x-1][-2]
        ValuesDict[group[0]]['Score' + str(x)][1] = MasterList[x-1][-1] - MasterList[x-1][-5]

ValuesDict

It's a bit convoluted, but this is the way I often approach these types of problems. ValuesDict contains all the data you need, but I'm having difficulty getting it into a dataframe.
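
On that last point, one possible way to get a nested dictionary shaped like ValuesDict into a dataframe is to flatten it into one record per (region, score) pair. This is only a minimal sketch: it uses a small hand-written dictionary with made-up numbers in place of the real ValuesDict, and the column names quarter_diff and year_diff are labels I chose for illustration.

```python
import pandas as pd

# Stand-in for ValuesDict: region -> score -> [quarter difference, year difference]
# (the numbers are made up for illustration)
values_dict = {
    'Western': {'Score1': [0.1, -0.2], 'Score2': [0.3, 0.4]},
    'Eastern': {'Score1': [-0.5, 0.6], 'Score2': [0.7, -0.8]},
}

# One flat record per (region, score) pair
records = [
    {'Regions': region, 'Score': score, 'quarter_diff': q, 'year_diff': y}
    for region, scores in values_dict.items()
    for score, (q, y) in scores.items()
]

result = pd.DataFrame(records).set_index(['Regions', 'Score'])
print(result)
```

If you would rather have one row per region and the scores spread across columns, calling result.unstack('Score') on this frame should give that layout.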
