Summary Calculations On A Pandas Dataframe
Solution 1:
See here for the resolution and discussion: Selecting a new dataframe via a multi-indexed frame in Pandas using index names
Basically, all you need for a diff from the prior period is:
df.groupby(level='region').apply(lambda x: x.diff().iloc[-1])
and for a diff from a year ago (4 quarters):
df.groupby(level='region').apply(lambda x: x.diff(4).iloc[-1])
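To see what those two one-liners return, here is a minimal sketch on made-up data. The frame `toy`, its values, and the index level name `date` are assumptions for illustration only; the sketch also assumes each region's rows are in chronological order with one row per quarter, since `diff(4)` counts rows, not dates.

```python
import pandas as pd

# Toy data (hypothetical): two regions, five consecutive quarter-end dates each.
dates = pd.to_datetime(['2012-03-31', '2012-06-30', '2012-09-30',
                        '2012-12-31', '2013-03-31'])
index = pd.MultiIndex.from_product([['East', 'West'], dates],
                                   names=['region', 'date'])
toy = pd.DataFrame(
    {'Score1': [1.0, 2.0, 4.0, 7.0, 11.0, 10.0, 9.0, 7.0, 4.0, 0.0],
     'Score2': [5.0, 5.0, 6.0, 6.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0]},
    index=index)

by_region = toy.groupby(level='region')

# Change versus the previous quarter, read off the last row of each region.
prior_period = by_region.apply(lambda x: x.diff().iloc[-1])

# Change versus the same quarter one year earlier (4 rows back).
prior_year = by_region.apply(lambda x: x.diff(4).iloc[-1])

print(prior_period)
print(prior_year)
```

Each result has one row per region and one column per score.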
Solution 2:
I think you are somewhat on the right track. I would write a function that calculates the two values you are looking for and returns a data frame. Something like the following:
import pandas as pd

def find_diffs(region):
    score_cols = ['Score1', 'Score2']
    region = region.sort_values('Quradate')
    # Compare by calendar quarter (assumes Quradate is a datetime column),
    # so the exact day within the quarter does not matter
    quarters = region.Quradate.dt.to_period('Q')
    most_recent = quarters.max()
    last_quarter = most_recent - 1  # shift back one quarter (3 months)
    last_year = most_recent - 4  # shift back one year (4 quarters)
    quarter_score_diff = region.loc[(quarters == most_recent) | (quarters == last_quarter), score_cols].diff().iloc[[-1]]
    quarter_score_diff['id'] = 'quarter_diff'
    year_score_diff = region.loc[(quarters == most_recent) | (quarters == last_year), score_cols].diff().iloc[[-1]]
    year_score_diff['id'] = 'year_diff'
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    df_temp = pd.concat([quarter_score_diff, year_score_diff])
    return df_temp
Then you can:
DF.groupby(['region']).apply(find_diffs)
The result will be a DF indexed by region with columns for each score difference and an additional column that identifies each row as a quarter or yearly difference.
Solution 3:
Writing a function to use with groupby is definitely an option. Another easy approach is to turn the data in each group into lists and use list indices for the calculations. This works because your data are regularly spaced (and bear in mind it only works if the data are regularly spaced), and it avoids having to work with the dates at all. First I would reindex so that region appears in the dataframe as a column, then I would do the following:
# First I create some data
import numpy as np
import pandas as pd

Dates = pd.date_range('2010-1-1', periods=14, freq='Q')  # quarter-end dates; pandas 2.2+ prefers freq='QE'
Regions = ['Western', 'Eastern', 'Southern', 'Northern']
df = pd.DataFrame({'Regions': [elem for elem in Regions for x in range(14)],
                   'Score1': np.random.rand(56), 'Score2': np.random.rand(56), 'Score3': np.random.rand(56),
                   'Score4': np.random.rand(56), 'Score5': np.random.rand(56)}, index=list(Dates) * 4)
# Create a dictionary to hold your data
SCORES = ['Score1', 'Score2', 'Score3', 'Score4', 'Score5']
ValuesDict = {region : {score : [int(), int()] for score in SCORES} for region in df.Regions.unique()}
# This dictionary has your regions as keys. Each region maps to a dictionary
# keyed by score, and each score maps to a list whose first element is the
# "most recent minus last quarter" calculation and whose second element is the
# "most recent minus last year" calculation.
# Now group the data
dfGrouped = df.groupby('Regions')
# Now iterate through the groups, creating lists of the underlying data.
# groupby keeps the rows of each group in their original order, so as long as
# the data are in chronological order the last element of each list is the
# newest observation, and the observation from one year earlier is 4 index
# points before it.
for group in dfGrouped:
    Score1List = list(group[1].Score1)
    Score2List = list(group[1].Score2)
    Score3List = list(group[1].Score3)
    Score4List = list(group[1].Score4)
    Score5List = list(group[1].Score5)
    MasterList = [Score1List, Score2List, Score3List, Score4List, Score5List]
    for x in range(1, 6):
        ValuesDict[group[0]]['Score' + str(x)][0] = MasterList[x-1][-1] - MasterList[x-1][-2]
        ValuesDict[group[0]]['Score' + str(x)][1] = MasterList[x-1][-1] - MasterList[x-1][-5]
ValuesDict
It's a bit convoluted, but this is the way I often approach these types of problems. ValuesDict contains all the data you need, but I'm having difficulty getting it into a dataframe.
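One possible way to get a nested dictionary of that shape into a dataframe is to flatten it to one row per (region, score) pair first. This is only a sketch: `values` below is a small hypothetical stand-in with the same structure as ValuesDict, and the column names `quarter_diff` and `year_diff` are labels chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in with the same shape as ValuesDict:
# {region: {score: [quarter_diff, year_diff]}}
values = {
    'Eastern': {'Score1': [0.1, -0.2], 'Score2': [0.3, 0.4]},
    'Western': {'Score1': [-0.5, 0.6], 'Score2': [0.7, -0.8]},
}

# Flatten to {(region, score): [quarter_diff, year_diff]}, one entry per row.
flat = {(region, score): diffs
        for region, scores in values.items()
        for score, diffs in scores.items()}

index = pd.MultiIndex.from_tuples(list(flat.keys()), names=['Regions', 'Score'])
result = pd.DataFrame(list(flat.values()), index=index,
                      columns=['quarter_diff', 'year_diff'])

# Optional: one row per region, with (measure, score) column pairs.
wide = result.unstack('Score')

print(result)
print(wide)
```

The long form (`result`) is easier to filter, while the wide form (`wide`) is closer to a one-row-per-region summary table.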