Getting Descriptive Statistics With (analytic) Weighting Using Describe() In Python
Solution 1:
So I wrote a function that performs the same thing as describe
except taking a weight argument. I tested it on the small dataframe you provided, but haven't gone into too much detail. I tried not to use .apply
in case you have a large dataframe, though I didn't run a bench mark to see if my approach would be faster/less memory intensive than writing a function to do a weighted describe for each by
group and then using apply
to apply that to each by
group in the dataframe. That would probably be easiest.
Counts, min and max can be taken without regard to weighting. Then I did simple weighted mean and std. deviation--from formula for unbiased variance. I included an option for frequency weighting, which should just effect the sample size used to adjust the variance to the unbiased estimator. Frequency weights should use the sum of the weights as the sample size, otherwise, uses the count in the data. I used this answer to help get weighted percentiles.
importpandasaspdimportnumpyasnpdf=pd.DataFrame({'year': [2016,2016,2020, 2020],'age': [41,65, 35,28],'wt':[1.2,0.7,0.8,1.5]})dfyearagewt02016 411.212016 650.722020 350.832020 281.5
Then I define the function below.
defweighted_groupby_describe(df, col, by, wt, frequency=False):
'''
df : dataframe
col : column for which you want statistics, must be single column
by : groupby column(s)
wt : column to use for weights
frequency : if True, use sample size as sum of weights (only effects degrees
of freedom correction for unbiased variance)
'''ifisinstance(by, list):
df = df.sort_values(by+[col])
else:
df = df.sort_values([by] + [col])
newcols = ['gb_weights', 'col_weighted', 'col_mean',
'col_sqdiff', 'col_sqdiff_weighted', 'gb_weights_cumsum', 'ngroup']
assertall([c notin df.columns for c in newcols])
df['gb_weights'] = df[wt]/df.groupby(by)[wt].transform('sum')
df['gb_weights_cumsum'] = df.groupby(by)['gb_weights'].cumsum()
df['col_weighted'] = df.eval('{}*gb_weights'.format(col))
df['col_mean'] = df.groupby(by)['col_weighted'].transform('sum')
df['col_sqdiff'] = df.eval('({}-col_mean)**2'.format(col))
df['col_sqdiff_weighted'] = df.eval('col_sqdiff*gb_weights')
wstd = df.groupby(by)['col_sqdiff_weighted'].sum()**(0.5)
wstd.name = 'std'
wmean = df.groupby(by)['col_weighted'].sum()
wmean.name = 'mean'
df['ngroup'] = df.groupby(by).ngroup()
quantiles = np.array([0.25, 0.5, 0.75])
weighted_quantiles = df['gb_weights_cumsum'] - 0.5*df['gb_weights'] + df['ngroup']
ngroups = df['ngroup'].max()+1
x = np.hstack([quantiles+i for i inrange(ngroups)])
quantvals = np.interp(x, weighted_quantiles, df[col])
quantvals = np.reshape(quantvals, (ngroups, -1))
other = df.groupby(by)[col].agg(['min', 'max', 'count'])
stats = pd.concat([wmean, wstd, other], axis=1, sort=False)
stats['25%'] = quantvals[:, 0]
stats['50%'] = quantvals[:, 1]
stats['75%'] = quantvals[:, 2]
colorder = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
stats = stats[colorder]
if frequency:
sizes = df.groupby(by)[wt].sum()
else:
sizes = stats['count']
stats['weight'] = sizes
# use the "sample size" (weight) to obtain std. deviation from unbiased# variance
stats['std'] = stats.eval('((std**2)*(weight/(weight-1)))**(1/2)')
return stats
And test it out.
weighted_groupby_describe(df, 'age', 'year', 'wt')
count mean std min ... 50% 75% max weight
year ...
2016249.84210516.37239841 ... 49.84210561.8421056522020230.4347834.71493628 ... 30.43478333.934783352
Compare this to the output without the weights.
df.groupby('year')['age'].describe()
count mean std min25% 50% 75% max
year
20162.053.016.97056341.047.0053.059.0065.020202.031.54.94974728.029.7531.533.2535.0
Post a Comment for "Getting Descriptive Statistics With (analytic) Weighting Using Describe() In Python"