Getting Descriptive Statistics With (analytic) Weighting Using Describe() In Python

I was trying to translate code from Stata to Python. The original code in Stata:

by year, sort : summarize age [aweight = wt]

Normally a simple describe() function will do datafram

Solution 1:

So I wrote a function that does the same thing as describe() but takes a weight argument. I tested it on the small dataframe you provided, but haven't checked it in much detail. I avoided .apply in case you have a large dataframe, though I didn't run a benchmark to see whether my approach is faster or less memory-intensive than the alternative: writing a function that does a weighted describe for a single group and then using .apply to run it on each group in the dataframe. That alternative would probably be the easiest to write.
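The apply-based alternative mentioned above could look roughly like the following. This is a minimal sketch, not a benchmarked implementation: the helper name weighted_describe is made up for illustration, it treats the weights as analytic weights, and it leaves out the percentiles.

```python
import numpy as np
import pandas as pd

def weighted_describe(group, col, wt):
    """Weighted summary of a single group (hypothetical helper)."""
    x = group[col].to_numpy(dtype=float)
    w = group[wt].to_numpy(dtype=float)
    w = w / w.sum()                        # normalize weights within the group
    mean = np.sum(w * x)                   # weighted mean
    var = np.sum(w * (x - mean) ** 2)      # biased weighted variance
    n = len(x)
    # rescale by n / (n - 1) to get the unbiased estimate (analytic weights)
    std = np.sqrt(var * n / (n - 1)) if n > 1 else np.nan
    return pd.Series({'count': n, 'mean': mean, 'std': std,
                      'min': x.min(), 'max': x.max()})

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})

# select the needed columns explicitly, then apply the helper per group
result = df.groupby('year')[['age', 'wt']].apply(weighted_describe,
                                                 col='age', wt='wt')
print(result)
```

On a large dataframe with many groups this calls the Python helper once per group, which is the overhead the vectorized function below tries to avoid.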

Counts, min and max can be taken without regard to weighting. Then I computed a simple weighted mean and standard deviation, the latter from the formula for the unbiased variance. I included an option for frequency weighting, which should only affect the sample size used to adjust the variance to the unbiased estimator: with frequency weights the sample size is the sum of the weights; otherwise it is the count of observations in the data. I used this answer to help get weighted percentiles.
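To make those steps concrete, here is a small single-group sketch in plain NumPy, using the 2016 rows of the example data. The function name weighted_stats is hypothetical, and the percentile rule (linear interpolation between the midpoints of the cumulative normalized weights) is my reading of the approach described here rather than a definitive definition.

```python
import numpy as np

def weighted_stats(x, w, frequency=False):
    """Weighted mean, std and quartiles for one group (illustrative only)."""
    order = np.argsort(x)
    x, w = x[order], w[order]
    p = w / w.sum()                          # normalized weights
    mean = np.sum(p * x)
    biased_var = np.sum(p * (x - mean) ** 2)
    # sample size: sum of weights for frequency weights, else the row count
    n = w.sum() if frequency else len(x)
    std = np.sqrt(biased_var * n / (n - 1))
    # midpoint of each observation's slice of the cumulative weight
    positions = np.cumsum(p) - 0.5 * p
    q25, q50, q75 = np.interp([0.25, 0.5, 0.75], positions, x)
    return mean, std, (q25, q50, q75)

# ages and weights for year 2016 in the example dataframe
x = np.array([41.0, 65.0])
w = np.array([1.2, 0.7])
mean, std, (q25, q50, q75) = weighted_stats(x, w)
print(mean, std, q25, q50, q75)
```

Note that np.interp clamps to the smallest or largest value when a requested quantile falls outside the range of midpoints, which is why the 25% value here is simply the minimum.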

import pandas as pd
import numpy as np

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})
df

   year  age   wt
0  2016   41  1.2
1  2016   65  0.7
2  2020   35  0.8
3  2020   28  1.5

Then I define the function below.

def weighted_groupby_describe(df, col, by, wt, frequency=False):
    '''
    df : dataframe
    col : column for which you want statistics, must be single column
    by : groupby column(s)
    wt : column to use for weights
    frequency : if True, use sample size as sum of weights (only affects degrees
    of freedom correction for unbiased variance)
    '''
    # sort by group, then by value, so cumulative weights run in value order
    if isinstance(by, list):
        df = df.sort_values(by + [col])
    else:
        df = df.sort_values([by] + [col])
    newcols = ['gb_weights', 'col_weighted', 'col_mean', 
        'col_sqdiff', 'col_sqdiff_weighted', 'gb_weights_cumsum', 'ngroup']
    # the helper columns added below must not already exist in df
    assert all([c not in df.columns for c in newcols])
    df['gb_weights'] = df[wt]/df.groupby(by)[wt].transform('sum')
    df['gb_weights_cumsum'] = df.groupby(by)['gb_weights'].cumsum()
    df['col_weighted'] = df.eval('{}*gb_weights'.format(col))
    df['col_mean'] = df.groupby(by)['col_weighted'].transform('sum')
    df['col_sqdiff'] = df.eval('({}-col_mean)**2'.format(col))
    df['col_sqdiff_weighted'] = df.eval('col_sqdiff*gb_weights')
    wstd = df.groupby(by)['col_sqdiff_weighted'].sum()**(0.5)
    wstd.name = 'std'
    wmean = df.groupby(by)['col_weighted'].sum()
    wmean.name = 'mean'
    df['ngroup'] = df.groupby(by).ngroup()
    quantiles = np.array([0.25, 0.5, 0.75])
    weighted_quantiles = df['gb_weights_cumsum'] - 0.5*df['gb_weights'] + df['ngroup']
    ngroups = df['ngroup'].max()+1
    x = np.hstack([quantiles + i for i in range(ngroups)])
    quantvals = np.interp(x, weighted_quantiles, df[col])
    quantvals = np.reshape(quantvals, (ngroups, -1))
    other = df.groupby(by)[col].agg(['min', 'max', 'count'])
    stats = pd.concat([wmean, wstd, other], axis=1, sort=False)
    stats['25%'] = quantvals[:, 0]
    stats['50%'] = quantvals[:, 1]
    stats['75%'] = quantvals[:, 2]
    colorder = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    stats = stats[colorder]
    if frequency:
        sizes = df.groupby(by)[wt].sum()
    else:
        sizes = stats['count']
    stats['weight'] = sizes
    # use the "sample size" (weight) to obtain std. deviation from unbiased
    # variance
    stats['std'] = stats.eval('((std**2)*(weight/(weight-1)))**(1/2)')
    return stats
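
One caveat on the percentile step: the function offsets each group's positions by its group number and evaluates every group with a single np.interp call. As far as I can tell, when a requested quantile falls outside a group's range of midpoints, that call interpolates toward the neighbouring group's values instead of clamping to the group's own min or max (with the example data, this affects the 25% value for 2020). A per-group variant avoids that at the cost of one np.interp call per group. The sketch below is an assumption-laden illustration, not the function above; group_quantiles is a made-up helper name.

```python
import numpy as np
import pandas as pd

def group_quantiles(group, col, wt):
    """Weighted quartiles for a single group (hypothetical helper)."""
    group = group.sort_values(col)
    x = group[col].to_numpy(dtype=float)
    p = group[wt].to_numpy(dtype=float)
    p = p / p.sum()                          # normalized weights
    positions = np.cumsum(p) - 0.5 * p       # same midpoint rule as above
    # within one group, np.interp clamps to x[0] / x[-1] outside positions
    vals = np.interp([0.25, 0.5, 0.75], positions, x)
    return pd.Series(vals, index=['25%', '50%', '75%'])

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})

quants = df.groupby('year')[['age', 'wt']].apply(group_quantiles,
                                                 col='age', wt='wt')
print(quants)
```

The 50% and 75% values from this sketch agree with the function's output for the example data; only quantiles that land outside a group's midpoint range differ.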

And test it out.

weighted_groupby_describe(df, 'age', 'year', 'wt')
      count       mean        std  min  ...        50%        75%  max  weight
year                                    ...                                   
2016      2  49.842105  16.372398   41  ...  49.842105  61.842105   65       2
2020      2  30.434783   4.714936   28  ...  30.434783  33.934783   35       2

Compare this to the output without the weights.

      count  mean        std   min    25%   50%    75%   max
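
For reference, an unweighted table with these columns can be produced with pandas' built-in grouped describe(). A minimal sketch, assuming the same example dataframe as above:

```python
import pandas as pd

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})

# plain describe() per year; the 'wt' column is ignored entirely
unweighted = df.groupby('year')['age'].describe()
print(unweighted)
```

Here every row counts equally, so the means are the simple averages of the ages in each year rather than the weighted means shown earlier.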

