Replace NaN Value With A Median?

So I am trying to use Pandas to replace all NaN values in a table with the median across a particular range. I am working with a larger dataset but for example np.random.seed(0) rn

Solution 1:

You can use groupby.transform and fillna:

cols = ['Val','Dist']
df[cols] =  df[cols].fillna(df.groupby(df.Date.dt.floor('H'))
                              [cols].transform('median')
                           )

Output:

                  Date       Val      Dist
0  2020-09-24 00:00:00  1.764052  0.864436
1  2020-09-24 00:12:00  0.400157  0.653619
2  2020-09-24 00:24:00  0.978738  0.864436
3  2020-09-24 00:36:00  2.240893  0.864436
4  2020-09-24 00:48:00  1.867558  2.269755
5  2020-09-24 01:00:00  0.153690  0.757559
6  2020-09-24 01:12:00  0.950088  0.045759
7  2020-09-24 01:24:00 -0.151357 -0.187184
8  2020-09-24 01:36:00 -0.103219  1.532779
9  2020-09-24 01:48:00  0.410599  1.469359
10 2020-09-24 02:00:00  0.144044  0.154947
11 2020-09-24 02:12:00  1.454274  0.378163
12 2020-09-24 02:24:00  0.761038  0.154947
13 2020-09-24 02:36:00  0.121675  0.154947
14 2020-09-24 02:48:00  0.443863 -0.347912
15 2020-09-24 03:00:00  0.333674  0.156349
16 2020-09-24 03:12:00  1.494079  1.230291
17 2020-09-24 03:24:00 -0.205158  1.202380
18 2020-09-24 03:36:00  0.313068 -0.387327
19 2020-09-24 03:48:00  0.323371 -0.302303
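The question's sample data is cut off above, so here is a minimal, self-contained sketch of the same groupby.transform + fillna idea on a small made-up frame. The values and the 20-minute spacing are assumptions chosen so the medians are easy to check by hand. It also uses the lowercase "h" alias, which newer pandas releases prefer over "H".

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one reading every 20 minutes, spanning two hours.
df = pd.DataFrame({
    "Date": pd.date_range("2020-09-24", periods=6, freq="20min"),
    "Val": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
    "Dist": [np.nan, 4.0, 6.0, 5.0, np.nan, 7.0],
})

cols = ["Val", "Dist"]

# Group rows by the hour they fall in. transform returns one median per row,
# aligned with the original index, so it can be passed straight to fillna.
hourly_median = df.groupby(df["Date"].dt.floor("h"))[cols].transform("median")

# Work on a copy so the original frame keeps its NaNs for comparison.
filled = df.copy()
filled[cols] = df[cols].fillna(hourly_median)
print(filled)
```

In the first hour the known Val readings are 1 and 3, so the missing one becomes 2; the median ignores NaN by default, which is why the gaps do not affect the fill value.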

Solution 2:

You can use a groupby -> transform operation, with the pd.Grouper class doing the hourly binning. The transform returns a dataframe with the same shape and index as the value columns of your original, where each row holds the median for that row's hour. Once you have this, you can pass it directly to DataFrame.fillna:

hourly_medians = df.groupby(pd.Grouper(key="Date", freq="H")).transform("median")
out = df.fillna(hourly_medians)
print(out)

Output:

                  Date       Val      Dist
0  2020-09-24 00:00:00  1.764052  0.864436
1  2020-09-24 00:12:00  0.400157  0.653619
2  2020-09-24 00:24:00  0.978738  0.864436
3  2020-09-24 00:36:00  2.240893  0.864436
4  2020-09-24 00:48:00  1.867558  2.269755
5  2020-09-24 01:00:00  0.153690  0.757559
6  2020-09-24 01:12:00  0.950088  0.045759
7  2020-09-24 01:24:00 -0.151357 -0.187184
8  2020-09-24 01:36:00 -0.103219  1.532779
9  2020-09-24 01:48:00  0.410599  1.469359
10 2020-09-24 02:00:00  0.144044  0.154947
11 2020-09-24 02:12:00  1.454274  0.378163
12 2020-09-24 02:24:00  0.761038  0.154947
13 2020-09-24 02:36:00  0.121675  0.154947
14 2020-09-24 02:48:00  0.443863 -0.347912
15 2020-09-24 03:00:00  0.333674  0.156349
16 2020-09-24 03:12:00  1.494079  1.230291
17 2020-09-24 03:24:00 -0.205158  1.202380
18 2020-09-24 03:36:00  0.313068 -0.387327
19 2020-09-24 03:48:00  0.323371 -0.302303
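To see why transform is the right call here, it may help to compare it with a plain aggregation on the same grouping. The sketch below uses a small made-up frame (the values are assumptions, not the question's data) and selects the value columns explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one reading every 20 minutes, spanning two hours.
df = pd.DataFrame({
    "Date": pd.date_range("2020-09-24", periods=6, freq="20min"),
    "Val": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
    "Dist": [np.nan, 4.0, 6.0, 5.0, np.nan, 7.0],
})

grouped = df.groupby(pd.Grouper(key="Date", freq="h"))[["Val", "Dist"]]

# Aggregation: one row per hour, indexed by the start of each hour.
per_hour = grouped.median()

# Transform: one row per original row, each carrying its own hour's median.
per_row = grouped.transform("median")

# Because per_row shares df's index and column names, fillna lines the
# replacement values up with the gaps; the Date column is left untouched.
out = df.fillna(per_row)
print(per_hour)
print(out)
```

The aggregated frame has two rows while the transformed one has six, which is what lets fillna match each NaN with the median of its own hour.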

Solution 3:

Using what you've done, I'd do this:

df.Val = df.Val.fillna(df.Hour.map(df_val.squeeze()))
df.Dist = df.Dist.fillna(df.Hour.map(df_dist.squeeze()))
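This answer relies on names from the question (Hour, df_val, df_dist) whose definitions are not shown here. The sketch below guesses at them: it assumes Hour is the hour of day and that df_val and df_dist are one-column frames of per-hour medians. Treat those definitions, and the sample values, as assumptions rather than the asker's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one reading every 20 minutes, spanning two hours.
df = pd.DataFrame({
    "Date": pd.date_range("2020-09-24", periods=6, freq="20min"),
    "Val": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
    "Dist": [np.nan, 4.0, 6.0, 5.0, np.nan, 7.0],
})

# Assumed helper column: the hour of day for each row.
df["Hour"] = df["Date"].dt.hour

# Assumed lookup tables: one-column DataFrames of medians, indexed by Hour.
df_val = df.groupby("Hour")[["Val"]].median()
df_dist = df.groupby("Hour")[["Dist"]].median()

# squeeze() turns each one-column DataFrame into a Series, which map() then
# uses as a lookup from a row's Hour to that hour's median.
df["Val"] = df["Val"].fillna(df["Hour"].map(df_val.squeeze()))
df["Dist"] = df["Dist"].fillna(df["Hour"].map(df_dist.squeeze()))
print(df)
```

One caveat with this variant: an hour-of-day key pools the same hour from different days, so for data spanning several days the key would need to include the date as well.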

Solution 4:

You can define a function for the required task:

    def impute_nan(df, var, median):
        # Add a 'new_<var>' column with NaNs replaced by the given median.
        df['new_' + var] = df[var].fillna(median)

    median = df.Val.median()
    impute_nan(df, 'Val', median)

This will give you a new column named 'new_Val' with the NaN values replaced. Note that it fills with the median of the whole column rather than a per-hour median.
