Replace NaN Values With a Median?
So I am trying to use Pandas to replace all NaN values in a table with the median across a particular range. I am working with a larger dataset but for example np.random.seed(0) rn
Solution 1:
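You can use groupby.transform and fillna, grouping on the date floored to the hour so that each NaN is filled with the median of its own hour: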
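cols = ['Val', 'Dist']
# Per-row hourly medians, aligned with df's index.
# Note: pandas 2.2+ deprecates the 'H' alias in favor of 'h'.
hourly_medians = df.groupby(df.Date.dt.floor('H'))[cols].transform('median')
df[cols] = df[cols].fillna(hourly_medians)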
Output:
                  Date       Val      Dist
0  2020-09-24 00:00:00  1.764052  0.864436
1  2020-09-24 00:12:00  0.400157  0.653619
2  2020-09-24 00:24:00  0.978738  0.864436
3  2020-09-24 00:36:00  2.240893  0.864436
4  2020-09-24 00:48:00  1.867558  2.269755
5  2020-09-24 01:00:00  0.153690  0.757559
6  2020-09-24 01:12:00  0.950088  0.045759
7  2020-09-24 01:24:00 -0.151357 -0.187184
8  2020-09-24 01:36:00 -0.103219  1.532779
9  2020-09-24 01:48:00  0.410599  1.469359
10 2020-09-24 02:00:00  0.144044  0.154947
11 2020-09-24 02:12:00  1.454274  0.378163
12 2020-09-24 02:24:00  0.761038  0.154947
13 2020-09-24 02:36:00  0.121675  0.154947
14 2020-09-24 02:48:00  0.443863 -0.347912
15 2020-09-24 03:00:00  0.333674  0.156349
16 2020-09-24 03:12:00  1.494079  1.230291
17 2020-09-24 03:24:00 -0.205158  1.202380
18 2020-09-24 03:36:00  0.313068 -0.387327
19 2020-09-24 03:48:00  0.323371 -0.302303
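If you want something you can run end to end, here is a minimal sketch of the same groupby.transform + fillna idea on a small made-up frame. The data and the names hour, medians and filled are illustrative assumptions, not the question's dataset. It uses the lowercase 'h' alias, which newer pandas prefers over 'H'.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one reading every 12 minutes over two hours.
df = pd.DataFrame({
    "Date": pd.date_range("2020-09-24", periods=10, freq="12min"),
    "Val": [1.0, np.nan, 3.0, 5.0, np.nan, 10.0, 20.0, np.nan, 40.0, 50.0],
    "Dist": [np.nan, 2.0, 4.0, np.nan, 6.0, 1.0, np.nan, 3.0, 5.0, 7.0],
})

cols = ["Val", "Dist"]

# Floor each timestamp to the start of its hour; this is the grouping key.
hour = df["Date"].dt.floor("h")

# transform("median") returns one row per original row (same index),
# holding the median of that row's hour. NaNs are skipped by median.
medians = df.groupby(hour)[cols].transform("median")

# fillna aligns on index and column labels, so only the NaN cells change.
filled = df.copy()
filled[cols] = df[cols].fillna(medians)

print(filled)
```

Because transform keeps the original index, the result lines up row for row with df, so no merge or reindex is needed before fillna.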
Solution 2:
You can use a groupby -> transform operation, with the pd.Grouper class doing the hourly grouping. This creates a DataFrame of hourly medians that lines up row for row with your original. Once you have this, you can pass it directly to DataFrame.fillna:
hourly_medians = df.groupby(pd.Grouper(key="Date", freq="H")).transform("median")
out = df.fillna(hourly_medians)
print(out)

                  Date       Val      Dist
0  2020-09-24 00:00:00  1.764052  0.864436
1  2020-09-24 00:12:00  0.400157  0.653619
2  2020-09-24 00:24:00  0.978738  0.864436
3  2020-09-24 00:36:00  2.240893  0.864436
4  2020-09-24 00:48:00  1.867558  2.269755
5  2020-09-24 01:00:00  0.153690  0.757559
6  2020-09-24 01:12:00  0.950088  0.045759
7  2020-09-24 01:24:00 -0.151357 -0.187184
8  2020-09-24 01:36:00 -0.103219  1.532779
9  2020-09-24 01:48:00  0.410599  1.469359
10 2020-09-24 02:00:00  0.144044  0.154947
11 2020-09-24 02:12:00  1.454274  0.378163
12 2020-09-24 02:24:00  0.761038  0.154947
13 2020-09-24 02:36:00  0.121675  0.154947
14 2020-09-24 02:48:00  0.443863 -0.347912
15 2020-09-24 03:00:00  0.333674  0.156349
16 2020-09-24 03:12:00  1.494079  1.230291
17 2020-09-24 03:24:00 -0.205158  1.202380
18 2020-09-24 03:36:00  0.313068 -0.387327
19 2020-09-24 03:48:00  0.323371 -0.302303
Solution 3:
Using what you've done, I'd do this:
# Look up each row's hourly median by its Hour, then fill only the NaNs
df.Val = df.Val.fillna(df.Hour.map(df_val.squeeze()))
df.Dist = df.Dist.fillna(df.Hour.map(df_dist.squeeze()))
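Since the question's df_val and df_dist are not shown here, the following is only a sketch of the same map-based idea with stand-ins. The sample data, the Hour column built from dt.hour, and the names val_medians and dist_medians are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one reading every 20 minutes over two hours.
df = pd.DataFrame({
    "Date": pd.date_range("2020-09-24", periods=6, freq="20min"),
    "Val": [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
    "Dist": [np.nan, 2.0, 4.0, 5.0, 7.0, np.nan],
})

# Assumed helper column: the hour of day for each row.
df["Hour"] = df["Date"].dt.hour

# One median per hour, as Series indexed by Hour (stand-ins for the
# question's df_val / df_dist after .squeeze()).
val_medians = df.groupby("Hour")["Val"].median()
dist_medians = df.groupby("Hour")["Dist"].median()

# map() looks up each row's Hour in the medians Series; fillna then
# uses that value only where the original cell is NaN.
df["Val"] = df["Val"].fillna(df["Hour"].map(val_medians))
df["Dist"] = df["Dist"].fillna(df["Hour"].map(dist_medians))

print(df)
```

One caveat with this sketch: dt.hour ignores the date, so on data spanning several days the same hour from different days would share one median. Flooring the timestamp to the hour keeps the days separate.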
Solution 4:
You can define a function for the required task:
def impute_nan(df, var, median):
    # Add a new column with NaNs filled; the original column is left as-is
    df['new_' + var] = df[var].fillna(median)

median = df.Val.median()
print(median)
impute_nan(df, 'Val', median)
This will give you a new column named 'new_Val' with the NaN values replaced. Note that it uses the median of the whole column, not a per-hour median.