The Result Of DataFrame.mean() Is Incorrect
Solution 1:
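Pandas is doing string concatenation for the "sum" when calculating the mean; this is plain to see from your example frame. The values in column c are stored as strings, so "summing" 2, 6 and 6 produces the string '266', which is then divided by the row count of 3.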
>>> df[df.a == 'B'].c
3    2
4    6
5    6
Name: c, dtype: object
>>> 266 / 3
88.66666666666667
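The arithmetic above can be reproduced without pandas at all. The following is only a plain-Python sketch of the effect, not pandas' actual implementation; the `values` list is a hand-written stand-in for the three strings in column c.

```python
# Stand-in for the 'c' values of the 'B' rows, held as strings, not integers
values = ['2', '6', '6']

# Adding strings with + concatenates them instead of adding numerically
total = ''
for v in values:
    total += v

print(total)                        # 266
print(float(total) / len(values))   # 88.66666666666667
```

Newer pandas versions may raise an error rather than return this number, so the exact symptom can depend on the installed version.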
If you look at the dtypes of your DataFrame, you'll notice that all of them are object, even though no single Series contains mixed types. This is due to the declaration of your numpy array. Arrays are not meant to contain heterogeneous types, so numpy coerces every element to one common type (here, strings), and the DataFrame constructor then stores the result in object columns. You can avoid this behavior by passing the constructor a list of rows instead, which can hold values of different types with no issues, so pandas can infer a dtype for each column.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
     ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]],
    columns=['a', 'b', 'c', 'd']
)
>>> df[df.a == 'B'].c.mean()
4.666666666666667
In [17]: df.dtypes
Out[17]:
a     object
b      int64
c      int64
d    float64
dtype: object
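For contrast, the coercion on the numpy side can be checked by inspecting the array before it ever reaches the DataFrame constructor. This is a minimal sketch that assumes the original array was built with np.array from rows like these; the question's exact declaration is not reproduced here, so `arr` is a hypothetical reconstruction.

```python
import numpy as np

# Hypothetical reconstruction of a mixed-type array (strings, ints, NaN)
arr = np.array([['A', 1, 2, 3], ['B', 3, 2, np.nan]])

# A single dtype applies to the whole array; kind 'U' means Unicode string
print(arr.dtype.kind)               # U

# The former integer 2 is now a string, so + on it would concatenate
print(isinstance(arr[1, 2], str))   # True
```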
I still can't imagine that this behavior is intended, so I believe it's worth opening an issue on the pandas issue tracker. In general, though, you shouldn't be using object dtype Series for numeric calculations.
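If the frame already exists with object columns, one option is to convert the numeric columns explicitly before aggregating. This is a sketch under assumptions: `df_obj` and `numeric_cols` are hypothetical names, and the frame is a small hand-built stand-in for the 'B' rows of the question's data.

```python
import pandas as pd

# Hypothetical frame whose numeric data is stored as strings in object columns
df_obj = pd.DataFrame(
    [['B', '3', '2'], ['B', '5', '6'], ['B', '5', '6']],
    columns=['a', 'b', 'c'],
    dtype=object,
)

# Convert only the columns that should be numeric; pd.to_numeric raises
# by default if a value cannot be parsed, so bad data is not hidden
numeric_cols = ['b', 'c']
df_obj[numeric_cols] = df_obj[numeric_cols].apply(pd.to_numeric)

print(df_obj.dtypes)
print(df_obj['c'].mean())   # 4.666666666666667
```

Converting once, right after loading, keeps every later aggregation on a proper numeric dtype.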