
The Result Of Dataframe.mean() Is Incorrect

I am working in Python 2.7. I have a data frame and I want to get the average of the column called 'c', but only for the rows where the values in another column are equal

Solution 1:

Pandas is doing string concatenation for the "sum" when calculating the mean; this is plain to see from your example frame. The values '2', '6' and '6' are joined into '266', which is then divided by the row count of 3.


>>> df[df.a == 'B'].c
3    2
4    6
5    6
Name: c, dtype: object
>>> 266 / 3.0
88.66666666666667
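
The arithmetic above can be reproduced without pandas. This is only a minimal sketch of the effect, assuming the column holds the strings '2', '6' and '6'; the variable names are made up for illustration and this is not pandas' actual implementation.

```python
# Values as they appear when the column holds strings rather than integers.
values = ['2', '6', '6']

# "Summing" strings with + concatenates them instead of adding numbers.
total = values[0]
for v in values[1:]:
    total = total + v

# Reading the concatenated text back as a number and dividing by the count
# reproduces the surprising "mean".
wrong_mean = float(total) / len(values)

print(total)
print(wrong_mean)
```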

If you look at the dtypes of your DataFrame, you'll notice that all of them are object, even though no single Series contains mixed types. This is due to the declaration of your numpy array. Arrays are not meant to contain heterogeneous types, so numpy coerces every element to a common type (here, strings), and the DataFrame constructor then stores those strings with dtype object. You can avoid this behavior by passing the constructor a list of lists instead, which can hold differing types with no issues.


import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9], ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]],
    columns=['a', 'b', 'c', 'd']
)

df[df.a == 'B'].c.mean()

4.666666666666667

In [17]: df.dtypes
Out[17]:
a     object
b      int64
c      int64
d    float64
dtype: object
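
To see the difference between the two constructions side by side, here is a small sketch. The data is a shortened, hypothetical version of the frame above (the 'd' column is left out), and the exact dtype names printed may vary between numpy and pandas versions.

```python
import numpy as np
import pandas as pd

rows = [['A', 1, 2], ['B', 3, 2], ['B', 5, 6]]
columns = ['a', 'b', 'c']

# Going through a numpy array first: every element is coerced to one type,
# so the numbers end up as text.
arr = np.array(rows)
from_array = pd.DataFrame(arr, columns=columns)

# Passing the nested list directly: pandas infers a dtype per column.
from_list = pd.DataFrame(rows, columns=columns)

print(arr.dtype)
print(from_array.dtypes)
print(from_list.dtypes)
```

In the list-based frame, 'b' and 'c' are integer columns, so mean() behaves as expected.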

I still can't imagine that this behavior is intended, so I believe it's worth opening an issue report on the pandas development page. In general, though, you shouldn't use object dtype Series for numeric calculations.
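
If you are already stuck with a frame whose numeric column holds strings, one possible fix is to convert the column before aggregating. The sketch below assumes a small made-up frame with string values in 'c' and uses pd.to_numeric for the conversion; it is one option, not the only one.

```python
import pandas as pd

# Hypothetical frame where 'c' holds strings instead of numbers.
df = pd.DataFrame({
    'a': ['A', 'B', 'B', 'B'],
    'c': ['1', '2', '6', '6'],
})

# Convert the column to a numeric dtype before doing any arithmetic.
df['c'] = pd.to_numeric(df['c'])

# Mean of 'c' restricted to the rows where 'a' equals 'B'.
group_mean = df.loc[df['a'] == 'B', 'c'].mean()
print(group_mean)
```

If some entries might not parse as numbers, pd.to_numeric also accepts errors='coerce', which turns them into NaN rather than raising.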
