Here is a frame of similar size to yours, but without an object column:
# assumes numpy as np, pandas as pd, and DataFrame, randn, randint are already imported
In [10]: nrows = 10000000
In [11]: df = pd.concat([DataFrame(randn(int(nrows),34),columns=[ 'f%s' % i for i in range(34) ]),
   ....:                 DataFrame(randint(0,10,size=int(nrows*19)).reshape(int(nrows),19),columns=[ 'i%s' % i for i in range(19) ])],
   ....:                axis=1)
In [12]: df.iloc[1000:10000,0:20] = np.nan
In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Data columns (total 53 columns):
f0     9991000  non-null values
f1     9991000  non-null values
f2     9991000  non-null values
f3     9991000  non-null values
f4     9991000  non-null values
f5     9991000  non-null values
f6     9991000  non-null values
f7     9991000  non-null values
f8     9991000  non-null values
f9     9991000  non-null values
f10    9991000  non-null values
f11    9991000  non-null values
f12    9991000  non-null values
f13    9991000  non-null values
f14    9991000  non-null values
f15    9991000  non-null values
f16    9991000  non-null values
f17    9991000  non-null values
f18    9991000  non-null values
f19    9991000  non-null values
f20    10000000  non-null values
f21    10000000  non-null values
f22    10000000  non-null values
f23    10000000  non-null values
f24    10000000  non-null values
f25    10000000  non-null values
f26    10000000  non-null values
f27    10000000  non-null values
f28    10000000  non-null values
f29    10000000  non-null values
f30    10000000  non-null values
f31    10000000  non-null values
f32    10000000  non-null values
f33    10000000  non-null values
i0     10000000  non-null values
i1     10000000  non-null values
i2     10000000  non-null values
i3     10000000  non-null values
i4     10000000  non-null values
i5     10000000  non-null values
i6     10000000  non-null values
i7     10000000  non-null values
i8     10000000  non-null values
i9     10000000  non-null values
i10    10000000  non-null values
i11    10000000  non-null values
i12    10000000  non-null values
i13    10000000  non-null values
i14    10000000  non-null values
i15    10000000  non-null values
i16    10000000  non-null values
i17    10000000  non-null values
i18    10000000  non-null values
dtypes: float64(34), int64(19)
Timings (on a machine with specs similar to yours):
In [14]: %timeit df.mean()
1 loops, best of 3: 21.5 s per loop
You can get a 2x speedup by converting to floats beforehand (mean does this conversion too, but in a more general way, so it is slower):
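In [15]: %timeit df.astype('float64').mean()
1 loops, best of 3: 9.45 s per loop
Your problem is the object column. mean will try to compute a result for all of the columns, but because of the object column everything is upcast to object dtype, which is not efficient for computation.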
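The upcasting can be seen on a tiny frame. This is only a sketch: the column names `f0`, `i0`, `s0` are made up for illustration, and it shows what happens when a mixed frame is turned into a single array, not necessarily the exact code path that mean takes internally.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the column names are hypothetical.
numeric = pd.DataFrame({"f0": [1.0, 2.0, np.nan], "i0": [1, 2, 3]})
# Same frame with one extra object column of strings.
mixed = numeric.assign(s0=pd.Series(["a", "b", "c"], dtype=object))

# Float and int columns share a common numeric dtype.
numeric_dtype = numeric.to_numpy().dtype
# One object column forces the whole array to object dtype.
mixed_dtype = mixed.to_numpy().dtype

print(numeric_dtype, mixed_dtype)
```

Arithmetic on an object array works element by element on Python objects, which is why it is so much slower than on a float64 block.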
Your best bet is to do
df._get_numeric_data().mean()
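As a rough sketch of the same idea on a small frame (the column names are hypothetical), the private `_get_numeric_data()` helper can be compared with the public `select_dtypes`, which is a swapped-in alternative and not what the answer itself uses:

```python
import numpy as np
import pandas as pd

# Tiny frame with two numeric columns and one object column.
df = pd.DataFrame({
    "f0": [1.0, 2.0, np.nan],
    "i0": [1, 2, 3],
    "s0": pd.Series(["a", "b", "c"], dtype=object),
})

# Private helper, as used in the answer: drops non-numeric columns first.
private_mean = df._get_numeric_data().mean()

# Public alternative: select the numeric columns by dtype.
public_mean = df.select_dtypes(include="number").mean()

print(public_mean)
```

Either way the object column is removed before the reduction, so the remaining data stays in numeric blocks.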
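There is a numeric_only option that does this at the lower level, but for some reason we don't directly support it via the top-level functions (e.g. mean). I think I will create an issue to add this parameter. However, it will probably be False by default (i.e. not excluding).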
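For reference, later pandas releases do accept `numeric_only` on `DataFrame.mean`. A minimal sketch, assuming such a release is installed and using made-up column names:

```python
import numpy as np
import pandas as pd

# Tiny frame with two numeric columns and one object column.
df = pd.DataFrame({
    "f0": [1.0, 2.0, np.nan],
    "i0": [1, 2, 3],
    "s0": pd.Series(["a", "b", "c"], dtype=object),
})

# Restrict the reduction to numeric columns; the object column is skipped.
result = df.mean(numeric_only=True)

print(result)
```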



