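Here is another option: keep only the columns that contain at most a specified number of NaNs:

    max_number_of_nas = 3000
    # Count the NaNs in each column and keep the columns at or below the limit
    df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nas)]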

In the cases I tested, this seems to be slightly faster than the column-dropping method suggested by Jianxun Li:

    import numpy as np
    import pandas as pd

    # Large frame: 10000 rows, roughly half of the values replaced by NaN
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(10000, 5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    max_number_of_nans = 5010

    %timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
    >> 1000 loops, best of 3: 1.76 ms per loop

    %timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
    >> 100 loops, best of 3: 2.04 ms per loop

    # Small frame: 10 rows
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    max_number_of_nans = 5

    %timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
    >> 1000 loops, best of 3: 662 µs per loop

    %timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
    >> 1000 loops, best of 3: 1.08 ms per loop
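
To make the selection easier to check by eye, here is a minimal sketch of the same boolean-mask approach on a tiny hand-built frame. The helper name `keep_columns_with_few_nans` and the `example` data are made up for illustration and are not part of the benchmark above.

```python
import numpy as np
import pandas as pd


def keep_columns_with_few_nans(df, max_nans):
    """Return only the columns of df that contain at most max_nans NaN values."""
    # Hypothetical helper name; the body is the one-liner from this answer.
    nan_counts = df.isnull().sum(axis=0)  # Series: NaN count per column
    return df.loc[:, nan_counts <= max_nans]


# Illustrative data: A has 1 NaN, B has 3 NaNs, C has none.
example = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 4.0],
    "C": [1.0, 2.0, 3.0, 4.0],
})

kept = keep_columns_with_few_nans(example, max_nans=1)
print(list(kept.columns))  # ['A', 'C']
```

Note that the comparison is `<=`, so a column with exactly `max_nans` NaNs is kept, and the original frame is left unmodified because `.loc` returns a new selection.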


