That's a very interesting question!
I think it depends on the following aspects:
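Accessing a single row by index label (index is sorted and unique) should have runtime O(m), where m << n_rows
Accessing a single row by index label (index is NOT unique and NOT sorted) should have runtime O(n_rows)
Accessing a single row by index label (index is NOT unique but is sorted) should have runtime O(m), where m < n_rows
Accessing row(s) by boolean indexing (independently of the index) should have runtime O(n_rows)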
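To see which of the cases above a given DataFrame falls into, you can inspect two properties of its index, `is_unique` and `is_monotonic_increasing`. The sketch below is only an illustration: `lookup_case` is a hypothetical helper name, not part of pandas, and the strings it returns simply mirror the cases listed above (a unique but unsorted index is not covered by that list, so it is reported separately).

```python
import pandas as pd


def lookup_case(index: pd.Index) -> str:
    """Hypothetical helper: name the label-lookup case an index falls into."""
    is_sorted = index.is_monotonic_increasing
    if index.is_unique:
        # The list above only discusses the sorted flavour of a unique index.
        return "sorted and unique" if is_sorted else "unique, unsorted (not covered above)"
    return "not unique, sorted" if is_sorted else "not unique, unsorted"


print(lookup_case(pd.RangeIndex(10)))        # default index of a new DataFrame
print(lookup_case(pd.Index([3, 1, 3, 2])))   # duplicates, arbitrary order
print(lookup_case(pd.Index([1, 1, 2, 3])))   # duplicates, after sort_index()
```

Checking these flags before benchmarking is a cheap way to tell whether a slow `.loc` lookup is likely caused by the index layout rather than by the data itself.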
Demo:
Index is sorted and unique:
    In [49]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'))

    In [50]: %timeit df.loc[random.randint(0, 10**4)]
    The slowest run took 27.65 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 331 µs per loop

    In [51]: %timeit df.iloc[random.randint(0, 10**4)]
    1000 loops, best of 3: 275 µs per loop

    In [52]: %timeit df.query("a > 0.9")
    100 loops, best of 3: 7.84 ms per loop

    In [53]: %timeit df.loc[df.a > 0.9]
    100 loops, best of 3: 2.96 ms per loop

Index is NOT sorted and NOT unique:
    In [54]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5))

    In [55]: %timeit df.loc[random.randint(0, 10**4)]
    100 loops, best of 3: 12.3 ms per loop

    In [56]: %timeit df.iloc[random.randint(0, 10**4)]
    1000 loops, best of 3: 262 µs per loop

    In [57]: %timeit df.query("a > 0.9")
    100 loops, best of 3: 7.78 ms per loop

    In [58]: %timeit df.loc[df.a > 0.9]
    100 loops, best of 3: 2.93 ms per loop

Index is NOT unique but is sorted:
    In [64]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5)).sort_index()

    In [65]: df.index.is_monotonic_increasing
    Out[65]: True

    In [66]: %timeit df.loc[random.randint(0, 10**4)]
    The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 478 µs per loop

    In [67]: %timeit df.iloc[random.randint(0, 10**4)]
    1000 loops, best of 3: 262 µs per loop

    In [68]: %timeit df.query("a > 0.9")
    100 loops, best of 3: 7.81 ms per loop

    In [69]: %timeit df.loc[df.a > 0.9]
    100 loops, best of 3: 2.95 ms per loop
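If you want to repeat the comparison outside IPython, something along these lines might work as a plain script. It is a rough sketch, not the exact benchmark above: `make_frames`, `per_call` and `time_lookups` are hypothetical helper names, `%timeit` is replaced by a simple `time.perf_counter` loop, `df.query` is left out, and the lookup keys are drawn from the index itself ahead of time so that a missing label cannot raise `KeyError` and key generation is not included in the timing. Absolute numbers will depend on your machine and pandas version.

```python
import random
import time

import numpy as np
import pandas as pd


def make_frames(n_rows=10**5, n_labels=10**4, seed=0):
    """Build the three index layouts from the demo (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_rows, 6))
    cols = list("abcdef")
    labels = rng.integers(0, n_labels, size=n_rows)  # many duplicate labels
    unsorted = pd.DataFrame(data, columns=cols, index=labels)
    return {
        "sorted, unique": pd.DataFrame(data, columns=cols),
        "unsorted, not unique": unsorted,
        "sorted, not unique": unsorted.sort_index(),
    }


def per_call(func, args):
    """Average wall-clock seconds per call of func over args."""
    start = time.perf_counter()
    for arg in args:
        func(arg)
    return (time.perf_counter() - start) / len(args)


def time_lookups(df, number=50, seed=0):
    """Time label, positional and boolean-mask access on one DataFrame."""
    rnd = random.Random(seed)
    labels = df.index.to_numpy()
    # Pre-draw keys that are guaranteed to exist in the index.
    keys = [labels[rnd.randrange(len(labels))] for _ in range(number)]
    positions = [rnd.randrange(len(df)) for _ in range(number)]
    return {
        "loc[label]": per_call(lambda k: df.loc[k], keys),
        "iloc[pos]": per_call(lambda i: df.iloc[i], positions),
        "loc[mask]": per_call(lambda _: df.loc[df["a"] > 0.9], range(number)),
    }


if __name__ == "__main__":
    for name, frame in make_frames().items():
        timings = time_lookups(frame)
        summary = ", ".join(f"{k}: {v * 1e6:.0f} µs" for k, v in timings.items())
        print(f"{name:>22} -> {summary}")
```

Based on the timings above, one would expect `loc[label]` to be the only column that changes noticeably between the three layouts, while `iloc[pos]` and `loc[mask]` stay roughly constant.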


