Is a Pandas DataFrame lookup linear time or constant time?

This is a very interesting question!

I think it depends on the following aspects:

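Accessing a single row by index label when the index is sorted and unique should have a runtime of O(m), where m << n_rows.

Accessing a single row by index label when the index is NOT unique and NOT sorted should have a runtime of O(n_rows).

Accessing a single row by index label when the index is NOT unique but is sorted should have a runtime of O(m), where m < n_rows.

Accessing rows by boolean indexing (independently of the index) should have a runtime of O(n_rows).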
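The three index cases above can be told apart programmatically. The following is a minimal sketch, not part of the original answer: the helper name describe_index and the small example frames are made up for illustration. It relies on Index.get_loc, which is documented to return an int for a unique index, a slice for a monotonic index with duplicates, and a boolean mask otherwise. That return type hints at how much work a label lookup has to do.

```python
import numpy as np
import pandas as pd


def describe_index(df):
    """Return the two index properties the cases above depend on (hypothetical helper)."""
    return {
        "unique": df.index.is_unique,
        "sorted": df.index.is_monotonic_increasing,
    }


values = np.arange(3000, dtype=float).reshape(1000, 3)

# Case 1: default RangeIndex, which is sorted and unique.
unique_sorted = pd.DataFrame(values, columns=list("abc"))

# Case 2: labels 0..99 repeated 10 times, so neither unique nor sorted.
dup_labels = np.tile(np.arange(100), 10)
dup_unsorted = pd.DataFrame(values, columns=list("abc"), index=dup_labels)

# Case 3: the same labels after sorting, so sorted but still not unique.
dup_sorted = dup_unsorted.sort_index()

# get_loc shows what a label lookup resolves to in each case.
loc_unique = unique_sorted.index.get_loc(5)    # a single integer position
loc_sorted = dup_sorted.index.get_loc(5)       # a contiguous slice of positions
loc_unsorted = dup_unsorted.index.get_loc(5)   # a boolean mask over all rows

for name, df in [("unique/sorted", unique_sorted),
                 ("non-unique/unsorted", dup_unsorted),
                 ("non-unique/sorted", dup_sorted)]:
    print(name, describe_index(df))
```

A full-length boolean mask in the unsorted, non-unique case is consistent with the O(n_rows) estimate above, while an integer or a slice means the matching rows were located without touching every row.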

Demo (assuming random, numpy as np, and pandas as pd are already imported):

The index is sorted and unique:

In [49]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'))

In [50]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 27.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 331 µs per loop

In [51]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 275 µs per loop

In [52]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.84 ms per loop

In [53]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.96 ms per loop

The index is NOT sorted and NOT unique:

In [54]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5))

In [55]: %timeit df.loc[random.randint(0, 10**4)]
100 loops, best of 3: 12.3 ms per loop

In [56]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop

In [57]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.78 ms per loop

In [58]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.93 ms per loop

The index is NOT unique but is sorted:

In [64]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5)).sort_index()

In [65]: df.index.is_monotonic_increasing
Out[65]: True

In [66]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 478 µs per loop

In [67]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop

In [68]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.81 ms per loop

In [69]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.95 ms per loop
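For readers without IPython, the .loc comparison across the three index layouts can be approximated with the standard timeit module. This is a rough sketch under assumptions of my own: the helper time_loc, the fixed lookup label, and the repeat count are illustrative choices, not part of the original sessions. Absolute numbers will differ by machine and pandas version, so no expected timings are given.

```python
import timeit

import numpy as np
import pandas as pd

N_ROWS = 10**5
rng = np.random.default_rng(0)
data = rng.random((N_ROWS, 6))
labels = rng.integers(0, 10**4, size=N_ROWS)  # many duplicates, random order

frames = {
    "sorted, unique": pd.DataFrame(data, columns=list("abcdef")),
    "unsorted, non-unique": pd.DataFrame(data, columns=list("abcdef"), index=labels),
}
frames["sorted, non-unique"] = frames["unsorted, non-unique"].sort_index()


def time_loc(df, label, repeats=20):
    """Best observed time in seconds for a single df.loc[label] call (hypothetical helper)."""
    return min(timeit.repeat(lambda: df.loc[label], number=1, repeat=repeats))


# This label is below 10**4, so it exists in all three frames.
label = int(labels[0])

results = {name: time_loc(df, label) for name, df in frames.items()}
for name, seconds in results.items():
    print(f"{name:>22}: {seconds * 1e6:10.1f} us")
```

Unlike the sessions above, this sketch looks up the same label every time, so caching inside pandas may make the fastest runs look better than a random-label benchmark would.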
