A faster alternative to the pandas `isin` function

EDIT 2: Here is a link to a more recent look at the performance of various pandas operations, though it does not seem to cover merge and join so far.

https://github.com/mm-mansour/Fast-Pandas

EDIT 1: These benchmarks were run against an older version of pandas and are likely no longer relevant. See Mike's comment below on `merge`.

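It depends on the size of your data, but for large datasets `DataFrame.join` seems to be the way to go. This requires your DataFrame index to be your "ID", and the Series or DataFrame you are joining against to have an index that is your "ID_list". The Series must also have a `name` to be used with `join`; it gets pulled in as a new field named after that `name`. You also need to specify an inner join to get something like `isin`, because `join` defaults to a left join. The query `in` syntax seems to have the same speed characteristics as `isin` for large datasets.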
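As a minimal sketch of the approach described above, the snippet below filters a tiny frame three ways: an inner `join` against a named Series, `isin`, and the query `in` syntax. The data and the names `id_list` and `lookup` are made up for illustration and are not part of the original benchmark; the sketch only shows that the three forms select the same rows and says nothing about their speed.

```python
import pandas as pd

# Hypothetical example data: the frame's index holds the IDs.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[1, 2, 3, 4])

# Hypothetical IDs to keep. The Series is indexed by those IDs and
# has a name, which join requires for a Series.
id_list = [2, 4, 5]
lookup = pd.Series(id_list, index=id_list, name="ID_list")

# Inner join on the index keeps only the IDs present in both objects.
# The Series comes along as an extra column called "ID_list".
joined = df.join(lookup, how="inner")

# The same row filter expressed with isin and with the query "in" syntax.
via_isin = df[df.index.isin(id_list)]
via_query = df.query("index in @id_list")

print(joined)
print(via_isin)
print(via_query)
```

Note that the joined result carries the extra `ID_list` column, so drop it if you only want the filtered rows.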

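If you are working with small datasets, you get different behavior: a list comprehension, or `apply` against a dictionary, is actually faster than `isin`.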
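For small frames, the alternatives mentioned above can be sketched as follows. The frame and the names `id_list` and `lookup` are hypothetical; the point is only that a dictionary membership test in a list comprehension or in `apply` yields the same rows as `isin`. Whether it is actually faster depends on your pandas version and data size, so time it on your own data.

```python
import pandas as pd

# Hypothetical small frame and lookup values.
df = pd.DataFrame({"ID": [1, 2, 3, 4]})
id_list = [2, 4, 99]

# A dict gives constant-time membership tests on its keys.
lookup = dict.fromkeys(id_list)

# Build a plain Python list of booleans and use it as a row mask.
mask = [x in lookup for x in df["ID"]]
via_comprehension = df[mask]

# The same filter using apply with a membership test.
via_apply = df[df["ID"].apply(lambda x: x in lookup)]

# isin, for comparison.
via_isin = df[df["ID"].isin(id_list)]

print(via_comprehension)
```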

Otherwise, you can try to get more speed with Cython.

    import pandas as pd

    # I'm ignoring that the index is defaulting to a sequential number. You
    # would need to explicitly assign your IDs to the index here, e.g.:
    # >>> l_series.index = ID_list
    mil = range(1000000)
    l = mil
    l_series = pd.Series(l)

    df = pd.DataFrame(l_series, columns=['ID'])

    In [247]: %timeit df[df.index.isin(l)]
    1 loops, best of 3: 1.12 s per loop

    In [248]: %timeit df[df.index.isin(l_series)]
    1 loops, best of 3: 549 ms per loop

    # index vs column doesn't make a difference here
    In [304]: %timeit df[df.ID.isin(l_series)]
    1 loops, best of 3: 541 ms per loop

    In [305]: %timeit df[df.index.isin(l_series)]
    1 loops, best of 3: 529 ms per loop

    # query 'in' syntax has the same performance as 'isin'
    In [249]: %timeit df.query('index in @l')
    1 loops, best of 3: 1.14 s per loop

    In [250]: %timeit df.query('index in @l_series')
    1 loops, best of 3: 564 ms per loop

    # ID must be the index for DataFrame.join and l_series must have a name.
    # join defaults to a left join so we need to specify inner for existence.
    l_series.name = 'ID_list'  # join raises an error for an unnamed Series

    In [251]: %timeit df.join(l_series, how='inner')
    10 loops, best of 3: 93.3 ms per loop

    # Smaller datasets.
    df = pd.DataFrame([1,2,3,4], columns=['ID'])
    l = range(10000)
    l_dict = dict(zip(l, l))
    l_series = pd.Series(l)
    l_series.name = 'ID_list'

    In [363]: %timeit df.join(l_series, how='inner')
    1000 loops, best of 3: 733 µs per loop

    In [291]: %timeit df[df.ID.isin(l_dict)]
    1000 loops, best of 3: 742 µs per loop

    In [292]: %timeit df[df.ID.isin(l)]
    1000 loops, best of 3: 771 µs per loop

    In [294]: %timeit df[df.ID.isin(l_series)]
    100 loops, best of 3: 2 ms per loop

    # It's actually faster to use apply or a list comprehension for these small cases.
    In [296]: %timeit df[[x in l_dict for x in df.ID]]
    1000 loops, best of 3: 203 µs per loop

    In [299]: %timeit df[df.ID.apply(lambda x: x in l_dict)]
    1000 loops, best of 3: 297 µs per loop
