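This should be much faster:

    import pandas as pd

    df = pd.DataFrame({'list1': [["a","b"], ["a","c"], ["a","d"], ["b","c"], ["b","d"], ["c","d"]]*100})
    df2 = pd.DataFrame({'list2': [["a","b","c","d"], ["a","b"], ["b","c"], ["c","d"], ["b","c"]]*100})

    # Convert each list in df2 to a set once, up front
    list2 = df2['list2'].map(set).tolist()
    # For each row of df, count the sets in list2 that contain all of its elements
    df['occurrence'] = df['list1'].apply(set).apply(lambda x: len([i for i in list2 if x.issubset(i)]))

Using your method: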

    %timeit for index, row in df.iterrows(): df.at[index, "occurrence"] = df2["list2"].apply(lambda x: all(i in x for i in row['list1'])).sum()
    1 loop, best of 3: 3.98 s per loop

Using mine:

    %timeit list2 = df2['list2'].map(set).tolist(); df['occurrence'] = df['list1'].apply(set).apply(lambda x: len([i for i in list2 if x.issubset(i)]))
    10 loops, best of 3: 29.7 ms per loop

Note that I have made both lists 100 times longer for these timings.
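
To check that the set-based version returns the same counts as the row-by-row loop, here is a small self-contained comparison on the unrepeated sample data. The helper name `count_supersets` and the variables `list2_sets`, `fast` and `slow` are mine, purely for illustration.

```python
import pandas as pd

# Small, unrepeated version of the sample data
df = pd.DataFrame({'list1': [["a", "b"], ["a", "c"], ["a", "d"],
                             ["b", "c"], ["b", "d"], ["c", "d"]]})
df2 = pd.DataFrame({'list2': [["a", "b", "c", "d"], ["a", "b"],
                              ["b", "c"], ["c", "d"], ["b", "c"]]})

def count_supersets(needle, haystacks):
    """Count how many sets in `haystacks` contain every element of `needle`."""
    needle = set(needle)
    return sum(needle.issubset(h) for h in haystacks)

# Build the sets once, outside the per-row function
list2_sets = [set(x) for x in df2['list2']]
fast = df['list1'].apply(lambda x: count_supersets(x, list2_sets))

# Reference result using the same test as the iterrows loop
slow = df['list1'].apply(
    lambda row: df2['list2'].apply(lambda x: all(i in x for i in row)).sum()
)

assert fast.tolist() == slow.tolist()
print(fast.tolist())
```

Most of the gain comes from converting `df2['list2']` to sets a single time and from avoiding a nested pandas `apply` for every row of `df`.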
EDIT

This seems to be even faster:

    list2 = df2['list2'].sort_values().tolist()
    df['occurrence'] = df['list1'].apply(lambda x: len(list(next(iter(())) if not all(i in list2 for i in x) else i for i in x)))

And the timing:

    %timeit list2 = df2['list2'].sort_values().tolist(); df['occurrence'] = df['list1'].apply(lambda x: len(list(next(iter(())) if not all(i in list2 for i in x) else i for i in x)))
    100 loops, best of 3: 14.8 ms per loop
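
One caveat about the edited one-liner, as far as I can tell: `i in list2` tests whether a single string equals one of the inner lists, which is never true for this data, so the `next(iter(()))` branch is always taken. On Python 3.7+ a `StopIteration` raised inside a generator expression is turned into a `RuntimeError` (PEP 479). On older versions it silently ends the generator and the count comes out as 0. The sketch below is a plain-Python check of that on one row. `edited_count` is a hypothetical wrapper name around the same expression, and the set-based count is shown for comparison.

```python
list2 = [["a", "b", "c", "d"], ["a", "b"], ["b", "c"], ["c", "d"], ["b", "c"]]
x = ["a", "b"]

# A single string is never equal to one of the inner lists
membership = [i in list2 for i in x]

def edited_count(x, list2):
    # Same expression as the edited one-liner, applied to a single row
    return len(list(next(iter(())) if not all(i in list2 for i in x) else i
                    for i in x))

try:
    result = edited_count(x, list2)
except RuntimeError as exc:
    # Python 3.7+: StopIteration inside a generator becomes RuntimeError
    result = repr(exc)

# Set-based count for the same row, for comparison
expected = sum(set(x).issubset(s) for s in map(set, list2))

print(membership, result, expected)
```

If the two values differ on your setup as well, the 14.8 ms figure is not measuring the same computation, and the set-based version is the one to rely on.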



