写代码中遇到的一点小问题,如果快速删除连续重复的多行并且类型是可选择的,重复的数量也是可选择的。
比如:
df=pd.DataFrame.from_dict(
{'id':[1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23],
'value'['equal','equal','insert','equal','equal','equal','equal','delete','delete','delete','equal','delete','insert','insert','insert','insert','equal','equal','equal','equal','equal','equal']})
df
Out[3]:
id value
0 1 equal
1 2 equal
2 3 insert
3 4 equal
4 5 equal
5 6 equal
6 7 equal
7 8 delete
8 9 delete
9 10 delete
10 11 equal
11 13 delete
12 14 insert
13 15 insert
14 16 insert
15 17 insert
16 18 equal
17 19 equal
18 20 equal
19 21 equal
20 22 equal
21 23 equal
删除value=equal,并且连续长度大于等于4的行
最简单的方法就是利用遍历,计算除equal连续长度的累加值,进行判断,然后根据下标进行删除。
但是这样要写很多代码进行逻辑判断。
有没有更简单的方法哪?
基于强大的pandas库,我们是可以优雅的实现以上操作。
首先给出代码:然后一行一行解析
df2 = df.groupby((df['value'].shift() != df['value']).cumsum()).filter(lambda x: len(x) >= 4)
df3 = df2[df2['value'].isin(['equal'])]
result = pd.merge(df, df3, how='left', indicator=True).query("_merge=='left_only'").drop('_merge', 1)
第一行
df.groupby((df['value'].shift() != df['value']).cumsum()).filter(lambda x: len(x) >= 4)
df[‘value’].shift()是将每一行的值下移动一行,
可以看到,第一行变为NaN,值依次往下移动一位。
df['value'].shift() Out[4]: 0 NaN 1 equal 2 equal 3 insert 4 equal 5 equal 6 equal 7 equal 8 delete 9 delete 10 delete 11 equal 12 delete 13 insert 14 insert 15 insert 16 insert 17 equal 18 equal 19 equal 20 equal 21 equal Name: value, dtype: object
shift()可参考这个链接
第二步进行bool运算,不相等为True,相等为False
df['value'].shift() != df['value'] Out[5]: 0 True 1 False 2 True 3 True 4 False 5 False 6 False 7 True 8 False 9 False 10 True 11 True 12 True 13 False 14 False 15 False 16 True 17 False 18 False 19 False 20 False 21 False Name: value, dtype: bool
然后cumsum给出一个非降序的id序列,其中每个id表示一个具有相同值的连续块,注意在对布尔值求和时,True被认为是一,而False被认为是零:
(df['value'].shift() != df['value']).cumsum() Out[6]: 0 1 1 1 2 2 3 3 4 3 5 3 6 3 7 4 8 4 9 4 10 5 11 6 12 7 13 7 14 7 15 7 16 8 17 8 18 8 19 8 20 8 21 8 Name: value, dtype: int32
对value这一列进行累加计算
cumsum可参考这个链接
最后进行groupby按照相同长度进行聚合
df_group =df.groupby((df['value'].shift() != df['value']).cumsum()) for i in df_group: ...: print(i) ...: (1, id value 0 1 equal 1 2 equal) (2, id value 2 3 insert) (3, id value 3 4 equal 4 5 equal 5 6 equal 6 7 equal) (4, id value 7 8 delete 8 9 delete 9 10 delete) (5, id value 10 11 equal) (6, id value 11 13 delete) (7, id value 12 14 insert 13 15 insert 14 16 insert 15 17 insert) (8, id value 16 18 equal 17 19 equal 18 20 equal 19 21 equal 20 22 equal 21 23 equal)
最后过滤一下长度大于4的连续的值
df.groupby((df['value'].shift() != df['value']).cumsum()).filter(lambda x: len(x) >= 4)
Out[11]:
id value
3 4 equal
4 5 equal
5 6 equal
6 7 equal
12 14 insert
13 15 insert
14 16 insert
15 17 insert
16 18 equal
17 19 equal
18 20 equal
19 21 equal
20 22 equal
21 23 equal
因为我们只需要equal这个值的内容,删除选择特定列可参考这个链接
df2[df2['value'].isin(['equal'])]
Out[16]:
id value
3 4 equal
4 5 equal
5 6 equal
6 7 equal
16 18 equal
17 19 equal
18 20 equal
19 21 equal
20 22 equal
21 23 equal
最后从大的dataframe中删除小的dataframe即可
result = pd.merge(df, df3, how='left', indicator=True).query("_merge=='left_only'").drop('_merge', 1)
Out[17]:
id value
0 1 equal
1 2 equal
2 3 insert
7 8 delete
8 9 delete
9 10 delete
10 11 equal
11 13 delete
12 14 insert
13 15 insert
14 16 insert
15 17 insert
merge操作可参考:这个链接
官方文档:https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
其中query的操作是新增一列信息,
可参考:https://blog.csdn.net/AlanGuoo/article/details/88874742
参考:
https://www.code05.com/question/detail/45886518.html



