Deduplicating a Text File and Removing Stop Words in Python

Problem description

test.txt

你好吗
我很好
今天怎么样
今天怎么样
今天怎么样
今天怎么样
今天怎么样
今天怎么样
今天怎么样
今天怎么样
首先
高兴
是不是
说说

stopword.txt

首先
高兴
是不是
说说

Deduplicate the lines of test.txt and remove the stop words listed in stopword.txt.

Solution

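Read the file through a generator, one line at a time, instead of loading it with a single read call, so that a very large file is less likely to exhaust memory. The unique lines themselves still have to fit in memory for deduplication.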

Code

def file_unique(filename, savefile, stopword='', encoding='utf-8'):
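    '''Deduplicate the lines of a text file and remove stop words.

    :param filename: text file to process
    :param savefile: path of the output file
    :param stopword: text file of stop words, one per line (optional)
    :param encoding: encoding used for every file
    :return: number of lines left after processing
    '''

    def read(filename, encoding='utf-8'):
        '''Generator that yields the lines of a text file.'''
        with open(filename, encoding=encoding) as f:
            for line in f:
                yield line.strip()  # strip surrounding whitespace and the newline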

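    # dict.fromkeys removes duplicates while keeping first-seen order (Python 3.7+)
    file = dict.fromkeys(read(filename, encoding))
    # Use an empty set when no stop word file is given
    stopword = set(read(stopword, encoding)) if stopword else set()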
    newfile = []
    for i in file:
        if i not in stopword:
            newfile.append(i)
    with open(savefile, mode='w', encoding=encoding) as f:
        for i in newfile:
            f.write(i + '\n')
    return len(newfile)


if __name__ == '__main__':
    print(file_unique(filename='test.txt', savefile='out1.txt', encoding='utf-8'))
    print(file_unique(filename='test.txt', savefile='out2.txt', stopword='stopword.txt', encoding='utf-8'))
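
The generator idea can be taken one step further. The following is a minimal sketch, not the code above: stream_unique is a hypothetical name, and it assumes that the set of unique lines fits in memory even when the whole file does not. Each line is written as soon as it is first seen, so no intermediate list is built and the input order is kept.

```python
def stream_unique(filename, savefile, stopword='', encoding='utf-8'):
    '''Hypothetical streaming variant: write each new line as soon as it is read.'''
    stopwords = set()
    if stopword:
        with open(stopword, encoding=encoding) as f:
            stopwords = {line.strip() for line in f}

    seen = set()  # lines already written to savefile
    count = 0
    with open(filename, encoding=encoding) as src, \
            open(savefile, mode='w', encoding=encoding) as dst:
        for line in src:
            line = line.strip()
            if line in seen or line in stopwords:
                continue  # duplicate or stop word: skip it
            seen.add(line)
            dst.write(line + '\n')
            count += 1
    return count
```

It takes the same arguments as file_unique, so stream_unique('test.txt', 'out2.txt', stopword='stopword.txt') could be used in its place.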

Results

out1.txt

你好吗
我很好
今天怎么样
首先
高兴
是不是
说说

out2.txt

你好吗
我很好
今天怎么样

Sorting by pinyin

from itertools import chain
from pypinyin import pinyin, Style


def to_pinyin(s):
    '''Convert Chinese text to pinyin.

    :param s: a string or a list of strings
    :type s: str or list
    :return: the pinyin as a single string
    >>> to_pinyin('你好吗')
    'ni3hao3ma'
    >>> to_pinyin(['你好', '吗'])
    'ni3hao3ma'
    '''
    return ''.join(chain.from_iterable(pinyin(s, style=Style.TONE3)))

# Inside file_unique, just before the output file is written:
newfile = sorted(newfile, key=to_pinyin)  # sort by pinyin

Possible improvements

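Instead of building a separate newfile list, delete the stop words from file itself (for example with del), which avoids keeping a second copy of the lines.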
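
A minimal sketch of that idea follows. It assumes file is an insertion-ordered dict built with dict.fromkeys rather than a plain set, and drop_stopwords_in_place is a hypothetical helper name. pop with a default is used instead of del because del raises KeyError when a stop word does not occur in the file.

```python
def drop_stopwords_in_place(file, stopword):
    '''Remove every stop word from `file` without building a second list.'''
    for word in stopword:
        file.pop(word, None)  # no error if the word is absent
    return file


# Assumed setup: deduplicate with dict.fromkeys, which keeps first-seen order
file = dict.fromkeys(['你好吗', '我很好', '今天怎么样', '今天怎么样', '首先', '高兴'])
drop_stopwords_in_place(file, {'首先', '高兴', '说说'})
print(list(file))  # ['你好吗', '我很好', '今天怎么样']
```

If file is kept as a set instead, file.difference_update(stopword) removes the stop words in place in the same way, although a set does not preserve line order.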
