
Extracting information from a large structured text file



Very good. Here are some suggestions; let me know if you like them:

import pprint
import re
import sys


class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    # Use a dict keyed by attribute names instead of one function per field.
    regexp = {
        ('processo',
         'data',
         'despacho'): re.compile(r'No.([\d]{9})  ([\d]{2}/[\d]{2}/[\d]{4})  (.*)'),
        ('titular',): re.compile(r'Tit.(.*)'),
        ('procurador',): re.compile(r'Procurador: (.*)'),
        ('documento',): re.compile(r'C.N.P.J./C.I.C./N INPI :(.*)'),
        ('apresentacao',
         'natureza'): re.compile(r'Apres.: (.*) ; Nat.: (.*)'),
        ('marca',): re.compile(r'Marca: (.*)'),
        ('classe',): re.compile(r'Clas.Prod/Serv: (.*)'),
        ('complemento',): re.compile(r'\*(.*)'),
    }

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single registry
        """
        self.complemento = []

    def read(self, line):
        for attrs, pattern in Despacho.regexp.items():
            m = pattern.match(line)
            if m:
                for groupn, attr in enumerate(attrs):
                    # special case complemento:
                    if attr == 'complemento':
                        self.complemento.append(m.group(groupn + 1))
                    else:
                        # set the attribute on the object
                        setattr(self, attr, m.group(groupn + 1))

    def __repr__(self):
        # defines object printed representation
        d = {}
        for attrs in self.regexp:
            for attr in attrs:
                d[attr] = getattr(self, attr, None)
        return pprint.pformat(d)


def process(rpi):
    """
    read data and process each group
    """
    # No need for `rpi = (line for line in rpi)`: the for loop already
    # iterates over the lines.
    d = None  # record being built, or None while outside a block
    for line in rpi:
        if line.startswith('No.'):
            if d is not None:  # previous block had no trailing empty line
                yield d
            d = Despacho()
        elif not line.strip():  # empty line - end of block
            if d is not None:
                yield d
                d = None
            continue
        if d is not None:  # ignore lines that come before the first 'No.'
            d.read(line)
    if d is not None:  # last block when the file does not end with an empty line
        yield d


def main():
    with open('rm1972.txt') as arquivo:  # file to process
        for desp in process(arquivo):
            print(desp)  # can print directly here.
            print('-' * 20)
    return 0


if __name__ == '__main__':
    sys.exit(main())
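To see the pattern-table idea in isolation, here is a minimal sketch that applies a shortened table to one in-memory record. The names `PATTERNS`, `parse_record` and `SAMPLE` are hypothetical, and the sample lines are invented to fit the regular expressions; the real file layout may differ.

```python
import re

# Hypothetical, shortened pattern table: tuple of field names -> compiled regex.
# Each capture group in a regex fills the field name at the same position.
PATTERNS = {
    ('processo', 'data', 'despacho'):
        re.compile(r'No\.(\d{9})  (\d{2}/\d{2}/\d{4})  (.*)'),
    ('titular',): re.compile(r'Tit\.(.*)'),
    ('complemento',): re.compile(r'\*(.*)'),
}


def parse_record(lines):
    """Return a dict with the fields extracted from the lines of one record."""
    record = {'complemento': []}
    for line in lines:
        for attrs, pattern in PATTERNS.items():
            m = pattern.match(line)
            if not m:
                continue
            for attr, value in zip(attrs, m.groups()):
                if attr == 'complemento':
                    # the only field that may repeat within a record
                    record['complemento'].append(value)
                else:
                    record[attr] = value
    return record


# Invented sample record, for demonstration only.
SAMPLE = [
    'No.123456789  01/02/2008  Example dispatch text',
    'Tit.EXAMPLE COMPANY LTDA',
    '*first note',
    '*second note',
]

result = parse_record(SAMPLE)
print(result['processo'], result['data'], result['titular'])
print(result['complemento'])
```

Keeping the field names next to their regex means a new line type only needs one more dictionary entry, with no change to the parsing loop.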

