
How do I parse a complex text file with Python?


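Update 2019 (PEG parser):

This answer has received quite a bit of attention, so I felt I should add another possibility: a parsing option. Here we can use a PEG parser (e.g. parsimonious) in combination with a NodeVisitor class: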

    from parsimonious.grammar import Grammar
    from parsimonious.nodes import NodeVisitor
    import pandas as pd

    grammar = Grammar(
        r"""
        schools         = (school_block / ws)+

        school_block    = school_header ws grade_block+
        grade_block     = grade_header ws name_header ws (number_name)+ ws score_header ws (number_score)+ ws?

        school_header   = ~"^School = (.*)"m
        grade_header    = ~"^Grade = (\d+)"m

        name_header     = "Student number, Name"
        score_header    = "Student number, Score"

        number_name     = index comma name ws
        number_score    = index comma score ws

        comma           = ws? "," ws?

        index           = number+
        score           = number+

        number          = ~"\d+"
        name            = ~"[A-Z]\w+"
        ws              = ~"\s*"
        """
    )

    # `data` holds the raw input text (not shown in this answer)
    tree = grammar.parse(data)

    class SchoolVisitor(NodeVisitor):
        output, names = ([], [])
        current_school, current_grade = None, None

        def _getName(self, idx):
            # look up the student name recorded for this index in the current grade
            for index, name in self.names:
                if index == idx:
                    return name

        def generic_visit(self, node, visited_children):
            return node.text or visited_children

        def visit_school_header(self, node, children):
            self.current_school = node.match.group(1)

        def visit_grade_header(self, node, children):
            self.current_grade = node.match.group(1)
            self.names = []

        def visit_number_name(self, node, children):
            index, name = None, None
            for child in node.children:
                if child.expr.name == 'name':
                    name = child.text
                elif child.expr.name == 'index':
                    index = child.text

            self.names.append((index, name))

        def visit_number_score(self, node, children):
            index, score = None, None
            for child in node.children:
                if child.expr.name == 'index':
                    index = child.text
                elif child.expr.name == 'score':
                    score = child.text

            name = self._getName(index)

            # build the entire entry
            entry = (self.current_school, self.current_grade, index, name, score)
            self.output.append(entry)

    sv = SchoolVisitor()
    sv.visit(tree)

    df = pd.DataFrame.from_records(sv.output, columns=['School', 'Grade', 'Student number', 'Name', 'Score'])
    print(df)

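Regex option (original answer)

Well, while watching The Lord of the Rings for the umpteenth time, I had some time to kill before the finale:

Broken down, the idea is to split the problem into several smaller problems:

  1. Separate each school
  2. ...each grade
  3. ...the students and scores
  4. ...then bind them together in a dataframe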

The school part (see a demo on regex101.com):

    ^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)
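As a quick check of the school pattern, here is a minimal sketch using only the standard library. The sample text is an assumption: its layout is inferred from the patterns in this answer, since the original input is not reproduced here.

```python
import re

# Assumed sample input; layout inferred from the patterns in this answer.
sample = """School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7
"""

# Lazily capture everything up to the next line starting with "School"
# or, for the last block, up to the end of the string.
rx_school = re.compile(
    r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)',
    re.MULTILINE,
)

school_names = [m.group('school_name') for m in rx_school.finditer(sample)]
print(school_names)
```

The lazy `[\s\S]+?` together with the lookahead is what stops each match at the start of the next school block instead of swallowing the rest of the file.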

The grade part (another demo on regex101.com):

    ^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)
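The grade pattern works the same way, but on the content of a single school. A small sketch, where `school_content` is an assumed stand-in for one captured school block:

```python
import re

# Assumed content of one school block (everything after its "School = ..." line).
school_content = """
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

Grade = 2
Student number, Name
0, Angela

Student number, Score
0, 6
"""

rx_grade = re.compile(
    r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)',
    re.MULTILINE,
)

grade_matches = list(rx_grade.finditer(school_content))
grades = [m.group('grade') for m in grade_matches]
print(grades)
```

Each match also carries a `students` group holding the name and score tables for that grade only, which is the input for the next step.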

The student/score part (a last demo on regex101.com):

    ^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)
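To see what the two named groups capture, here is a short sketch on an assumed `students` block for a single grade. It uses `splitlines()` for brevity, which is a small deviation from the `[:-1].split("\n")` used in the answer's code:

```python
import re

# Assumed content of one grade block.
students = """
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7
"""

rx_student_score = re.compile(
    r'^Student number, Name[\n\r]'
    r'(?P<student_names>(?:^\d+.+[\n\r])+)'
    r'\s*^Student number, Score[\n\r]'
    r'(?P<student_scores>(?:^\d+.+[\n\r])+)',
    re.MULTILINE,
)

m = rx_student_score.search(students)
name_lines = m.group('student_names').splitlines()
score_lines = m.group('student_scores').splitlines()
print(name_lines)
print(score_lines)
```

Both groups come back as lists of raw `"number, value"` lines in the same order, so they can be paired up with `zip`.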

The rest is a generator expression, which is then fed into the DataFrame constructor (along with the column names).
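The generator expression relies on a small idiom: `for x in [expr]` iterates over a one-element list and so binds `x` to a computed value inside the expression. A reduced sketch of that pairing step, with the school and grade hard-coded and the input lines assumed:

```python
# Assumed lines, as captured for one grade by the student/score pattern.
name_lines = ['0, Phoebe', '1, Rachel']
score_lines = ['0, 3', '1, 7']

rows = (
    ('Riverdale High', '1', number, name, score)
    for name_line, score_line in zip(name_lines, score_lines)
    # one-element lists act as local assignments inside the generator
    for number, name in [name_line.split(', ')]
    for score in [score_line.split(', ')[1]]
)

records = list(rows)
print(records)
```

Tuples of this shape, one per student, are the records that the code in this answer hands to pandas.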


The code:

    import re
    import pandas as pd

    # In VERBOSE mode unescaped whitespace in the pattern is ignored,
    # so literal spaces must be written as "\ ".
    rx_school = re.compile(r'''
        ^
        School\s*=\s*(?P<school_name>.+)
        (?P<school_content>[\s\S]+?)
        (?=^School|\Z)
    ''', re.MULTILINE | re.VERBOSE)

    rx_grade = re.compile(r'''
        ^
        Grade\s*=\s*(?P<grade>.+)
        (?P<students>[\s\S]+?)
        (?=^Grade|\Z)
    ''', re.MULTILINE | re.VERBOSE)

    rx_student_score = re.compile(r'''
        ^
        Student\ number,\ Name[\n\r]
        (?P<student_names>(?:^\d+.+[\n\r])+)
        \s*
        ^
        Student\ number,\ Score[\n\r]
        (?P<student_scores>(?:^\d+.+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)

    # `string` holds the raw input text (not shown in this answer)
    result = ((school.group('school_name'), grade.group('grade'), student_number, name, score)
        for school in rx_school.finditer(string)
        for grade in rx_grade.finditer(school.group('school_content'))
        for student_score in rx_student_score.finditer(grade.group('students'))
        for student in zip(student_score.group('student_names')[:-1].split("\n"),
                           student_score.group('student_scores')[:-1].split("\n"))
        for student_number in [student[0].split(", ")[0]]
        for name in [student[0].split(", ")[1]]
        for score in [student[1].split(", ")[1]]
    )

    df = pd.DataFrame(result, columns=['School', 'Grade', 'Student number', 'Name', 'Score'])
    print(df)

Condensed:

    rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
    rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
    rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)

This yields

                School Grade Student number      Name Score
    0   Riverdale High     1              0    Phoebe     3
    1   Riverdale High     1              1    Rachel     7
    2   Riverdale High     2              0    Angela     6
    3   Riverdale High     2              1   Tristan     3
    4   Riverdale High     2              2    Aurora     9
    5         Hogwarts     1              0     Ginny     8
    6         Hogwarts     1              1      Luna     7
    7         Hogwarts     2              0     Harry     5
    8         Hogwarts     2              1  Hermione    10
    9         Hogwarts     3              0      Fred     0
    10        Hogwarts     3              1    George     0

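As for timing, this is the result of running it ten thousand times:

    import timeit

    # makedf is not defined in this answer; presumably a function wrapping the code above
    print(timeit.timeit(makedf, number=10**4))
    # 11.918397722000009 s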
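For readers who want to reproduce a measurement of this kind, here is a minimal sketch of how a callable can be passed to `timeit.timeit`. The function `make_rows` and its tiny input are hypothetical stand-ins, not the `makedf` that was timed above, so the numbers will not be comparable:

```python
import re
import timeit

rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)', re.MULTILINE)
text = "Grade = 1\nGrade = 2\n"  # assumed toy input


def make_rows():
    """Hypothetical stand-in for the function being timed."""
    return [m.group('grade') for m in rx_grade.finditer(text)]


# timeit calls the function `number` times and returns the total seconds
elapsed = timeit.timeit(make_rows, number=1000)
print(f"{elapsed:.4f} s for 1000 runs")
```

Compiling the patterns once outside the timed function, as the answer does, keeps compilation cost out of the measurement.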


