How to parse complex text files using Python?

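Update 2019 (PEG parser):

This answer has received quite a bit of attention, so I felt I should add another possibility, namely a parsing option. Here we can use a PEG parser (e.g. parsimonious) in combination with a NodeVisitor class: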

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import pandas as pd

grammar = Grammar(
    r"""
    schools         = (school_block / ws)+

    school_block    = school_header ws grade_block+
    grade_block     = grade_header ws name_header ws (number_name)+ ws score_header ws (number_score)+ ws?

    school_header   = ~"^School = (.*)"m
    grade_header    = ~"^Grade = (\d+)"m

    name_header     = "Student number, Name"
    score_header    = "Student number, Score"

    number_name     = index comma name ws
    number_score    = index comma score ws

    comma           = ws? "," ws?

    index           = number+
    score           = number+

    number          = ~"\d+"
    name            = ~"[A-Z]\w+"
    ws              = ~"\s*"
    """
)

# `data` is the raw input text (a str), defined elsewhere
tree = grammar.parse(data)

class SchoolVisitor(NodeVisitor):
    output, names = ([], [])
    current_school, current_grade = None, None

    def _getName(self, idx):
        # look up the name recorded for a student number in the current grade
        for index, name in self.names:
            if index == idx:
                return name

    def generic_visit(self, node, visited_children):
        return node.text or visited_children

    def visit_school_header(self, node, children):
        self.current_school = node.match.group(1)

    def visit_grade_header(self, node, children):
        self.current_grade = node.match.group(1)
        self.names = []

    def visit_number_name(self, node, children):
        index, name = None, None
        for child in node.children:
            if child.expr.name == 'name':
                name = child.text
            elif child.expr.name == 'index':
                index = child.text

        self.names.append((index, name))

    def visit_number_score(self, node, children):
        index, score = None, None
        for child in node.children:
            if child.expr.name == 'index':
                index = child.text
            elif child.expr.name == 'score':
                score = child.text

        name = self._getName(index)

        # build the entire entry
        entry = (self.current_school, self.current_grade, index, name, score)
        self.output.append(entry)

sv = SchoolVisitor()
sv.visit(tree)

df = pd.DataFrame.from_records(sv.output, columns=['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)

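Regex option (original answer)

Well, I am watching The Lord of the Rings for the tenth time, so I have some time to bridge until the finale:

Broken down, the idea is to split the problem into several smaller ones:

Separate each school
…each grade
…the students and their scores
…then bind them together in a dataframe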

The school part (see a demo on regex101.com)

^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)
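To make the school pattern concrete, here is a minimal sketch that runs it over a small, made-up input. The `sample` text and its layout are assumptions inferred from the patterns in this answer, not the asker's actual file.

```python
import re

# Hypothetical sample input; the layout is assumed from the patterns in this answer.
sample = """School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny

Student number, Score
0, 8
"""

rx_school = re.compile(
    r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)',
    re.MULTILINE,
)

# Each match carries the school name and everything up to the next "School" line
# (or the end of the string).
blocks = [(m.group('school_name'), m.group('school_content'))
          for m in rx_school.finditer(sample)]

school_names = [name for name, _ in blocks]
print(school_names)
```

The lazy `[\s\S]+?` combined with the lookahead `(?=^School|\Z)` is what stops each block right before the next school header.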

The grade part (another demo on regex101.com)

^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)

The student/score part (last demo on regex101.com):

^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)
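As a rough illustration of the student/score pattern, the sketch below applies it to the content of a single grade and pairs each name line with its score line. The `grade_block` text is a made-up example, and `str.splitlines()` is used here for brevity instead of slicing off the trailing newline and splitting.

```python
import re

# Hypothetical content of a single grade block (assumed layout).
grade_block = """Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7
"""

rx_student_score = re.compile(
    r'^Student number, Name[\n\r]'
    r'(?P<student_names>(?:^\d+.+[\n\r])+)'
    r'\s*^Student number, Score[\n\r]'
    r'(?P<student_scores>(?:^\d+.+[\n\r])+)',
    re.MULTILINE,
)

m = rx_student_score.search(grade_block)
name_lines = m.group('student_names').splitlines()
score_lines = m.group('student_scores').splitlines()

# Pair the n-th name line with the n-th score line.
pairs = list(zip(name_lines, score_lines))
print(pairs)
```

Note that this pairs lines by position, so it assumes both lists appear in the same student order.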

The rest is a generator expression which is then fed into the DataFrame constructor (along with the column names).
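The generator expression relies on a small idiom that may not be obvious: a `for x in [value]` clause iterates over a one-element list and therefore acts as a local variable binding inside the expression. A minimal sketch, using made-up `pairs` of name and score lines:

```python
# Hypothetical input: (name line, score line) pairs for one grade.
pairs = [("0, Phoebe", "0, 3"), ("1, Rachel", "1, 7")]

rows = (
    (number, name, score)
    for name_line, score_line in pairs
    # Each single-element list below binds one name for use in the tuple above.
    for number in [name_line.split(", ")[0]]
    for name in [name_line.split(", ")[1]]
    for score in [score_line.split(", ")[1]]
)

result = list(rows)
print(result)
```

Because `rows` is a generator, nothing is split until it is consumed, which is why it can be handed straight to a constructor that iterates over it.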

The code:

import pandas as pd, re

rx_school = re.compile(r'''
    ^
    School\s*=\s*(?P<school_name>.+)
    (?P<school_content>[\s\S]+?)
    (?=^School|\Z)
''', re.MULTILINE | re.VERBOSE)

rx_grade = re.compile(r'''
    ^
    Grade\s*=\s*(?P<grade>.+)
    (?P<students>[\s\S]+?)
    (?=^Grade|\Z)
''', re.MULTILINE | re.VERBOSE)

# in VERBOSE mode literal spaces must be escaped, otherwise they are ignored
rx_student_score = re.compile(r'''
    ^
    Student\ number,\ Name[\n\r]
    (?P<student_names>(?:^\d+.+[\n\r])+)
    \s*
    ^
    Student\ number,\ Score[\n\r]
    (?P<student_scores>(?:^\d+.+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)

# `string` is the raw input text (a str), defined elsewhere
result = ((school.group('school_name'), grade.group('grade'), student_number, name, score)
    for school in rx_school.finditer(string)
    for grade in rx_grade.finditer(school.group('school_content'))
    for student_score in rx_student_score.finditer(grade.group('students'))
    for student in zip(student_score.group('student_names')[:-1].split("\n"), student_score.group('student_scores')[:-1].split("\n"))
    for student_number in [student[0].split(", ")[0]]
    for name in [student[0].split(", ")[1]]
    for score in [student[1].split(", ")[1]]
)

df = pd.DataFrame(result, columns=['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)

Condensed:

rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)

This yields

            School Grade Student number      Name Score
0   Riverdale High     1              0    Phoebe     3
1   Riverdale High     1              1    Rachel     7
2   Riverdale High     2              0    Angela     6
3   Riverdale High     2              1   Tristan     3
4   Riverdale High     2              2    Aurora     9
5         Hogwarts     1              0     Ginny     8
6         Hogwarts     1              1      Luna     7
7         Hogwarts     2              0     Harry     5
8         Hogwarts     2              1  Hermione    10
9         Hogwarts     3              0      Fred     0
10        Hogwarts     3              1    George     0

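As for timing, this is the result of running it ten thousand times:

import timeit
# makedf: a function wrapping the regex code above (its definition is not shown here)
print(timeit.timeit(makedf, number=10**4))
# 11.918397722000009 s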
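For readers who want to reproduce a measurement like this, here is a minimal sketch of how a parsing step can be wrapped in a callable and passed to `timeit.timeit`. The function `parse_grades` and the `sample` text are hypothetical stand-ins, not the `makedf` function timed above, so the numbers will not be comparable.

```python
import re
import timeit

# Hypothetical, tiny input just to have something to parse.
sample = "Grade = 1\nGrade = 2\n"
rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)', re.MULTILINE)

def parse_grades():
    # Stand-in for the real work; the function timed in the answer is not shown.
    return [m.group('grade') for m in rx_grade.finditer(sample)]

# timeit.timeit accepts a zero-argument callable and returns total seconds.
elapsed = timeit.timeit(parse_grades, number=1000)
print(f"{elapsed:.4f} s for 1000 runs")
```

Compiling the patterns once outside the timed function, as here, keeps the measurement focused on matching rather than on regex compilation.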
