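Update 2019 (PEG parser):
This answer has received quite a bit of attention, so I felt I should add another possibility, namely a parsing option. Here we can use a PEG parser (e.g. parsimonious) in conjunction with a NodeVisitor class: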
    from parsimonious.grammar import Grammar
    from parsimonious.nodes import NodeVisitor
    import pandas as pd

    grammar = Grammar(
        r"""
        schools         = (school_block / ws)+

        school_block    = school_header ws grade_block+
        grade_block     = grade_header ws name_header ws (number_name)+ ws score_header ws (number_score)+ ws?

        school_header   = ~"^School = (.*)"m
        grade_header    = ~"^Grade = (\d+)"m

        name_header     = "Student number, Name"
        score_header    = "Student number, Score"

        number_name     = index comma name ws
        number_score    = index comma score ws

        comma           = ws? "," ws?

        index           = number+
        score           = number+

        number          = ~"\d+"
        name            = ~"[A-Z]\w+"
        ws              = ~"\s*"
        """
    )

    # `data` holds the raw text to be parsed
    tree = grammar.parse(data)

    class SchoolVisitor(NodeVisitor):
        output, names = ([], [])
        current_school, current_grade = None, None

        def _getName(self, idx):
            # look up the name recorded for a student number in the current grade
            for index, name in self.names:
                if index == idx:
                    return name

        def generic_visit(self, node, visited_children):
            return node.text or visited_children

        def visit_school_header(self, node, children):
            self.current_school = node.match.group(1)

        def visit_grade_header(self, node, children):
            self.current_grade = node.match.group(1)
            self.names = []

        def visit_number_name(self, node, children):
            index, name = None, None
            for child in node.children:
                if child.expr.name == 'name':
                    name = child.text
                elif child.expr.name == 'index':
                    index = child.text

            self.names.append((index, name))

        def visit_number_score(self, node, children):
            index, score = None, None
            for child in node.children:
                if child.expr.name == 'index':
                    index = child.text
                elif child.expr.name == 'score':
                    score = child.text

            name = self._getName(index)

            # build the entire entry
            entry = (self.current_school, self.current_grade, index, name, score)
            self.output.append(entry)

    sv = SchoolVisitor()
    sv.visit(tree)

    df = pd.DataFrame.from_records(sv.output, columns = ['School', 'Grade', 'Student number', 'Name', 'Score'])
    print(df)
Regex option (original answer)
Well, I was watching The Lord of the Rings for the tenth time, so I had some time to kill before the grand finale:
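Broken down, the idea is to split the problem into several smaller problems:
- Separate each school
- ... each grade
- ... students and scores
- ... then bind them together in a dataframe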
The school part (see a demo on regex101.com)

    ^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)
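To see the school pattern at work, here is a minimal sketch on a made-up input with two schools. The `sample` text is an assumption about the file layout, inferred from the patterns in this answer; it is not the data from the question.

```python
import re

# Hypothetical sample input; the layout is inferred from the patterns in this answer.
sample = (
    "School = Riverdale High\n"
    "Grade = 1\n"
    "Student number, Name\n"
    "0, Phoebe\n"
    "\n"
    "Student number, Score\n"
    "0, 3\n"
    "School = Hogwarts\n"
    "Grade = 1\n"
    "Student number, Name\n"
    "0, Ginny\n"
    "\n"
    "Student number, Score\n"
    "0, 8\n"
)

# re.MULTILINE lets ^ match at the start of every line, so the lookahead
# (?=^School|\Z) stops the lazy [\s\S]+? at the next school header or at the end.
rx_school = re.compile(
    r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)',
    re.MULTILINE,
)

matches = list(rx_school.finditer(sample))
school_names = [m.group('school_name') for m in matches]
print(school_names)
```

Each match also carries the whole block for that school in `school_content`, which is what the grade pattern is later run against.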
The grade part (another demo on regex101.com)

    ^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)
The student/score part (last demo on regex101.com):

    ^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)
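As a rough illustration of what the two named groups capture, the sketch below runs the student/score pattern on a single made-up grade block. The `grade_block` text is an assumed example, and `splitlines()` is used here for brevity in place of the slicing and splitting done in the full code.

```python
import re

rx_student_score = re.compile(
    r'^Student number, Name[\n\r]'
    r'(?P<student_names>(?:^\d+.+[\n\r])+)'
    r'\s*'
    r'^Student number, Score[\n\r]'
    r'(?P<student_scores>(?:^\d+.+[\n\r])+)',
    re.MULTILINE,
)

# Hypothetical content of one grade; layout assumed from the pattern itself.
grade_block = (
    "Student number, Name\n"
    "0, Phoebe\n"
    "1, Rachel\n"
    "\n"
    "Student number, Score\n"
    "0, 3\n"
    "1, 7\n"
)

m = rx_student_score.search(grade_block)
names = m.group('student_names').splitlines()
scores = m.group('student_scores').splitlines()

# Pair the n-th name line with the n-th score line.
pairs = [(n.split(", ")[1], s.split(", ")[1]) for n, s in zip(names, scores)]
print(pairs)
```

Note that the pairing relies on both lists being in the same order; it does not match rows by student number.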
The rest is a generator expression which is then fed into the DataFrame constructor (along with the column names).
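The generator expression leans on a small idiom: a `for x in [expr]` clause over a single-element list binds `x` to a computed value inside the expression. A minimal sketch with made-up records (the `records` list and variable names are illustrative only):

```python
# Hypothetical "number, name" lines, as they would come out of a regex group.
records = ["0, Phoebe", "1, Rachel"]

rows = (
    (number, name)
    for record in records
    for number in [record.split(", ")[0]]  # one-element list: binds `number`
    for name in [record.split(", ")[1]]    # one-element list: binds `name`
)

rows_list = list(rows)
print(rows_list)
```

Because each inner list has exactly one element, these extra clauses do not multiply the number of rows; they only give names to intermediate values.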
The code:

    import pandas as pd, re

    rx_school = re.compile(r'''
        ^
        School\s*=\s*(?P<school_name>.+)
        (?P<school_content>[\s\S]+?)
        (?=^School|\Z)
    ''', re.MULTILINE | re.VERBOSE)

    rx_grade = re.compile(r'''
        ^
        Grade\s*=\s*(?P<grade>.+)
        (?P<students>[\s\S]+?)
        (?=^Grade|\Z)
    ''', re.MULTILINE | re.VERBOSE)

    # in VERBOSE mode literal spaces must be escaped
    rx_student_score = re.compile(r'''
        ^
        Student\ number,\ Name[\n\r]
        (?P<student_names>(?:^\d+.+[\n\r])+)
        \s*
        ^
        Student\ number,\ Score[\n\r]
        (?P<student_scores>(?:^\d+.+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)

    # `string` holds the raw text to be parsed
    result = ((school.group('school_name'), grade.group('grade'), student_number, name, score)
        for school in rx_school.finditer(string)
        for grade in rx_grade.finditer(school.group('school_content'))
        for student_score in rx_student_score.finditer(grade.group('students'))
        for student in zip(student_score.group('student_names')[:-1].split("\n"), student_score.group('student_scores')[:-1].split("\n"))
        for student_number in [student[0].split(", ")[0]]
        for name in [student[0].split(", ")[1]]
        for score in [student[1].split(", ")[1]]
    )

    df = pd.DataFrame(result, columns = ['School', 'Grade', 'Student number', 'Name', 'Score'])
    print(df)

Condensed:
    rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
    rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
    rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)
This yields

                School Grade Student number      Name Score
    0   Riverdale High     1              0    Phoebe     3
    1   Riverdale High     1              1    Rachel     7
    2   Riverdale High     2              0    Angela     6
    3   Riverdale High     2              1   Tristan     3
    4   Riverdale High     2              2    Aurora     9
    5         Hogwarts     1              0     Ginny     8
    6         Hogwarts     1              1      Luna     7
    7         Hogwarts     2              0     Harry     5
    8         Hogwarts     2              1  Hermione    10
    9         Hogwarts     3              0      Fred     0
    10        Hogwarts     3              1    George     0
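As for timing, this is the result of running it ten thousand times:

    import timeit
    # makedf is presumably a function wrapping the regex code above
    print(timeit.timeit(makedf, number=10**4))
    # 11.918397722000009 s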
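For reference, `timeit.timeit` accepts a zero-argument callable and returns the total time in seconds for `number` calls. The sketch below uses a trivial stand-in named `makedf`; its body is an assumption for illustration and is not the function that was actually timed above.

```python
import timeit

def makedf():
    # Hypothetical stand-in: the real function would run the regexes
    # and build the DataFrame. Here we only split two made-up lines.
    return [line.split(", ") for line in "0, Phoebe\n1, Rachel".splitlines()]

# Total wall-clock time in seconds for 10,000 calls (not a per-call average).
elapsed = timeit.timeit(makedf, number=10**4)
print(f"{elapsed:.3f} s for 10,000 runs")
```

Dividing `elapsed` by `10**4` gives a rough per-call figure; the absolute numbers depend heavily on the machine and the input size.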



