How can I efficiently parse a fixed-width file?

Using the Python standard library's struct module is fairly easy, and because the module is written in C it is also very fast.

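Here is how it could be used to do what you want. It also lets you skip columns of characters by specifying a negative value for the number of characters in a field.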

    import struct

    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    # build a struct format: 'Ns' reads N bytes as a field, 'Nx' skips N pad bytes
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    parse = fieldstruct.unpack_from
    print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

    # as written this targets Python 2, where str is a byte string
    line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
    fields = parse(line)
    print('fields: {}'.format(fields))

Output:

    fmtstring: '2s 10x 24s', recsize: 36 chars
    fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

The following modifications adapt it to work in either Python 2 or 3 (and to handle Unicode input):

    import struct
    import sys

    fieldstruct = struct.Struct(fmtstring)
    if sys.version_info[0] < 3:
        parse = fieldstruct.unpack_from
    else:
        # converts unicode input to byte string and results back to unicode string
        unpack = fieldstruct.unpack_from
        parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
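For readers who only need Python 3, the encode/unpack/decode steps can be wrapped in a small factory. This is a minimal sketch, not part of the original answer: the names `make_struct_parser` and `encoding` are hypothetical, and the length check is an added assumption about how short lines should be handled. Note that struct counts bytes, so the field widths only line up with character positions when every character encodes to a single byte.

```python
import struct


def make_struct_parser(fieldwidths, encoding="ascii"):
    """Build a parser for text lines; negative widths mark skipped padding.

    Hypothetical helper for illustration. Widths are measured in bytes
    after encoding, not in characters.
    """
    fmt = " ".join(
        "{}{}".format(abs(fw), "x" if fw < 0 else "s") for fw in fieldwidths
    )
    record = struct.Struct(fmt)

    def parse(line):
        data = line.encode(encoding)
        # unpack_from needs at least record.size bytes in the buffer
        if len(data) < record.size:
            raise ValueError(
                "line is {} bytes, expected at least {}".format(
                    len(data), record.size
                )
            )
        return tuple(f.decode(encoding) for f in record.unpack_from(data))

    return parse


parse = make_struct_parser((2, -10, 24))
lines = [
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n",
    "abcdefghijklmnopqrstuvwxyz0123456789\n",
]
records = [parse(line) for line in lines]
for rec in records:
    print(rec)
```

The same `parse` callable could be applied to each line of an open file; trailing characters beyond the record size, such as the newline, are simply ignored by `unpack_from`.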

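Here is a way to do it with string slices, which you were considering but worried might get too ugly. Besides not being all that ugly, it has the advantage of working unchanged in both Python 2 and 3 and of handling Unicode strings. Speed-wise it is, of course, slower than the versions based on the struct module, but it could be sped up slightly by removing the ability to have padding fields.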

    try:
        from itertools import izip_longest  # added in Py 2.6
    except ImportError:
        from itertools import zip_longest as izip_longest  # name change in Py 3.x

    try:
        from itertools import accumulate  # added in Py 3.2
    except ImportError:
        def accumulate(iterable):
            'Return running totals (simplified version).'
            total = next(iterable)
            yield total
            for value in iterable:
                total += value
                yield total

    def make_parser(fieldwidths):
        # cumulative end offsets of each field, padding included
        cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
        pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
        flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
        parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
        # optional informational function attributes
        parse.size = sum(abs(fw) for fw in fieldwidths)
        parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                   for fw in fieldwidths)
        return parse

    line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
    fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
    parse = make_parser(fieldwidths)
    fields = parse(line)
    print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
    print('fields: {}'.format(fields))

Output:

    format: '2s 10x 24s', rec size: 36 chars
    fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
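As for dropping padding support to gain a little speed, one possible shape for that variant is sketched below. It is an illustration rather than the original author's code: `make_simple_parser` is a hypothetical name, it assumes Python 3.2 or later for `itertools.accumulate`, it accepts positive widths only, and no timings were measured here. Every column is returned, so the caller discards the ones it does not need.

```python
from itertools import accumulate  # available in Python 3.2+


def make_simple_parser(fieldwidths):
    """Slice-based parser for positive widths only (no padding fields).

    Hypothetical simplified variant for illustration.
    """
    cuts = tuple(accumulate(fieldwidths))    # end offset of each field
    bounds = tuple(zip((0,) + cuts, cuts))   # (start, end) pair per field
    return lambda line: tuple(line[i:j] for i, j in bounds)


parse = make_simple_parser((2, 10, 24))
fields = parse("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n")
print(fields)
```

With the per-field padding test gone, the slicing loop does no conditional work; whether that is a measurable gain would depend on the data and should be checked with a profiler.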
