python：正则表达式匹配字符串的数字范围

[假设您需要这个，因为它是一些需要regexp的奇怪的第三方系统]

新的方法

我对弗雷德里克的评论的思考越多，我就越同意。即使输入字符串很长，regexp引擎也应该能够将其编译为紧凑的DFA。在许多情况下，以下是一个明智的解决方案：

import redef regexp(lo, hi):    fmt = '%%0%dd' % len(str(hi))    return re.compile('(%s)' % '|'.join(fmt % i for i in range(lo, hi+1)))

（它适用于以下测试中的所有数值范围，包括99519000-99519099。粗略的信封计算表明9位数字约为1GB内存的限制。匹配；如果只有几个匹配，则可以更大。）

旧方法

[再次更新以提供更短的结果-除了合并偶尔出现

dd

的效果与手工生成的效果差不多之外]

假设所有数字的长度相同（即，如有必要，则在左侧使用零填充），这可以实现：

import redef alt(*args):    '''format regexp alternatives'''    if len(args) == 1: return args[0]    else: return '(%s)' % '|'.join(args)def replace(s, c):      '''replace all characters in a string with a different character'''    return ''.join(map(lambda x: c, s))def repeat(s, n):    '''format a regexp repeat'''    if n == 0: return ''    elif n == 1: return s    else: return '%s{%d}' % (s, n)def digits(lo, hi):     '''format a regexp digit range'''    if lo == 0 and hi == 9: return r'd'    elif lo == hi: return str(lo)    else: return '[%d-%d]' % (lo, hi)def trace(f):    '''for debugging'''    def wrapped(lo, hi):        result = f(lo, hi)        print(lo, hi, result)        return result    return wrapped#@trace  # uncomment to get calls traced to stdout (explains recursion when bug hunting)def regexp(lo, hi):    '''generate a regexp that matches integers from lo to hi only.       assumes that inputs are zero-padded to the length of hi (like phone numbers).       you probably want to surround with ^ and $ before using.'''    assert lo <= hi    assert lo >= 0    slo, shi = str(lo), str(hi)    # zero-pad to same length    while len(slo) < len(shi): slo = '0' + slo    # first digits and length    l, h, n = int(slo[0]), int(shi[0]), len(slo)    if l == h:        # extract common prefix        common = ''        while slo and slo[0] == shi[0]: common += slo[0] slo, shi = slo[1:], shi[1:]        if slo: return common + regexp(int(slo), int(shi))        else: return common    else:        # the core of the routine.        # split into 'complete blocks' like 200-599 and 'edge cases' like 123-199        # and handle each separately.        # are these complete blocks?        xlo = slo[1:] == replace(slo[1:], '0')        xhi = shi[1:] == replace(shi[1:], '9')        # edges of possible complete blocks        mlo = int(slo[0] + replace(slo[1:], '9'))        mhi = int(shi[0] + replace(shi[1:], '0'))        if xlo: if xhi:     # complete block on both sides     # this is where single digits are finally handled, too.     return digits(l, h) + repeat('d', n-1) else:     # complete block to mhi, plus extra on hi side     prefix = '' if l or h-1 else '0'     return alt(prefix + regexp(lo, mhi-1), regexp(mhi, hi))        else: prefix = '' if l else '0' if xhi:     # complete block on hi side plus extra on lo     return alt(prefix + regexp(lo, mlo), regexp(mlo+1, hi)) else:     # neither side complete, so add extra on both sides     # (and maybe a complete block in the middle, if room)     if mlo + 1 == mhi:         return alt(prefix + regexp(lo, mlo), regexp(mhi, hi))     else:         return alt(prefix + regexp(lo, mlo), regexp(mlo+1, mhi-1), regexp(mhi, hi))# test a bunch of different rangesfor (lo, hi) in [(0, 0), (0, 1), (0, 2), (0, 9), (0, 10), (0, 11), (0, 101),      (1, 1), (1, 2), (1, 9), (1, 10), (1, 11), (1, 101),      (0, 123), (111, 123), (123, 222), (123, 333), (123, 444),      (0, 321), (111, 321), (222, 321), (321, 333), (321, 444),      (123, 321), (111, 121), (121, 222), (1234, 4321), (0, 999),      (99519000, 99519099)]:    fmt = '%%0%dd' % len(str(hi))    rx = regexp(lo, hi)    print('%4s - %-4s  %s' % (fmt % lo, fmt % hi, rx))    m = re.compile('^%s$' % rx)    for i in range(0, 1+int(replace(str(hi), '9'))):        if m.match(fmt % i): assert lo <= i <= hi, i        else: assert i < lo or i > hi, i

该函数

regexp(lo,hi)

会建立一个正则表达式，以匹配介于

lo

和之间的值

hi

（零填充到最大长度）。您可能需要在

前后放置一个

字符串（如测试代码中所示），以将匹配项强制为整个字符串。

该算法实际上非常简单-
它以递归方式将事物划分为通用前缀和“完整块”。完整的区块类似于200-599，并且可以可靠地匹配（在这种情况下为

[2-5]d{2}

）。

因此123-599被分为123-199和200-599。后半部分是一个完整的块，上半部分有一个公共前缀1和23-99，将其递归处理为23-29（公共前缀）和30-99（完整块）（我们最终终止，因为参数每个通话的时间都比初始输入短）。

唯一令人讨厌的细节是

prefix

，这是必需的，因为to的参数

regexp()

是整数，因此在调用生成00-09的regexp时，实际上会生成0-9的regexp，而前导0除外。

输出是一堆测试用例，显示了范围和正则表达式：

   0 - 0     0   0 - 1     [0-1]   0 - 2     [0-2]   0 - 9     d  00 - 10    (0d|10)  00 - 11    (0d|1[0-1]) 000 - 101   (0dd|10[0-1])   1 - 1     1   1 - 2     [1-2]   1 - 9     [1-9]  01 - 10    (0[1-9]|10)  01 - 11    (0[1-9]|1[0-1]) 001 - 101   (0(0[1-9]|[1-9]d)|10[0-1]) 000 - 123   (0dd|1([0-1]d|2[0-3])) 111 - 123   1(1[1-9]|2[0-3]) 123 - 222   (1(2[3-9]|[3-9]d)|2([0-1]d|2[0-2])) 123 - 333   (1(2[3-9]|[3-9]d)|2dd|3([0-2]d|3[0-3])) 123 - 444   (1(2[3-9]|[3-9]d)|[2-3]d{2}|4([0-3]d|4[0-4])) 000 - 321   ([0-2]d{2}|3([0-1]d|2[0-1])) 111 - 321   (1(1[1-9]|[2-9]d)|2dd|3([0-1]d|2[0-1])) 222 - 321   (2(2[2-9]|[3-9]d)|3([0-1]d|2[0-1])) 321 - 333   3(2[1-9]|3[0-3]) 321 - 444   (3(2[1-9]|[3-9]d)|4([0-3]d|4[0-4])) 123 - 321   (1(2[3-9]|[3-9]d)|2dd|3([0-1]d|2[0-1])) 111 - 121   1(1[1-9]|2[0-1]) 121 - 222   (1(2[1-9]|[3-9]d)|2([0-1]d|2[0-2]))1234 - 4321  (1(2(3[4-9]|[4-9]d)|[3-9]d{2})|[2-3]d{3}|4([0-2]d{2}|3([0-1]d|2[0-1]))) 000 - 999   dd{2}99519000 - 99519099  995190dd

最后的测试循环超过99999999个数字需要一段时间。

表达式应该足够紧凑以避免任何缓冲区限制（我想最坏情况下的内存大小与最大数位的平方成正比）。

PS：我使用的是python 3，但我认为在这里没有太大的区别。

python：正则表达式匹配字符串的数字范围

面试问答相关栏目本月热门文章