[假设您需要这个,因为它是一些需要regexp的奇怪的第三方系统]
新的方法
我对弗雷德里克的评论的思考越多,我就越同意。即使输入字符串很长,regexp引擎也应该能够将其编译为紧凑的DFA。在许多情况下,以下是一个明智的解决方案:
import redef regexp(lo, hi): fmt = '%%0%dd' % len(str(hi)) return re.compile('(%s)' % '|'.join(fmt % i for i in range(lo, hi+1)))(它适用于以下测试中的所有数值范围,包括99519000-99519099。粗略的信封计算表明9位数字约为1GB内存的限制。匹配;如果只有几个匹配,则可以更大。)
旧方法
[再次更新以提供更短的结果-除了合并偶尔出现
dd的效果与手工生成的效果差不多之外]
假设所有数字的长度相同(即,如有必要,则在左侧使用零填充),这可以实现:
import redef alt(*args): '''format regexp alternatives''' if len(args) == 1: return args[0] else: return '(%s)' % '|'.join(args)def replace(s, c): '''replace all characters in a string with a different character''' return ''.join(map(lambda x: c, s))def repeat(s, n): '''format a regexp repeat''' if n == 0: return '' elif n == 1: return s else: return '%s{%d}' % (s, n)def digits(lo, hi): '''format a regexp digit range''' if lo == 0 and hi == 9: return r'd' elif lo == hi: return str(lo) else: return '[%d-%d]' % (lo, hi)def trace(f): '''for debugging''' def wrapped(lo, hi): result = f(lo, hi) print(lo, hi, result) return result return wrapped#@trace # uncomment to get calls traced to stdout (explains recursion when bug hunting)def regexp(lo, hi): '''generate a regexp that matches integers from lo to hi only. assumes that inputs are zero-padded to the length of hi (like phone numbers). you probably want to surround with ^ and $ before using.''' assert lo <= hi assert lo >= 0 slo, shi = str(lo), str(hi) # zero-pad to same length while len(slo) < len(shi): slo = '0' + slo # first digits and length l, h, n = int(slo[0]), int(shi[0]), len(slo) if l == h: # extract common prefix common = '' while slo and slo[0] == shi[0]: common += slo[0] slo, shi = slo[1:], shi[1:] if slo: return common + regexp(int(slo), int(shi)) else: return common else: # the core of the routine. # split into 'complete blocks' like 200-599 and 'edge cases' like 123-199 # and handle each separately. # are these complete blocks? xlo = slo[1:] == replace(slo[1:], '0') xhi = shi[1:] == replace(shi[1:], '9') # edges of possible complete blocks mlo = int(slo[0] + replace(slo[1:], '9')) mhi = int(shi[0] + replace(shi[1:], '0')) if xlo: if xhi: # complete block on both sides # this is where single digits are finally handled, too. return digits(l, h) + repeat('d', n-1) else: # complete block to mhi, plus extra on hi side prefix = '' if l or h-1 else '0' return alt(prefix + regexp(lo, mhi-1), regexp(mhi, hi)) else: prefix = '' if l else '0' if xhi: # complete block on hi side plus extra on lo return alt(prefix + regexp(lo, mlo), regexp(mlo+1, hi)) else: # neither side complete, so add extra on both sides # (and maybe a complete block in the middle, if room) if mlo + 1 == mhi: return alt(prefix + regexp(lo, mlo), regexp(mhi, hi)) else: return alt(prefix + regexp(lo, mlo), regexp(mlo+1, mhi-1), regexp(mhi, hi))# test a bunch of different rangesfor (lo, hi) in [(0, 0), (0, 1), (0, 2), (0, 9), (0, 10), (0, 11), (0, 101), (1, 1), (1, 2), (1, 9), (1, 10), (1, 11), (1, 101), (0, 123), (111, 123), (123, 222), (123, 333), (123, 444), (0, 321), (111, 321), (222, 321), (321, 333), (321, 444), (123, 321), (111, 121), (121, 222), (1234, 4321), (0, 999), (99519000, 99519099)]: fmt = '%%0%dd' % len(str(hi)) rx = regexp(lo, hi) print('%4s - %-4s %s' % (fmt % lo, fmt % hi, rx)) m = re.compile('^%s$' % rx) for i in range(0, 1+int(replace(str(hi), '9'))): if m.match(fmt % i): assert lo <= i <= hi, i else: assert i < lo or i > hi, i该函数
regexp(lo,hi)会建立一个正则表达式,以匹配介于
lo和之间的值
hi(零填充到最大长度)。您可能需要在
^前后放置一个
$字符串(如测试代码中所示),以将匹配项强制为整个字符串。
该算法实际上非常简单-
它以递归方式将事物划分为通用前缀和“完整块”。完整的区块类似于200-599,并且可以可靠地匹配(在这种情况下为
[2-5]d{2})。因此123-599被分为123-199和200-599。后半部分是一个完整的块,上半部分有一个公共前缀1和23-99,将其递归处理为23-29(公共前缀)和30-99(完整块)(我们最终终止,因为参数每个通话的时间都比初始输入短)。
唯一令人讨厌的细节是
prefix,这是必需的,因为to的参数
regexp()是整数,因此在调用生成00-09的regexp时,实际上会生成0-9的regexp,而前导0除外。
输出是一堆测试用例,显示了范围和正则表达式:
0 - 0 0 0 - 1 [0-1] 0 - 2 [0-2] 0 - 9 d 00 - 10 (0d|10) 00 - 11 (0d|1[0-1]) 000 - 101 (0dd|10[0-1]) 1 - 1 1 1 - 2 [1-2] 1 - 9 [1-9] 01 - 10 (0[1-9]|10) 01 - 11 (0[1-9]|1[0-1]) 001 - 101 (0(0[1-9]|[1-9]d)|10[0-1]) 000 - 123 (0dd|1([0-1]d|2[0-3])) 111 - 123 1(1[1-9]|2[0-3]) 123 - 222 (1(2[3-9]|[3-9]d)|2([0-1]d|2[0-2])) 123 - 333 (1(2[3-9]|[3-9]d)|2dd|3([0-2]d|3[0-3])) 123 - 444 (1(2[3-9]|[3-9]d)|[2-3]d{2}|4([0-3]d|4[0-4])) 000 - 321 ([0-2]d{2}|3([0-1]d|2[0-1])) 111 - 321 (1(1[1-9]|[2-9]d)|2dd|3([0-1]d|2[0-1])) 222 - 321 (2(2[2-9]|[3-9]d)|3([0-1]d|2[0-1])) 321 - 333 3(2[1-9]|3[0-3]) 321 - 444 (3(2[1-9]|[3-9]d)|4([0-3]d|4[0-4])) 123 - 321 (1(2[3-9]|[3-9]d)|2dd|3([0-1]d|2[0-1])) 111 - 121 1(1[1-9]|2[0-1]) 121 - 222 (1(2[1-9]|[3-9]d)|2([0-1]d|2[0-2]))1234 - 4321 (1(2(3[4-9]|[4-9]d)|[3-9]d{2})|[2-3]d{3}|4([0-2]d{2}|3([0-1]d|2[0-1]))) 000 - 999 dd{2}99519000 - 99519099 995190dd最后的测试循环超过99999999个数字需要一段时间。
表达式应该足够紧凑以避免任何缓冲区限制(我想最坏情况下的内存大小与最大数位的平方成正比)。
PS:我使用的是python 3,但我认为在这里没有太大的区别。



