由于正则表达式是由有限状态机定义的,所以我想知道是否有某种东西可以在此类机器上自动推理,并且是否适合将其用于这项工作……并提供了clojure.core.logic
因此,我查看了正则表达式语法的定义(不幸的是,它缺少{}量词,但它们应该很容易添加到我的代码中)使其适应了Java转义,并得出了以下110行长的clojure程序:
(ns regexp-unfolder.core (:require [instaparse.core :as insta]) (:require [clojure.core.logic :as l]) (:require [clojure.set :refer [union difference]]) (:gen-class :methods [#^{:static true} [unfold [String] clojure.lang.LazySeq]]))(def parse-regexp (insta/parser "re = union | simple-re? union = re '|' simple-re simple-re = concat | base-re concat = simple-re base-re base-re = elementary-re | star | plus star = elementary-re '*' plus = elementary-re '+' elementary-re = group | char | '$' | any | set any = '.' group = '(' re ')' set = positive-set | negative-set positive-set = '[' set-items ']' negative-set = '[^' set-items ']' set-items = set-item* set-item = range | char range = char '-' char char = #'[^\\\-\[\]]|\.'" ))(def printables (set (map char (range 32 127))))(declare fns handle-first)(defn handle-tree [q qto [ type & nodes]] (if (nil? nodes) [[q [""] qto]] ((fns type handle-first) q qto nodes)))(defn star [q qto node &] (cons [q [""] qto] (handle-tree q q (first node))))(defn plus [q qto node &] (concat (handle-tree q qto (first node)) (handle-tree qto qto (first node))))(defn any-char [q qto & _] [[q (vec printables) qto]] )(defn char-range [[c1 _ c2]] (let [extract-char (comp int first seq second)] (set (map char (range (extract-char c1) (inc (extract-char c2)))))))(defn items [nodes] (union (mapcat (fn [[_ [type & ns]]] (if (= type :char) #{(first ns)} (char-range ns))) (rest (second nodes)))))(defn handle-set [q qto node &] [[q (vec (items node)) qto]])(defn handle-negset [q qto node &] [[q (vec (difference printables (items node))) qto]])(defn handle-range [q qto & nodes] [[q (vec (char-range nodes)) qto]])(defn handle-char [q qto node &] [[q (vec node) qto]] )(defn handle-concat [q qto nodes] (let [syms (for [x (rest nodes)] (gensym q))] (mapcat handle-tree (cons q syms) (concat syms [qto] ) nodes) ))(defn handle-first [q qto [node & _]] (handle-tree q qto node))(def fns {:concat handle-concat, :star star, :plus plus, :any any-char, :positive-set handle-set, :negative-set handle-negset, :char handle-char})(l/defne transition-membero [state trans newstate otransition] ([_ _ _ [state trans-set newstate]] (l/membero trans trans-set)))(defn transitiono [state trans newstate transitions] (l/conde [(l/fresh [f] (l/firsto transitions f) (transition-membero state trans newstate f))] [(l/fresh [r] (l/resto transitions r) (transitiono state trans newstate r))]) )(declare transitions);; Recognize a regexp finite state machine enpred in triplets [state, transition, next-state], adapted from a snippet made by Peteris Erins(defn recognizeo ([input] (recognizeo 'q0 input)) ([q input] (l/matche [input] ; start pattern matching on the input (['("")](l/== q 'ok)) ; accept the empty string if we are in an accepting state ([[i . nput]](l/fresh [qto] (transitiono q i qto transitions) ; assert it must be what we transition to qto from q with input symbol i (recognizeo qto nput)))))) ; recognize the remainder(defn -unfold [regex] (def transitions (handle-tree 'q0 'ok (parse-regexp regex))) (map (partial apply str) (l/run* [q] (recognizeo q))))用core.logic编写,使其适应正则表达式匹配器应该相当容易
我将可打印字符的范围限制在32到126个ascii之间,否则处理regexp这样的正则表达式会很麻烦
[^c],但是您可以很轻松地扩展它……而且,我还没有实现联合,可选模式和
w, s等字符类的转义符
到目前为止,这是我在clojure中编写的最大的事情,但是基本知识似乎还不错。
regexp-unfolder.core=> (-unfold "ba[rz]")("bar" "baz")regexp-unfolder.core=> (-unfold "[a-z3-7]")("a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" "3" "4" "5" "6" "7")regexp-unfolder.core=> (-unfold "[a-z3-7][01]")("a0" "a1" "b0" "b1" "c0" "c1" "d0" "d1" "e0" "e1" "f0" "f1" "g0" "g1" "h0" "h1" "i0" "i1" "j0" "j1" "k0" "k1" "l0" "l1" "m0" "m1" "n0" "n1" "o0" "o1" "p0" "p1" "q0" "q1" "r0" "r1" "s0" "s1" "t0" "t1" "u0" "u1" "v0" "v1" "w0" "w1" "x0" "x1" "y0" "y1" "z0" "z1" "30" "31" "40" "41" "50" "51" "60" "70" "61" "71")regexp-unfolder.core=> (-unfold "[^A-z]")(" " "@" "!" """ "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/" "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" ":" ";" "{" "<" "|" "=" "}" ">" "~" "?")regexp-unfolder.core=> (take 20 (-unfold "[abc]*"))("" "a" "b" "c" "aa" "ab" "ac" "ba" "ca" "aaa" "bb" "cb" "aab" "bc" "cc" "aac" "aba" "aca" "baa" "caa")regexp-unfolder.core=> (take 20 (-unfold "a+b+"))("ab" "aab" "abb" "abbb" "aaab" "abbbb" "aabb" "abbbbb" "abbbbbb" "aabbb" "abbbbbbb" "abbbbbbbb" "aaaab" "aabbbb" "aaabb" "abbbbbbbbb" "abbbbbbbbbb" "aabbbbb" "abbbbbbbbbbb" "abbbbbbbbbbbb")自从我开始这种方式以来,我还实现了无限的输出:)
如果有人感兴趣,我在这里上传了
显然,这是一个如何
unfold从普通的旧Java 调用的示例:
import static regexp_unfolder.core.unfold;public class UnfolderExample{ public static void main(String[] args){ @SuppressWarnings("unchecked") Iterable<String> strings = unfold("a+b+"); for (String s : strings){ System.out.println(s); } }}


