It looks like you are storing Unit.DailyAvailablity using the text data type (which is also the default for strings when you rely on dynamic mapping). You should consider using the keyword data type instead.

Let me explain in detail.
Why does my regexp match something in the middle of a text field?
What happens with the text data type is that the data gets analyzed for full-text search. The analyzer applies some transformations, such as lowercasing and splitting into tokens.
Let's try the Analyze API on your input:
POST _analyze
{
  "text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}

The response is:
{
  "tokens": [
    {
      "token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
      "start_offset": 0,
      "end_offset": 255,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 255,
      "end_offset": 510,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
      "start_offset": 510,
      "end_offset": 732,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

As you can see, Elasticsearch split your input into three tokens and lowercased them. This may look unexpected, but it makes sense once you consider that the analyzer is actually trying to facilitate searching for words in human languages, and no human-language word is that long.
This is why the regexp query ".{7}a{7}.*" now matches: there is a token that really does consist of a bunch of a's, so the regexp query is behaving as designed.
… Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.
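The quoted behavior can be sketched in Python. This is a deliberately crude stand-in for the standard analyzer, not the real thing: it only lowercases and splits into chunks of at most 255 characters (the tokenizer's max_token_length), which happens to be what the analyzer does to this particular all-letters input. The anchored, per-term matching is then `re.fullmatch` against each token:

```python
import re

def analyze(text: str, max_token_length: int = 255) -> list[str]:
    # Crude stand-in for the standard analyzer on a single run of
    # letters: lowercase, then split into tokens of at most
    # max_token_length characters.
    lowered = text.lower()
    return [lowered[i:i + max_token_length]
            for i in range(0, len(lowered), max_token_length)]

def regexp_query_matches(pattern: str, text: str) -> bool:
    # A regexp query is applied to each indexed term (token),
    # anchored at both ends -- not to the original string.
    return any(re.fullmatch(pattern, token) for token in analyze(text))

# Toy stand-in for the long input from the post.
doc = "UIAOUUUUUUUIA" + "A" * 300

print(regexp_query_matches(r".{7}a{7}.*", doc))      # True: one token is all a's
print(regexp_query_matches(r"UIAOUUUUUUUIA.*", doc)) # False: terms are lowercased
```

The second call shows why the original uppercase pattern cannot match against an analyzed text field: by the time the regexp runs, every term is already lowercase.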
How to make the regexp query consider the whole string?

It's very simple: do not analyze the string. The keyword type stores the string you provide as-is.
Use a mapping like this:

PUT my_regexes
{
  "mappings": {
    "doc": {
      "properties": {
        "Unit": {
          "properties": {
            "DailyAvailablity": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

You will then be able to run a query like this one, which matches the document from your post:

POST my_regexes/doc/_search
{
  "query": {
    "bool": {
      "filter": [
        { "regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" } }
      ]
    }
  }
}

Note that the query has become case-sensitive, because the field is no longer analyzed.
This regexp will no longer return any results:

".{12}a{7}.*"

while this one will:

".{12}A{7}.*"

What about anchoring?
Regexp queries are anchored:

Lucene's patterns are always anchored. The pattern provided must match the entire string.

So the anchoring only looked wrong: most likely this was because the tokens were being split apart in the analyzed text field.
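In Python terms, Lucene's anchored matching behaves like `re.fullmatch`, not `re.search`. A small sketch against a prefix of the raw keyword value from the post (the prefix is shortened here for readability):

```python
import re

# A prefix of the raw, unanalyzed keyword value from the post.
value = "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAO"

# Anchored matching: the pattern must cover the entire string.
print(bool(re.fullmatch(r".{12}a{7}.*", value)))  # False: lowercase 'a' vs stored 'A'
print(bool(re.fullmatch(r".{12}A{7}.*", value)))  # True: case matches, whole string covered

# Without ".*" padding, a partial pattern fails under anchored
# matching even though the substring is clearly present:
print(bool(re.fullmatch(r"A{7}", value)))         # False: does not cover the whole string
print(bool(re.search(r"A{7}", value)))            # True: search is unanchored
```

This is why patterns sent to a regexp query need the leading/trailing ".*" if they are only meant to match part of the stored value.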
Hope that helps!



