Elasticsearch: Built-in Analyzers and Custom Analyzers

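Every analyzed field goes through a series of processing steps:

Character filtering: character filters transform the raw text, for example masking sensitive words or expanding abbreviations into their full forms.
Tokenization: a tokenizer splits the text into one or more tokens.
Token filtering: token filters then process each token in turn.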

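An analyzer is assembled from these three kinds of components. It may have zero or more character filters, it must have exactly one tokenizer, and it may have zero or more token filters.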
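The order of these steps can be sketched in a few lines of Python. This is only an illustration of the data flow, not how Elasticsearch implements analysis. The analyze function and the three stand-in components (expand_ampersand, split_on_non_word, lowercase) are hypothetical names invented for this example.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-ins for real components (assumptions for illustration only).
def expand_ampersand(text: str) -> str:
    return text.replace("&", "and")


def split_on_non_word(text: str) -> List[str]:
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


print(analyze("Tom & Jerry", [expand_ampersand], split_on_non_word, [lowercase]))
# prints: ['tom', 'and', 'jerry']
```

Passing empty lists for the character filters and token filters leaves only the tokenizer, which is the one component that cannot be omitted.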

Elasticsearch ships with many built-in character filters, tokenizers, token filters, and analyzers, but only a handful of them are used regularly.

Character filters

There are not many character filters: Elasticsearch provides only three.

HTML strip character filter (HTML Strip Char Filter)

Strips HTML elements from the text and decodes HTML entities.

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
Output:
{
  "tokens" : [
    {
      "token" : """
I'm so happy!
""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}

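Mapping character filter (Mapping Char Filter)

Takes a map of keys and values. Whenever it encounters a string equal to a key, it replaces that string with the value associated with the key.

PUT pattern_test4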
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "xxx => ooo",
            "yyy => zzz"
          ]
        }
      }
    }
  }
}

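In the example above we define a custom analyzer that uses the keyword tokenizer together with a mapping character filter, which replaces xxx with ooo and yyy with zzz in the text.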

POST pattern_test4/_analyze
{
  "analyzer": "my_analyzer",
  "text": "xxx love yyy，可惜后来yyy结婚了"
}
Output:
{
  "tokens" : [
    {
      "token" : "ooo love zzz，可惜后来zzz结婚了",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}
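
The replacement rule can be imitated outside Elasticsearch with a short Python sketch. The function name mapping_char_filter is hypothetical, and trying longer keys first is an assumption meant to mirror the documented behavior that the longest matching key wins.

```python
import re
from typing import Dict


def mapping_char_filter(text: str, mappings: Dict[str, str]) -> str:
    """Replace every occurrence of a key with its mapped value."""
    if not mappings:
        return text
    # Try longer keys first so the longest key wins when keys overlap.
    keys = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mappings[match.group(0)], text)


print(mapping_char_filter("xxx love yyy", {"xxx": "ooo", "yyy": "zzz"}))
# prints: ooo love zzz
```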

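Pattern replace character filter (Pattern Replace Char Filter)

Uses a regular expression to match characters in the string and replace them. Be careful with poorly written regular expressions, because they can slow analysis down considerably.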

POST _analyze
{
  "analyzer": "standard",
  "text": "My credit card is 123-456-789"
}
Output:
{
  "tokens" : [
    {
      "token" : "my",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "credit",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "card",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "123",
      "start_offset" : 18,
      "end_offset" : 21,
      "type" : "<NUM>",
      "position" : 4
    },
    {
      "token" : "456",
      "start_offset" : 22,
      "end_offset" : 25,
      "type" : "<NUM>",
      "position" : 5
    },
    {
      "token" : "789",
      "start_offset" : 26,
      "end_offset" : 29,
      "type" : "<NUM>",
      "position" : 6
    }
  ]
}

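With this analysis, 123-456-789 is split into 123, 456, and 789. To keep 123-456-789 together as a single token, we can use a pattern replace character filter to replace the "-".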

PUT pattern_test5
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": """(\d+)-(?=\d)""",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST pattern_test5/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
Output:
{
  "tokens" : [
    {
      "token" : "My",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "credit",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "card",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "123_456_789",
      "start_offset" : 18,
      "end_offset" : 29,
      "type" : "<NUM>",
      "position" : 4
    }
  ]
}

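Replacing each "-" between digits with an underscore "_" keeps "123-456-789" together as the single token 123_456_789 instead of splitting it into 123, 456, and 789.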
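
The regular expression behind this filter is (\d+)-(?=\d): one or more digits, a hyphen, and a lookahead requiring another digit after the hyphen. The sketch below applies the same expression with Python's re module to show its effect in isolation. The function name join_digit_groups is hypothetical, and Python writes the capture group as \1 where Elasticsearch (Java) writes $1.

```python
import re

# One or more digits, a hyphen, and a lookahead that requires a digit after
# the hyphen. The lookahead does not consume the digit, so the next group of
# digits can start a new match.
DIGITS_THEN_HYPHEN = re.compile(r"(\d+)-(?=\d)")


def join_digit_groups(text: str) -> str:
    # \1 re-inserts the captured digits; the hyphen is replaced by "_".
    return DIGITS_THEN_HYPHEN.sub(r"\1_", text)


print(join_digit_groups("My credit card is 123-456-789"))
# prints: My credit card is 123_456_789
print(join_digit_groups("well-known 42-"))
# prints: well-known 42-
```

Hyphens that are not surrounded by digits are left alone, which is why the second call returns its input unchanged.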

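Tokenizers

Standard tokenizer (standard)

The standard tokenizer is a grammar-based tokenizer that works well for most European languages. It also handles the segmentation of Unicode text, and it removes punctuation such as commas and periods.

POST _analyze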
{
  "tokenizer": "standard",
  "text": "I have, potatoes."
}
Output:
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "have",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "potatoes",
      "start_offset" : 8,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

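Keyword tokenizer (keyword)

The keyword tokenizer is a simple tokenizer that does nothing to the text and passes it straight through: the whole input is handed to the token filters as a single token. It is useful when you only want to apply token filters without doing any tokenization.

POST _analyze
{
  "tokenizer": "keyword",
  "text": "I have, potatoes."
}
Output: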
{
  "tokens" : [
    {
      "token" : "I have, potatoes.",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}

Letter tokenizer (letter)

The letter tokenizer splits the text into tokens at every non-letter character.

POST _analyze
{
  "tokenizer": "letter",
  "text": "I have, potatoes."
}
Output:
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "have",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "potatoes",
      "start_offset" : 8,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    }
  ]
}

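Lowercase tokenizer (lowercase)

The lowercase tokenizer combines the behavior of the letter tokenizer and the lowercase token filter (which, as you would expect, converts each token to lowercase). The main reason for implementing this as a single tokenizer is that doing both operations in one pass performs better.

POST _analyze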
{
  "tokenizer": "lowercase",
  "text": "I have, potatoes."
}
Output:
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "have",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "potatoes",
      "start_offset" : 8,
      "end_offset" : 16,
      "type" : "word",
      "position" : 2
    }
  ]
}
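
The relationship between the letter tokenizer and the lowercase tokenizer can be approximated in Python. This is a rough sketch and not the Lucene implementation: the names letter_tokenize and lowercase_tokenize are hypothetical, and the character class used here is only an approximation of "a run of letters".

```python
import re
from typing import List

# A word character that is neither a decimal digit nor an underscore,
# used here as an approximation of "letter".
LETTER_RUN = re.compile(r"[^\W\d_]+")


def letter_tokenize(text: str) -> List[str]:
    """Split at every non-letter character, keeping the original case."""
    return LETTER_RUN.findall(text)


def lowercase_tokenize(text: str) -> List[str]:
    """Split and lowercase in a single pass over the matches."""
    return [match.group(0).lower() for match in LETTER_RUN.finditer(text)]


print(letter_tokenize("I have, potatoes."))     # prints: ['I', 'have', 'potatoes']
print(lowercase_tokenize("I have, potatoes."))  # prints: ['i', 'have', 'potatoes']
```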

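Whitespace tokenizer (whitespace)

The whitespace tokenizer separates tokens at whitespace, which includes spaces, tabs, and line breaks. Note that this tokenizer does not remove any punctuation.

POST _analyze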
{
  "tokenizer": "whitespace",
  "text": "I have, potatoes."
}
Output:
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "have,",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "potatoes.",
      "start_offset" : 8,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    }
  ]
}

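Pattern tokenizer (pattern)

The pattern tokenizer lets you specify an arbitrary pattern for splitting the text into tokens. The pattern should match the separators between tokens, not the tokens themselves.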

PUT test_index5
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST test_index5/_analyze
{

  "analyzer": "my_analyzer",
  "text": "comma,separated,values"

}
Output:
{
  "tokens" : [
    {
      "token" : "comma",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "separated",
      "start_offset" : 6,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "values",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "word",
      "position" : 2
    }
  ]
}
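
Because the pattern describes the separators, the behavior resembles a regular-expression split. The following Python sketch illustrates the idea. The function name pattern_tokenize is hypothetical, the default of \W+ follows the documented default of the pattern tokenizer, and dropping empty pieces is an assumption of this sketch.

```python
import re
from typing import List


def pattern_tokenize(text: str, pattern: str = r"\W+") -> List[str]:
    """Split text wherever the separator pattern matches."""
    # Whatever lies between two separator matches becomes a token;
    # empty pieces (for example after a trailing separator) are dropped.
    return [piece for piece in re.split(pattern, text) if piece]


print(pattern_tokenize("comma,separated,values", ","))
# prints: ['comma', 'separated', 'values']
print(pattern_tokenize("I have, potatoes."))
# prints: ['I', 'have', 'potatoes']
```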

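UAX URL email tokenizer (uax_url_email)

The standard tokenizer is a very good choice for English words. Nowadays, however, a lot of text contains website addresses and email addresses, and the standard tokenizer may split them in places you do not expect. For example, analyzing the sample email address john.smith@example.com with the standard tokenizer produces the tokens john.smith and example.com.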

POST _analyze 
{
  "tokenizer": "standard",
  "text": "john.smith@example.com"
}
Output:
{
  "tokens" : [
    {
      "token" : "john.smith",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example.com",
      "start_offset" : 11,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

POST _analyze 
{
  "tokenizer": "uax_url_email",
  "text": "john.smith@example.com"
}
Output:
{
  "tokens" : [
    {
      "token" : "john.smith@example.com",
      "start_offset" : 0,
      "end_offset" : 22,
      "type" : "<EMAIL>",
      "position" : 0
    }
  ]
}

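The standard tokenizer likewise splits a URL into separate parts: http://example.com?q=foo becomes http, example.com, q, and foo. The UAX URL email tokenizer keeps both email addresses and URLs as single tokens.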

POST _analyze 
{
  "tokenizer": "standard",
  "text": "http://example.com?q=foo"
}
Output:
{
  "tokens" : [
    {
      "token" : "http",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example.com",
      "start_offset" : 7,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "q",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "foo",
      "start_offset" : 21,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

POST _analyze 
{
  "tokenizer": "uax_url_email",
  "text": "http://example.com?q=foo"
}
Output:
{
  "tokens" : [
    {
      "token" : "http://example.com?q=foo",
      "start_offset" : 0,
      "end_offset" : 24,
      "type" : "<URL>",
      "position" : 0
    }
  ]
}

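Token filters

Standard token filter (standard)

Do not assume the standard token filter performs any complicated computation: it actually does nothing at all. It was removed in Elasticsearch 7.0.

Lowercase token filter (lowercase)

The lowercase token filter does just one thing: it converts every token passing through it to lowercase.

Length token filter (length)

The length token filter removes tokens whose length falls outside the minimum and maximum limits. For example, with min set to 2 and max set to 8, any token shorter than 2 characters or longer than 8 characters is removed.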
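
The min and max rule described above can be sketched in Python as follows. This is an illustration of the rule only; the function name length_filter and its parameter names are hypothetical, and both bounds are treated as inclusive, matching the example of min 2 and max 8.

```python
from typing import List


def length_filter(tokens: List[str], min_len: int = 0, max_len: int = 2**31 - 1) -> List[str]:
    """Keep only tokens whose length lies within [min_len, max_len]."""
    return [token for token in tokens if min_len <= len(token) <= max_len]


print(length_filter(["a", "is", "potatoes", "cucumbers"], min_len=2, max_len=8))
# prints: ['is', 'potatoes']
```

Here "a" (1 character) and "cucumbers" (9 characters) fall outside the range, while "potatoes" has exactly 8 characters and is kept.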

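Stop token filter (stop)

The stop token filter removes stop words from the token stream. By default it uses the English stop word list, so every token on that list is removed entirely. You can also give the filter your own list of words to remove.

What are stop words?

In information retrieval, certain characters or words are automatically filtered out before or after processing natural-language data (or text) in order to save storage space and improve search efficiency. These characters or words are called stop words.

Stop words fall roughly into two categories:

Words that are used extremely widely, even excessively. English words such as "i", "is", and "what", or Chinese words such as "我" (I) and "就" (just), appear in almost every document. A search engine cannot guarantee truly relevant results for queries on such words: they do little to narrow the search or improve precision, and they also reduce search efficiency. In practice, search engines such as Google and Baidu therefore ignore certain common words. A query that contains too many stop words may likewise fail to return precise results and may even return a large number of unrelated ones.
Words that occur very frequently in text but carry little meaning of their own. This category mainly covers modal particles, adverbs, prepositions, conjunctions, and the like, which usually have no clear meaning by themselves and only play a role inside a complete sentence. Common Chinese examples are "的" (of), "在" (at), "和" (and), and "接着" (then).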

The default English stop word list is: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.
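
A stop filter amounts to a membership test against this list. The Python sketch below is illustrative only: the names ENGLISH_STOP_WORDS and stop_filter are hypothetical, and returning each surviving token together with its original position is a choice made here to show that removed tokens leave gaps in the position sequence.

```python
from typing import Iterable, List, Set, Tuple

# The default English stop word list given above.
ENGLISH_STOP_WORDS: Set[str] = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def stop_filter(tokens: Iterable[str],
                stopwords: Set[str] = ENGLISH_STOP_WORDS) -> List[Tuple[int, str]]:
    """Drop stop words, keeping each surviving token's original position."""
    return [(position, token)
            for position, token in enumerate(tokens)
            if token not in stopwords]


print(stop_filter(["the", "quick", "fox", "is", "in", "a", "box"]))
# prints: [(1, 'quick'), (2, 'fox'), (6, 'box')]
print(stop_filter(["i", "and", "666", "and", "888"], {"666", "888"}))
# prints: [(0, 'i'), (1, 'and'), (3, 'and')]
```

The second call passes a custom list, so "and" survives even though it is on the default English list.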

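Truncate, trim, and limit token count filters

The following three token filters restrict the token stream in some way:

Truncate token filter (truncate): truncates tokens that exceed a given length, set through the length parameter of a custom configuration. By default, everything beyond the first 10 characters is cut off.
Trim token filter (trim): removes the whitespace surrounding a token. For example, the token " foo " becomes foo.
Limit token count token filter (limit): limits the maximum number of tokens a field can contain. For example, if you create a custom limit filter with a limit of 8, only the first 8 tokens of the stream are indexed. The limit is set with the max_token_count parameter, which defaults to 1 (only one token is indexed).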
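
The three filters above can be sketched in Python as simple list operations. The function names are hypothetical and the code only mirrors the behavior described in the list, with the defaults taken from it (length 10, max_token_count 1).

```python
from typing import List


def truncate_filter(tokens: List[str], length: int = 10) -> List[str]:
    """Cut every token down to at most `length` characters."""
    return [token[:length] for token in tokens]


def trim_filter(tokens: List[str]) -> List[str]:
    """Remove leading and trailing whitespace from every token."""
    return [token.strip() for token in tokens]


def limit_filter(tokens: List[str], max_token_count: int = 1) -> List[str]:
    """Keep only the first `max_token_count` tokens of the stream."""
    return tokens[:max_token_count]


print(truncate_filter(["internationalization", "short"]))  # prints: ['internatio', 'short']
print(trim_filter([" foo "]))                              # prints: ['foo']
print(limit_filter(["one", "two", "three"], 2))            # prints: ['one', 'two']
```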
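
Commonly used built-in analyzers

Standard analyzer

When no analyzer is specified, the standard analyzer is the default analyzer for text. It combines sensible default components for most European languages: it has no character filters and consists of the standard tokenizer, the lowercase token filter, and the stop token filter (whose stop word list defaults to _none_, meaning no stop words are removed). The thing to remember is that a field with no analyzer specified uses the standard analyzer.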

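The configurable parameters are:

max_token_length: the maximum token length, 255 by default. A token longer than this is split into multiple tokens at max_token_length intervals.
stopwords: the stop words used by the analyzer, _none_ by default. It accepts an array of stop words or a predefined list such as _english_.
stopwords_path: the path to a file containing stop words.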
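
The effect of max_token_length can be illustrated with a short Python sketch. The function name split_by_max_length is hypothetical and the code only models the splitting rule described above, not the tokenizer itself.

```python
from typing import List


def split_by_max_length(token: str, max_token_length: int = 255) -> List[str]:
    """Split an over-long token into pieces of at most max_token_length characters."""
    if max_token_length <= 0:
        raise ValueError("max_token_length must be positive")
    return [token[start:start + max_token_length]
            for start in range(0, len(token), max_token_length)]


print(split_by_max_length("jumped", 5))  # prints: ['jumpe', 'd']
print(split_by_max_length("fox", 5))     # prints: ['fox']
```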
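
Simple analyzer

The simple analyzer is as simple as it sounds. It uses only the lowercase tokenizer, which means it splits text at non-letter characters and automatically lowercases the tokens. It does not work well for Asian languages, which do not separate words with whitespace, so use it only for European languages.

Whitespace analyzer

The whitespace analyzer does nothing except split the text into tokens at whitespace.

Stop analyzer

The stop analyzer behaves much like the simple analyzer, except that it also filters stop words out of the token stream.

Keyword analyzer

The keyword analyzer treats the entire field as a single token.

Pattern analyzer

The pattern analyzer lets you specify the pattern at which tokens are split. Since you will probably have to specify a pattern anyway, it usually makes more sense to use a custom analyzer that combines the existing pattern tokenizer with whatever token filters you need.

Snowball analyzer

The snowball analyzer uses the standard tokenizer and the standard token filter (like the standard analyzer), plus the lowercase token filter and the stop token filter. It also stems the text with the Snowball stemmer.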

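Custom analyzer

The requirements are as follows:

Strip all HTML tags.
Replace & with and, using a custom mapping character filter.
Split the text into words with the standard tokenizer.
Convert the tokens to lowercase with the lowercase token filter.
Remove a custom set of stop words with the stop token filter.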

PUT pattern_custom
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ],
          "type": "custom"
        }
      },
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=>and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "stopwords": [
            "666",
            "888"
          ],
          "type": "stop"
        }
      }
    }
  }
}

POST pattern_custom/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I & Morris & 666 & 888 are handsome"
}
Output:
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "and",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "morris",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "and",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "and",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "are",
      "start_offset" : 23,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "handsome",
      "start_offset" : 27,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}

Chinese analyzers

The analyzers above are mostly designed for English and do not handle Chinese well. For example:

POST _analyze
{
  "analyzer": "standard",
  "text": "中华人民共和国国歌"
}

The result of the analysis is:

{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "华",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "人",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "民",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "共",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "和",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "国",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "歌",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    }
  ]
}

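The standard analyzer splits a Chinese sentence into individual characters, which is not very suitable. This is where a Chinese analyzer is needed. There are many of them, such as cjk and ik; we use the well-known ik as our Chinese analyzer.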

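Installation

From the Elasticsearch installation directory, run:

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.14.1/elasticsearch-analysis-ik-7.14.1.zip

After the installation completes, you must restart Elasticsearch.

Note that the ik version must match the Elasticsearch version.

Alternatively, download the archive manually, place it in the plugins directory, and unzip it there.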

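Usage

The IK analyzer offers two analysis modes:

ik_max_word: maximum segmentation. It splits the text at the finest granularity and exhausts every possible combination.
ik_smart: minimum segmentation. It splits the text at the coarsest granularity.

POST _analyze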
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
Output:
{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中华人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "华人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民共和国",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共和国",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "共和",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "国歌",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 9
    }
  ]
}

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
Output:
{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "国歌",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
