ES splits a document's data into keywords that each carry a complete meaning and maps every keyword back to its documents, so documents can be found by keyword lookup. Correct segmentation depends on choosing a suitable analyzer.
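The keyword-to-document mapping described above is an inverted index. A minimal sketch in Python (illustrative only; this is a simplified model, not Elasticsearch's internal data structure):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: [token, ...]} -> {token: sorted list of doc ids}"""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Two toy "documents", already tokenized:
docs = {1: ["iphone13", "is", "the", "better"], 2: ["iphone13", "review"]}
index = build_inverted_index(docs)
# index["iphone13"] -> [1, 2]  (a keyword lookup finds both documents)
```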
1. Default analyzer (standard analyzer): Elasticsearch's built-in default. It splits English text on whitespace and punctuation and lowercases every token.
The standard analyzer is designed for English; on Chinese text it emits one token per character.
To inspect the tokenization result:
GET /_analyze
{
"text": "<text to analyze>",
"analyzer": "<analyzer name>"
}
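The same request can be issued from any HTTP client. The Python helper below only builds the JSON body; the commented-out part sketches one hypothetical way to send it, assuming a cluster listening on localhost:9200:

```python
import json

def analyze_body(text, analyzer):
    # Build the request body used by the _analyze endpoint.
    return json.dumps({"text": text, "analyzer": analyzer}, ensure_ascii=False)

body = analyze_body("iphone13 is the better", "standard")
# To actually send it (requires a reachable cluster; the host is an assumption):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:9200/_analyze",
#       data=body.encode("utf-8"),
#       headers={"Content-Type": "application/json"},
#   )
#   tokens = json.load(urllib.request.urlopen(req))["tokens"]
```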
1. Tokenizing English
GET /_analyze
{
"text":"iphone13 is the better",
"analyzer": "standard"
}
Tokenization result:
{
"tokens" : [
{
"token" : "iphone13",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 9,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "the",
"start_offset" : 12,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "better",
"start_offset" : 16,
"end_offset" : 22,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
2. Tokenizing Chinese
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "standard"
}
Tokenization result:
{
"tokens" : [
{
"token" : "科",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "比",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "伟",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "大",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 6
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 7
},
{
"token" : "运",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 8
},
{
"token" : "动",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<IDEOGRAPHIC>",
"position" : 9
},
{
"token" : "员",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<IDEOGRAPHIC>",
"position" : 10
}
]
}
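The one-token-per-character behavior for Chinese can be mimicked in a few lines of Python (a simplified model, not the standard tokenizer's actual implementation):

```python
def cjk_char_tokens(text):
    # Keep only CJK Unified Ideographs and emit each one as its own token,
    # mirroring how the standard analyzer splits Chinese text.
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]

tokens = cjk_char_tokens("科比是NBA最伟大的运动员")
# -> ['科', '比', '是', '最', '伟', '大', '的', '运', '动', '员']
```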
2. IK analyzer
- Concept:
IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. It provides two segmentation algorithms:
1. ik_smart: coarse-grained segmentation (fewest cuts)
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "ik_smart"
}
The result contains 8 tokens ("科比" is not yet in the dictionary, so it is split into single characters):
{
"tokens" : [
{
"token" : "科",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "比",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "伟大",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 6
},
{
"token" : "运动员",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 7
}
]
}
2. ik_max_word: finest-grained segmentation
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "ik_max_word"
}
The result contains 10 tokens; compared with ik_smart, "运动员" is additionally split into "运动" and "动员":
{
"tokens" : [
{
"token" : "科",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "比",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "伟大",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 6
},
{
"token" : "运动员",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "运动",
"start_offset" : 10,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "动员",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 9
}
]
}
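The difference between the two modes can be sketched with a toy dictionary: a "smart" pass keeps one non-overlapping segmentation via greedy longest match, while a "max word" pass also emits every shorter dictionary word it finds. This is a simplified model of IK's behavior, not its actual algorithm:

```python
DICT = {"伟大", "运动员", "运动", "动员"}  # toy dictionary for this example

def smart_tokens(text):
    """Greedy longest match, left to right; unmatched characters become tokens."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i + 1, -1):  # try the longest substring first
            if text[i:j] in DICT:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # no dictionary word starts here
            i += 1
    return tokens

def max_word_tokens(text):
    """Every dictionary word (length >= 2) found anywhere in the text, longest first per position."""
    return [text[i:j]
            for i in range(len(text))
            for j in range(len(text), i + 1, -1)
            if text[i:j] in DICT]

print(smart_tokens("最伟大的运动员"))     # ['最', '伟大', '的', '运动员']
print(max_word_tokens("最伟大的运动员"))  # ['伟大', '运动员', '运动', '动员']
```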
- Installation
Unzip elasticsearch-analysis-ik and copy the extracted folder into Elasticsearch's plugins directory.
[root@node0 plugins]# ls elasticsearch-analysis-ik-7.12.1
Note: the IK analyzer version must match the ES version.
Restart ES.
- Dictionaries
[root@node0 config]# ll
total 8268
-rw-r--r-- 1 root root       7 Nov 29 17:57 ext_dict.dic
-rw-r--r-- 1 root root 5225922 Apr 25  2021 extra_main.dic
-rw-r--r-- 1 root root   63188 Apr 25  2021 extra_single_word.dic
-rw-r--r-- 1 root root   63188 Apr 25  2021 extra_single_word_full.dic
-rw-r--r-- 1 root root   10855 Apr 25  2021 extra_single_word_low_freq.dic
-rw-r--r-- 1 root root     156 Apr 25  2021 extra_stopword.dic
-rw-r--r-- 1 root root       7 Nov 29 18:07 ext_stopwords.dic
-rw-r--r-- 1 root root     654 Nov 29 17:56 IKAnalyzer.cfg.xml
-rw-r--r-- 1 root root 3058510 Apr 25  2021 main.dic
-rw-r--r-- 1 root root     123 Apr 25  2021 preposition.dic
-rw-r--r-- 1 root root    1824 Apr 25  2021 quantifier.dic
-rw-r--r-- 1 root root     164 Apr 25  2021 stopword.dic
-rw-r--r-- 1 root root     192 Apr 25  2021 suffix.dic
-rw-r--r-- 1 root root     752 Apr 25  2021 surname.dic
Contents of IKAnalyzer.cfg.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer 扩展配置</comment>
<entry key="ext_dict">ext_dict.dic</entry>
<entry key="ext_stopwords">ext_stopwords.dic</entry>
</properties>
The IK analyzer segments text according to its dictionaries, which live in the config directory of the IK plugin.
1. main.dic: IK's built-in dictionary, containing all the Chinese words IK ships with.
2. IKAnalyzer.cfg.xml: configures custom dictionaries.
- ext_dict: a custom extension dictionary that supplements main.dic.
- ext_stopwords: a custom stopword dictionary.
All of IK's .dic dictionary files must be saved as UTF-8. Avoid editing them with Windows Notepad, which on Chinese Windows defaults to the GBK character set.
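One way to sidestep editor encoding pitfalls is to write the dictionary file programmatically. The file name below mirrors ext_dict.dic, but a temp path is used so the sketch is runnable anywhere; the real file lives in the IK plugin's config directory:

```python
import os
import tempfile

# Write the custom words one per line, explicitly as UTF-8.
dic_path = os.path.join(tempfile.gettempdir(), "ext_dict.dic")
with open(dic_path, "w", encoding="utf-8") as f:
    f.write("科比\n")

# Read it back to confirm the UTF-8 round trip.
with open(dic_path, encoding="utf-8") as f:
    loaded = f.read().splitlines()
# loaded -> ['科比']
```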
- Testing the analyzer
Add "科比" to the ext_dict.dic file:
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "ik_smart"
}
The result now contains "科比" as a single token:
{
"tokens" : [
{
"token" : "科比",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "ENGLISH",
"position" : 2
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "伟大",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "运动员",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 6
}
]
}
Add "科比" to the ext_stopwords.dic file and test the tokenization again:
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "ik_smart"
}
In the result, "科比" has been filtered out as a stopword:
{
"tokens" : [
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "ENGLISH",
"position" : 1
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "伟大",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "运动员",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 5
}
]
}
3. Pinyin analyzer
- Concept
The pinyin analyzer converts Chinese text into its full pinyin, first-letter abbreviations, and similar forms.
- Installation
Unzip elasticsearch-analysis-pinyin and copy the extracted folder into Elasticsearch's plugins directory.
[root@node0 plugins]# ll
total 8
drwxr-xr-x 3 root root 4096 Nov 29 17:38 elasticsearch-analysis-ik-7.12.1
drwxr-xr-x 2 root root 4096 Nov 29 18:36 elasticsearch-analysis-pinyin-7.12.1
Note: the pinyin analyzer version must match the ES version.
Restart ES.
- Testing the tokenization
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "pinyin"
}
The tokens are pinyin:
{
"tokens" : [
{
"token" : "ke",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "kbsnbazwddydy",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "bi",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
},
{
"token" : "shi",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 2
},
{
"token" : "n",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 3
},
{
"token" : "ba",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 4
},
{
"token" : "zui",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 5
},
{
"token" : "wei",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 6
},
{
"token" : "da",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 7
},
{
"token" : "de",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 8
},
{
"token" : "yun",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 9
},
{
"token" : "dong",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 10
},
{
"token" : "yuan",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 11
}
]
}
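What the plugin does can be approximated with a hand-written mapping table (illustrative only: the table below covers just this one sentence, whereas the real plugin ships a full pinyin dictionary and also handles non-CJK text such as "NBA"):

```python
# Tiny hand-written character-to-pinyin table for this example sentence.
PINYIN = {"科": "ke", "比": "bi", "是": "shi", "最": "zui", "伟": "wei",
          "大": "da", "的": "de", "运": "yun", "动": "dong", "员": "yuan"}

def pinyin_tokens(text):
    # Full-pinyin tokens, one per character found in the table...
    full = [PINYIN[ch] for ch in text if ch in PINYIN]
    # ...plus a first-letter abbreviation token, like the plugin's
    # "kbsnbazwddydy"-style output above.
    abbrev = "".join(p[0] for p in full)
    return full, abbrev

full, abbrev = pinyin_tokens("科比是最伟大的运动员")
# full   -> ['ke', 'bi', 'shi', 'zui', 'wei', 'da', 'de', 'yun', 'dong', 'yuan']
# abbrev -> 'kbszwddydy'
```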



