ES splits a document's data into keywords that each carry a complete meaning and maps every keyword back to its documents, so documents can be found by keyword lookup. Correct segmentation depends on choosing a suitable analyzer.
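The keyword-to-document mapping described above is an inverted index. A minimal sketch in Python (illustrative only; this is a simplified model, not Elasticsearch's internal data structure):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: [token, ...]} -> {token: sorted list of doc ids}"""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Two toy "documents", already tokenized:
docs = {1: ["iphone13", "is", "the", "better"], 2: ["iphone13", "review"]}
index = build_inverted_index(docs)
# index["iphone13"] -> [1, 2]  (a keyword lookup finds both documents)
```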
1. Default analyzer (standard analyzer): Elasticsearch's built-in default. It splits English text on whitespace and punctuation and lowercases every token.
The standard analyzer is designed for English; on Chinese text it emits one token per character.
To inspect the tokenization result:
GET /_analyze
{
"text": "<text to analyze>",
"analyzer": "<analyzer name>"
}
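The same request can be issued from any HTTP client. The Python helper below only builds the JSON body; the commented-out part sketches one hypothetical way to send it, assuming a cluster listening on localhost:9200:

```python
import json

def analyze_body(text, analyzer):
    # Build the request body used by the _analyze endpoint.
    return json.dumps({"text": text, "analyzer": analyzer}, ensure_ascii=False)

body = analyze_body("iphone13 is the better", "standard")
# To actually send it (requires a reachable cluster; the host is an assumption):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:9200/_analyze",
#       data=body.encode("utf-8"),
#       headers={"Content-Type": "application/json"},
#   )
#   tokens = json.load(urllib.request.urlopen(req))["tokens"]
```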
1. Tokenizing English
GET /_analyze
{
"text":"iphone13 is the better",
"analyzer": "standard"
}
Tokenization result:
{
"tokens" : [
{
"token" : "iphone13",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 9,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "the",
"start_offset" : 12,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "better",
"start_offset" : 16,
"end_offset" : 22,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
2. Tokenizing Chinese
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "standard"
}
Tokenization result:
{
"tokens" : [
{
"token" : "科",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "比",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "伟",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "大",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 6
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 7
},
{
"token" : "运",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 8
},
{
"token" : "动",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<IDEOGRAPHIC>",
"position" : 9
},
{
"token" : "员",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<IDEOGRAPHIC>",
"position" : 10
}
]
}
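The one-token-per-character behavior for Chinese can be mimicked in a few lines of Python (a simplified model, not the standard tokenizer's actual implementation):

```python
def cjk_char_tokens(text):
    # Keep only CJK Unified Ideographs and emit each one as its own token,
    # mirroring how the standard analyzer splits Chinese text.
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]

tokens = cjk_char_tokens("科比是NBA最伟大的运动员")
# -> ['科', '比', '是', '最', '伟', '大', '的', '运', '动', '员']
```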
2. IK analyzer
- Concept:
IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. It provides two segmentation algorithms:
1. ik_smart: coarse-grained segmentation (fewest cuts)
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "ik_smart"
}
The result contains 8 tokens ("科比" is not yet in the dictionary, so it is split into single characters):
{
"tokens" : [
{
"token" : "科",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "比",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "伟大",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 6
},
{
"token" : "运动员",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 7
}
]
}
2. ik_max_word: finest-grained segmentation
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "ik_max_word"
}
The result contains 10 tokens; compared with ik_smart, "运动员" is additionally split into "运动" and "动员":
{
"tokens" : [
{
"token" : "科",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "比",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "伟大",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 6
},
{
"token" : "运动员",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "运动",
"start_offset" : 10,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "动员",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 9
}
]
}
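The difference between the two modes can be sketched with a toy dictionary: a "smart" pass keeps one non-overlapping segmentation via greedy longest match, while a "max word" pass also emits every shorter dictionary word it finds. This is a simplified model of IK's behavior, not its actual algorithm:

```python
DICT = {"伟大", "运动员", "运动", "动员"}  # toy dictionary for this example

def smart_tokens(text):
    """Greedy longest match, left to right; unmatched characters become tokens."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i + 1, -1):  # try the longest substring first
            if text[i:j] in DICT:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # no dictionary word starts here
            i += 1
    return tokens

def max_word_tokens(text):
    """Every dictionary word (length >= 2) found anywhere in the text, longest first per position."""
    return [text[i:j]
            for i in range(len(text))
            for j in range(len(text), i + 1, -1)
            if text[i:j] in DICT]

print(smart_tokens("最伟大的运动员"))     # ['最', '伟大', '的', '运动员']
print(max_word_tokens("最伟大的运动员"))  # ['伟大', '运动员', '运动', '动员']
```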
- Installation
Unzip elasticsearch-analysis-ik and copy the extracted folder into Elasticsearch's plugins directory.
[root@node0 plugins]# ls elasticsearch-analysis-ik-7.12.1
Note: the IK analyzer version must match the ES version.
Restart ES.
- Dictionaries
[root@node0 config]# ll
total 8268
-rw-r--r-- 1 root root       7 Nov 29 17:57 ext_dict.dic
-rw-r--r-- 1 root root 5225922 Apr 25  2021 extra_main.dic
-rw-r--r-- 1 root root   63188 Apr 25  2021 extra_single_word.dic
-rw-r--r-- 1 root root   63188 Apr 25  2021 extra_single_word_full.dic
-rw-r--r-- 1 root root   10855 Apr 25  2021 extra_single_word_low_freq.dic
-rw-r--r-- 1 root root     156 Apr 25  2021 extra_stopword.dic
-rw-r--r-- 1 root root       7 Nov 29 18:07 ext_stopwords.dic
-rw-r--r-- 1 root root     654 Nov 29 17:56 IKAnalyzer.cfg.xml
-rw-r--r-- 1 root root 3058510 Apr 25  2021 main.dic
-rw-r--r-- 1 root root     123 Apr 25  2021 preposition.dic
-rw-r--r-- 1 root root    1824 Apr 25  2021 quantifier.dic
-rw-r--r-- 1 root root     164 Apr 25  2021 stopword.dic
-rw-r--r-- 1 root root     192 Apr 25  2021 suffix.dic
-rw-r--r-- 1 root root     752 Apr 25  2021 surname.dic
Contents of IKAnalyzer.cfg.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer 扩展配置</comment>
<entry key="ext_dict">ext_dict.dic</entry>
<entry key="ext_stopwords">ext_stopwords.dic</entry>
</properties>
The IK analyzer segments text according to its dictionaries, which live in the config directory of the IK plugin.
1. main.dic: IK's built-in dictionary, containing all the Chinese words IK ships with.
2. IKAnalyzer.cfg.xml: configures custom dictionaries.
- ext_dict: a custom extension dictionary that supplements main.dic.
- ext_stopwords: a custom stopword dictionary.
All of IK's .dic dictionary files must be saved as UTF-8. Avoid editing them with Windows Notepad, which on Chinese Windows defaults to the GBK character set.
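One way to sidestep editor encoding pitfalls is to write the dictionary file programmatically. The file name below mirrors ext_dict.dic, but a temp path is used so the sketch is runnable anywhere; the real file lives in the IK plugin's config directory:

```python
import os
import tempfile

# Write the custom words one per line, explicitly as UTF-8.
dic_path = os.path.join(tempfile.gettempdir(), "ext_dict.dic")
with open(dic_path, "w", encoding="utf-8") as f:
    f.write("科比\n")

# Read it back to confirm the UTF-8 round trip.
with open(dic_path, encoding="utf-8") as f:
    loaded = f.read().splitlines()
# loaded -> ['科比']
```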
- Testing the analyzer
Add "科比" to the ext_dict.dic file:
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "ik_smart"
}
The result now contains "科比" as a single token:
{
"tokens" : [
{
"token" : "科比",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "ENGLISH",
"position" : 2
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "伟大",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "运动员",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 6
}
]
}
Add "科比" to the ext_stopwords.dic file and test the tokenization again:
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "ik_smart"
}
In the result, "科比" has been filtered out as a stopword:
{
"tokens" : [
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "nba",
"start_offset" : 3,
"end_offset" : 6,
"type" : "ENGLISH",
"position" : 1
},
{
"token" : "最",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "伟大",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "的",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "运动员",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 5
}
]
}
3. Pinyin analyzer
- Concept
The pinyin analyzer converts Chinese text into its full pinyin, first-letter abbreviations, and similar forms.
- Installation
Unzip elasticsearch-analysis-pinyin and copy the extracted folder into Elasticsearch's plugins directory.
[root@node0 plugins]# ll
total 8
drwxr-xr-x 3 root root 4096 Nov 29 17:38 elasticsearch-analysis-ik-7.12.1
drwxr-xr-x 2 root root 4096 Nov 29 18:36 elasticsearch-analysis-pinyin-7.12.1
Note: the pinyin analyzer version must match the ES version.
Restart ES.
- Testing the tokenization
GET /_analyze
{
"text":"科比是NBA最伟大的运动员",
"analyzer": "pinyin"
}
The tokens are pinyin:
{
"tokens" : [
{
"token" : "ke",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "kbsnbazwddydy",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "bi",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
},
{
"token" : "shi",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 2
},
{
"token" : "n",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 3
},
{
"token" : "ba",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 4
},
{
"token" : "zui",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 5
},
{
"token" : "wei",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 6
},
{
"token" : "da",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 7
},
{
"token" : "de",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 8
},
{
"token" : "yun",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 9
},
{
"token" : "dong",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 10
},
{
"token" : "yuan",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 11
}
]
}
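What the plugin does can be approximated with a hand-written mapping table (illustrative only: the table below covers just this one sentence, whereas the real plugin ships a full pinyin dictionary and also handles non-CJK text such as "NBA"):

```python
# Tiny hand-written character-to-pinyin table for this example sentence.
PINYIN = {"科": "ke", "比": "bi", "是": "shi", "最": "zui", "伟": "wei",
          "大": "da", "的": "de", "运": "yun", "动": "dong", "员": "yuan"}

def pinyin_tokens(text):
    # Full-pinyin tokens, one per character found in the table...
    full = [PINYIN[ch] for ch in text if ch in PINYIN]
    # ...plus a first-letter abbreviation token, like the plugin's
    # "kbsnbazwddydy"-style output above.
    abbrev = "".join(p[0] for p in full)
    return full, abbrev

full, abbrev = pinyin_tokens("科比是最伟大的运动员")
# full   -> ['ke', 'bi', 'shi', 'zui', 'wei', 'da', 'de', 'yun', 'dong', 'yuan']
# abbrev -> 'kbszwddydy'
```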



