Dynamic mapping: a mapping is the data structure and related configuration that is built, automatically or manually, for the documents (_doc) in an index.
If you insert a few documents, ES automatically creates an index and a corresponding mapping for you; the mapping records each field's data type, how the field is analyzed, and other settings.
// Create a document
PUT localhost:9200/blog/_doc/1
{
"title":"内蒙古科右中旗:沃野千里织锦绣---修改操作",
"description":"内蒙古兴安盟科右中旗巴彦呼舒镇乌逊嘎查整洁的村容村貌。近年来,内蒙古自治区兴安盟科尔沁右翼中旗按照“产业兴旺、生态宜居、乡风文明、治理有效、生活富裕”的总要求,坚持科学规划、合理布...国际在线",
"publish_time":"2020-07-08"
}
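The way dynamic mapping guesses field types can be sketched in Python. This is a simplified illustration of the idea, not Elasticsearch's real implementation (the function name and the date pattern are assumptions for the sketch; real date detection supports configurable formats):

```python
import re

def infer_dynamic_type(value):
    """Roughly mimic how dynamic mapping guesses a field type
    from a JSON value (simplified sketch)."""
    if isinstance(value, bool):   # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):  # looks like a date
            return "date"
        return "text (with a .keyword sub-field)"
    return "object"

doc = {"title": "沃野千里织锦绣", "publish_time": "2020-07-08"}
print({field: infer_dynamic_type(v) for field, v in doc.items()})
# {'title': 'text (with a .keyword sub-field)', 'publish_time': 'date'}
```

This matches what the GET _mapping query below returns for the blog index: strings become text with a keyword sub-field, and date-shaped strings become date.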
Manually creating a mapping
After creating an index, it is best to define the mapping manually.
PUT localhost:9200/book/_mapping
{
"properties":{
"name":{
"type":"text"
},
"description":{
"type":"text",
"analyzer":"english",
"search_analyzer":"english"
},
"pic":{
"type":"text",
"index": false
},
"publish_time":{
"type":"date"
}
}
}
Querying a mapping
GET localhost:9200/blog/_mapping
{
"blog": {
"mappings": {
"properties": {
"description": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"id": {
"type": "long"
},
"publish_time": {
"type": "date"
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
Testing the mapping
Insert a document
PUT localhost:9200/book/_doc/1
{
"name":"Java核心技术",
"description":"本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic":"item.jd.com",
"publish_time":"2022-04-19"
}
Test query: localhost:9200/book/_search?q=name:java
GET localhost:9200/book/_search?q=name:java
{
"took": 1126,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "book",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "Java核心技术",
"description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic": "item.jd.com",
"publish_time": "2022-04-19"
}
}
]
}
}
Test query: localhost:9200/book/_search?q=description:java
GET localhost:9200/book/_search?q=description:java
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.45390707,
"hits": [
{
"_index": "book",
"_type": "_doc",
"_id": "1",
"_score": 0.45390707,
"_source": {
"name": "Java核心技术",
"description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic": "item.jd.com",
"publish_time": "2022-04-19"
}
}
]
}
}
Test query: localhost:9200/book/_search?q=pic:item.jd.com
GET localhost:9200/book/_search?q=pic:item.jd.com
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
The tests show that name and description both support full-text search, while pic cannot be used as a query condition.
Modifying a mapping
A mapping can only be defined manually when the index is created, and new field mappings can be added afterwards, but an existing field mapping cannot be updated: the existing documents have already been analyzed and stored according to that mapping, and there would be no way to reconcile them with the change.
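When an existing field's mapping really does need to change, the usual workaround is to create a new index with the desired mapping and copy the documents over with the _reindex API. A minimal Python sketch using the third-party requests library; the book_v2 index name and the pic type change are hypothetical, and the calls assume a cluster on localhost:9200:

```python
ES = "http://localhost:9200"

# Hypothetical corrected mapping for the new index (e.g. pic becomes keyword).
NEW_MAPPING = {"mappings": {"properties": {"pic": {"type": "keyword"}}}}

# _reindex copies every document from the source index into the destination.
REINDEX_BODY = {"source": {"index": "book"}, "dest": {"index": "book_v2"}}

def migrate():
    import requests  # third-party; only needed when actually running this
    requests.put(f"{ES}/book_v2", json=NEW_MAPPING).raise_for_status()
    requests.post(f"{ES}/_reindex", json=REINDEX_BODY).raise_for_status()

# migrate()  # uncomment against a live cluster
```

After reindexing, an index alias can be switched from the old index to the new one so that clients do not need to change the name they query.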
Adding a new field mapping
PUT localhost:9200/book/_mapping
{
"properties":{
"ISBN":{
"type":"text",
"fields":{
"raw":{
"type":"keyword"
}
}
}
}
}
Update the document
PUT localhost:9200/book/_doc/1
{
"name":"Java核心技术",
"description":"本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic":"item.jd.com",
"publish_time":"2022-04-19",
"ISBN":"12800420"
}
Search by ISBN
GET localhost:9200/book/_search?q=ISBN:12800420
{
"took": 949,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "book",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "Java核心技术",
"description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic": "item.jd.com",
"publish_time": "2022-04-19",
"ISBN": "12800420"
}
}
]
}
}
Introduction to analyzers and how to use them
What is an analyzer
An analyzer takes a string as input, splits it into individual words, or tokens (possibly discarding some characters such as punctuation), and outputs a token stream.
The interesting part is the algorithm used for word recognition. For example, the whitespace analyzer simply splits on whitespace characters (spaces, tabs, newlines, and so on) and assumes that each run of consecutive non-whitespace characters forms one token.
In short, an analyzer is a tool that breaks a piece of user-supplied text into words according to certain rules. Commonly used built-in analyzers:
standard analyzer, simple analyzer, whitespace analyzer, stop analyzer, language analyzer, pattern analyzer
standard analyzer
The standard analyzer is the default; it is used whenever no analyzer is specified.
POST http://127.0.0.1:9200/_analyze
{
"analyzer":"standard",
"text":"我是程序员"
}
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "程",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "序",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "员",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
}
]
}
simple analyzer
The simple analyzer splits text into terms whenever it encounters a character that is not a letter, and lowercases all terms.
POST http://127.0.0.1:9200/_analyze
{
"analyzer":"simple",
"text":"this is a book"
}
{
"tokens": [
{
"token": "this",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "word",
"position": 2
},
{
"token": "book",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 3
}
]
}
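The simple analyzer's behavior can be approximated in a few lines of Python. A simplified sketch (restricted to ASCII letters; the real analyzer handles Unicode letters):

```python
import re

def simple_analyze(text):
    """Mimic the simple analyzer: split on any non-letter character
    and lowercase every token (simplified, ASCII-only sketch)."""
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

print(simple_analyze("This is a BOOK-2022"))
# ['this', 'is', 'a', 'book']
```

Note that the digits are discarded entirely, since they are not letters.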
whitespace analyzer
The whitespace analyzer splits text into terms whenever it encounters a whitespace character.
POST http://127.0.0.1:9200/_analyze
{
"analyzer":"whitespace",
"text":"this is a book"
}
{
"tokens": [
{
"token": "this",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "word",
"position": 2
},
{
"token": "book",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 3
}
]
}
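In contrast to the simple analyzer, whitespace analysis is essentially just a split on whitespace: no lowercasing, no punctuation stripping. A one-function sketch:

```python
def whitespace_analyze(text):
    """Mimic the whitespace analyzer: split only on whitespace runs."""
    return text.split()

print(whitespace_analyze("This is a Book."))
# ['This', 'is', 'a', 'Book.']  (case and punctuation are preserved)
```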
stop analyzer
The stop analyzer is like the simple analyzer, except that it also removes stop words; by default it uses the English stop-word list.
stopwords is a predefined stop-word list, e.g. (the, a, an, this, of, at).
POST http://127.0.0.1:9200/_analyze
{
"analyzer":"stop",
"text":"this is a book"
}
{
"tokens": [
{
"token": "book",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 3
}
]
}
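The stop analyzer is simply the simple analyzer plus a stop-word filter, which is why only "book" survives above. A sketch using a tiny subset of the default English stop-word list:

```python
import re

# A small subset of the default _english_ stop-word list, for illustration.
STOPWORDS = {"a", "an", "the", "this", "is", "of", "at"}

def stop_analyze(text):
    """Simple-analyzer tokenization followed by stop-word removal."""
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]
    return [t for t in tokens if t not in STOPWORDS]

print(stop_analyze("this is a book"))
# ['book']
```

Note that in the real response above, "book" keeps position 3: Elasticsearch records position gaps where stop words were removed.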
Chinese analyzers
Installation: download from https://github.com/medcl/elasticsearch-analysis-ik/releases and unzip into es/plugins/ik.
Using the IK analyzer
- ik_max_word: performs the finest-grained segmentation. For example, it splits “中华人民共和国国歌” into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国, 国歌”, exhausting every possible combination.
- ik_smart: performs the coarsest-grained segmentation. For example, it splits “中华人民共和国国歌” into “中华人民共和国, 国歌”.
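The difference between the two granularities can be illustrated with a toy dictionary-based segmenter. This is not IK's real algorithm, just a sketch of the coarse-vs-fine idea over a hand-picked dictionary:

```python
# Toy dictionary containing the words IK knows for this example.
DICT = {"中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
        "人民", "共和国", "共和", "国歌"}

def coarse(text):
    """ik_smart-style: greedily take the longest dictionary word, no overlaps."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in DICT:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])             # fall back to a single character
            i += 1
    return out

def fine(text):
    """ik_max_word-style: emit every dictionary word found at any position."""
    return [text[i:j] for i in range(len(text))
            for j in range(i + 1, len(text) + 1) if text[i:j] in DICT]

print(coarse("中华人民共和国国歌"))  # ['中华人民共和国', '国歌']
print(fine("中华人民共和国国歌"))    # every overlapping dictionary word
```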
Configuration files

| File | Description |
| --- | --- |
| IKAnalyzer.cfg.xml | configures custom dictionaries |
| main.dic | IK's built-in Chinese dictionary, with over 270,000 entries; any word in it is kept together as a single token |
| preposition.dic | prepositions |
| quantifier.dic | measure words and units |
| suffix.dic | suffixes |
| surname.dic | Chinese surnames |
| stopword.dic | English stop words |
IKAnalyzer.cfg.xml (the standard layout of the file; the extension entries are empty by default):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<entry key="ext_dict"></entry>
	<entry key="ext_stopwords"></entry>
</properties>
POST localhost:9200/_analyze
{
"analyzer":"ik_smart",
"text":"中华人民共和国国歌"
}
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "国歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 1
}
]
}
POST localhost:9200/_analyze
{
"analyzer":"ik_max_word",
"text":"中华人民共和国国歌"
}
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "中华人民",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "中华",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
},
{
"token": "华人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "人民共和国",
"start_offset": 2,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "共和国",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 6
},
{
"token": "共和",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
},
{
"token": "国",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 8
},
{
"token": "国歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 9
}
]
}
Custom dictionary
POST localhost:9200/_analyze
{
"analyzer":"ik_smart",
"text":"魔兽世界"
}
// analysis result
{
"tokens": [
{
"token": "魔兽",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "世界",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
}
]
}
Create a mydic.dic file and add the entry “魔兽世界”.
Register the file in IKAnalyzer.cfg.xml, restart ES, and then test the analysis again:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<entry key="ext_dict">mydic.dic</entry>
</properties>
POST localhost:9200/_analyze
{
"analyzer":"ik_smart",
"text":"魔兽世界"
}
// analysis result
{
"tokens": [
{
"token": "魔兽世界",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
}
]
}



