ELK Advanced Search, Part 4: Mapping and Analyzers



Getting started with mappings

What is a mapping?
A mapping is the data structure and associated configuration built, automatically or manually, for the _doc documents of an index.

Dynamic mapping
Insert a few documents and ES automatically creates an index and a corresponding mapping for us. The mapping records each field's data type and settings such as how the field is analyzed.
// create a document
PUT  localhost:9200/blog/_doc/1
{
    "title":"内蒙古科右中旗:沃野千里织锦绣---修改操作",
    "description":"内蒙古兴安盟科右中旗巴彦呼舒镇乌逊嘎查整洁的村容村貌。近年来,内蒙古自治区兴安盟科尔沁右翼中旗按照“产业兴旺、生态宜居、乡风文明、治理有效、生活富裕”的总要求,坚持科学规划、合理布...国际在线",
    "publish_time":"2020-07-08"
}
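The type-guessing that dynamic mapping performs can be sketched in a few lines of Python. This is a simplified model of the behaviour, not ES's actual implementation; `infer_mapping` and `DATE_RE` are illustrative names:

```python
import re

# Rough model of dynamic mapping: strings matching the default date
# format become "date", other strings become "text", integers "long".
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def infer_mapping(doc):
    properties = {}
    for field, value in doc.items():
        if isinstance(value, bool):          # check bool before int
            properties[field] = {"type": "boolean"}
        elif isinstance(value, int):
            properties[field] = {"type": "long"}
        elif isinstance(value, float):
            properties[field] = {"type": "float"}
        elif isinstance(value, str) and DATE_RE.match(value):
            properties[field] = {"type": "date"}
        else:
            properties[field] = {"type": "text"}
    return {"properties": properties}

doc = {"title": "内蒙古科右中旗", "publish_time": "2020-07-08", "id": 1}
print(infer_mapping(doc))
# title -> text, publish_time -> date, id -> long
```

This mirrors what the GET _mapping response further below shows for the blog index: ES inferred text for title, date for publish_time, and long for id.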
Creating a mapping manually
After creating an index, it is best to define the mapping manually.
PUT localhost:9200/book/_mapping
{
    "properties":{
        "name":{
            "type":"text"
        },
        "description":{
            "type":"text",
            "analyzer":"english",
            "search_analyzer":"english"
        },
        "pic":{
            "type":"text",
            "index":"false"
        },
        "publish_time":{
            "type":"date"
        }
    }
}
Query the mapping
GET   localhost:9200/blog/_mapping
{
    "blog": {
        "mappings": {
            "properties": {
                "description": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "id": {
                    "type": "long"
                },
                "publish_time": {
                    "type": "date"
                },
                "title": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}
Testing the mapping

Insert a document

PUT localhost:9200/book/_doc/1
{
  "name":"Java核心技术",
  "description":"本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
  "pic":"item.jd.com",
  "publish_time":"2022-04-19"
}

Test query: localhost:9200/book/_search?q=name:java

GET localhost:9200/book/_search?q=name:java

{
    "took": 1126,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "book",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "name": "Java核心技术",
                    "description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
                    "pic": "item.jd.com",
                    "publish_time": "2022-04-19"
                }
            }
        ]
    }
}

Test query: localhost:9200/book/_search?q=description:java

GET localhost:9200/book/_search?q=description:java

{
    "took": 15,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.45390707,
        "hits": [
            {
                "_index": "book",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.45390707,
                "_source": {
                    "name": "Java核心技术",
                    "description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
                    "pic": "item.jd.com",
                    "publish_time": "2022-04-19"
                }
            }
        ]
    }
}

Test query: localhost:9200/book/_search?q=pic:item.jd.com

GET localhost:9200/book/_search?q=pic:item.jd.com

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}
The tests show that name and description both support full-text search, while pic cannot be used as a search condition: its mapping sets index to false, so the field is stored in _source but not indexed.
Modifying a mapping

A mapping can only be defined manually when the index is created, and new field mappings can be added afterwards, but an existing field mapping cannot be updated.
The reason is that existing documents have already been analyzed and stored according to the current mapping; changing it would leave that data inconsistent. To change a field's mapping, you must create a new index with the desired mapping and reindex the data into it.
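The rule can be illustrated with a small sketch: adding a field to a properties map succeeds, while changing an existing field's type is rejected. This is a toy model; `add_field` is an illustrative helper, not an ES API:

```python
# Toy model of ES's mapping-update rule: new fields may be added,
# existing field mappings may not be changed.
def add_field(mapping, field, field_type):
    props = mapping.setdefault("properties", {})
    if field in props and props[field]["type"] != field_type:
        # ES rejects this kind of request with an error response
        raise ValueError(f"mapper for [{field}] cannot be changed")
    props[field] = {"type": field_type}
    return mapping

mapping = {"properties": {"name": {"type": "text"}}}
add_field(mapping, "ISBN", "keyword")        # adding a new field: OK
try:
    add_field(mapping, "name", "keyword")    # changing an existing field: rejected
except ValueError as e:
    print(e)                                 # mapper for [name] cannot be changed
```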
Add a new field mapping

PUT localhost:9200/book/_mapping
{
    "properties":{
        "ISBN":{
            "type":"text",
            "fields":{
                "raw":{
                    "type":"keyword"
                }
            }
        }
    }
}

Update the document

PUT localhost:9200/book/_doc/1
{
  "name":"Java核心技术",
  "description":"本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
  "pic":"item.jd.com",
  "publish_time":"2022-04-19",
  "ISBN":"12800420"
}

Search by ISBN

GET localhost:9200/book/_search?q=ISBN:12800420

{
    "took": 949,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "book",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "name": "Java核心技术",
                    "description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
                    "pic": "item.jd.com",
                    "publish_time": "2022-04-19",
                    "ISBN": "12800420"
                }
            }
        ]
    }
}
Analyzers: introduction and usage

What is an analyzer?

An analyzer takes a string as input, splits it into individual words, or tokens (possibly discarding characters such as punctuation), and outputs a token stream.

The interesting part is the algorithm used to identify words. The whitespace tokenizer, for example, simply splits on whitespace characters (spaces, tabs, newlines, and so on) and assumes that each run of consecutive non-whitespace characters forms one token.

In short, an analyzer is a tool that splits a piece of user input into words according to some logic. Commonly used built-in analyzers:
standard analyzer, simple analyzer, whitespace analyzer, stop analyzer, language analyzer, pattern analyzer

standard analyzer

The standard analyzer is the default; it is used whenever no analyzer is specified.

POST http://127.0.0.1:9200/_analyze
{
 "analyzer":"standard",
 "text":"我是程序员"
}


{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "",
            "position": 1
        },
        {
            "token": "程",
            "start_offset": 2,
            "end_offset": 3,
            "type": "",
            "position": 2
        },
        {
            "token": "序",
            "start_offset": 3,
            "end_offset": 4,
            "type": "",
            "position": 3
        },
        {
            "token": "员",
            "start_offset": 4,
            "end_offset": 5,
            "type": "",
            "position": 4
        }
    ]
}
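The per-character output above comes from the standard analyzer having no Chinese dictionary: CJK text falls back to one token per character. The shape of the response can be reproduced with a few lines of Python (a simulation of the output, not the Lucene tokenizer itself):

```python
# Simulate the standard analyzer's fallback for CJK text:
# one token per character, with matching offsets and positions.
def standard_cjk(text):
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]

tokens = standard_cjk("我是程序员")
print([t["token"] for t in tokens])   # ['我', '是', '程', '序', '员']
```

This is exactly why a dedicated Chinese analyzer such as ik, covered below, is needed for useful Chinese search.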
simple analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and it lowercases every term.

POST http://127.0.0.1:9200/_analyze
{
  "analyzer":"simple",
  "text":"this is a book"
}

{
    "tokens": [
        {
            "token": "this",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 8,
            "end_offset": 9,
            "type": "word",
            "position": 2
        },
        {
            "token": "book",
            "start_offset": 10,
            "end_offset": 14,
            "type": "word",
            "position": 3
        }
    ]
}
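The behaviour above can be approximated as "split on anything that is not a letter, then lowercase". This is an ASCII-only sketch; the real tokenizer handles Unicode letters as well:

```python
import re

# Approximate the simple analyzer: runs of letters become terms,
# everything else (digits, punctuation, whitespace) is a separator.
def simple_analyze(text):
    return [
        (m.group().lower(), m.start(), m.end())
        for m in re.finditer(r"[A-Za-z]+", text)
    ]

print(simple_analyze("this is a book"))
# [('this', 0, 4), ('is', 5, 7), ('a', 8, 9), ('book', 10, 14)]
```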

whitespace analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

POST  http://127.0.0.1:9200/_analyze
{
  "analyzer":"whitespace",
  "text":"this is a book"
}

{
    "tokens": [
        {
            "token": "this",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 8,
            "end_offset": 9,
            "type": "word",
            "position": 2
        },
        {
            "token": "book",
            "start_offset": 10,
            "end_offset": 14,
            "type": "word",
            "position": 3
        }
    ]
}
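Unlike the simple analyzer, the whitespace analyzer keeps case and punctuation; only whitespace separates tokens. A minimal sketch, using an input that makes the difference visible:

```python
# Approximate the whitespace analyzer: split on runs of whitespace,
# leaving case and punctuation untouched.
def whitespace_analyze(text):
    return text.split()

print(whitespace_analyze("This is a BOOK-2024"))
# ['This', 'is', 'a', 'BOOK-2024']  (case and the hyphenated token survive)
```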

stop analyzer

The stop analyzer is like the simple analyzer, except that it also removes stop words; by default it uses the english stop word list.

stopwords is the predefined stop word list, e.g. (the, a, an, this, of, at) and so on.

POST  http://127.0.0.1:9200/_analyze

{
  "analyzer":"stop",
  "text":"this is a book"
}

{
    "tokens": [
        {
            "token": "book",
            "start_offset": 10,
            "end_offset": 14,
            "type": "word",
            "position": 3
        }
    ]
}
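The stop analyzer's behaviour is the simple analyzer's output minus stop words, which is why only "book" survives above. A sketch with a small stop word list (the real english list is longer):

```python
import re

# Tiny subset of the english stop word list, for illustration only.
STOPWORDS = {"the", "a", "an", "this", "of", "at", "is"}

# Approximate the stop analyzer: simple-analyzer tokenization,
# then drop any token found in the stop word list.
def stop_analyze(text):
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [t for t in tokens if t not in STOPWORDS]

print(stop_analyze("this is a book"))   # ['book']
```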
Chinese analyzer (ik)
Installation
Download it from https://github.com/medcl/elasticsearch-analysis-ik/releases and unzip it into es/plugins/ik (the plugin version must match your ES version).

Using the ik analyzer

  • ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国, 国歌”, exhausting every possible combination;
  • ik_smart: splits at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 国歌”.

Configuration files

File                 Description
IKAnalyzer.cfg.xml   configuration for custom dictionaries
main.dic             ik's built-in Chinese dictionary, over 270,000 entries; any word in it is kept together as a single token
preposition.dic      prepositions
quantifier.dic       measure words and unit-related terms
suffix.dic           suffixes
surname.dic          Chinese surnames
stopword.dic         English stop words

IKAnalyzer.cfg.xml (default content):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<!-- users can configure their own extension dictionaries here -->
	<entry key="ext_dict"></entry>
	<!-- users can configure their own extension stop word dictionaries here -->
	<entry key="ext_stopwords"></entry>
</properties>

POST localhost:9200/_analyze
{
    "analyzer":"ik_smart",
    "text":"中华人民共和国国歌"
}


{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}
POST localhost:9200/_analyze
{
    "analyzer":"ik_max_word",
    "text":"中华人民共和国国歌"
}

{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 8
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 9
        }
    ]
}
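Note that ik_max_word emits overlapping tokens, but every token's offsets still index back into the original string. That invariant can be checked directly against the response above:

```python
# Verify that each ik_max_word token matches the slice of the input
# named by its start_offset/end_offset, even though tokens overlap.
text = "中华人民共和国国歌"
tokens = [
    ("中华人民共和国", 0, 7), ("中华人民", 0, 4), ("中华", 0, 2),
    ("华人", 1, 3), ("人民共和国", 2, 7), ("人民", 2, 4),
    ("共和国", 4, 7), ("共和", 4, 6), ("国", 6, 7), ("国歌", 7, 9),
]
for token, start, end in tokens:
    assert text[start:end] == token
print("all offsets consistent")
```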
Custom dictionaries
POST localhost:9200/_analyze
{
    "analyzer":"ik_smart",
    "text":"魔兽世界"
}


// tokenization without the custom dictionary
{
    "tokens": [
        {
            "token": "魔兽",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "世界",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

Create a file mydic.dic and add the word 魔兽世界 to it.

Edit IKAnalyzer.cfg.xml as below, restart ES, and then test the analyzer again.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<entry key="ext_dict">mydic.dic</entry>
	<entry key="ext_stopwords"></entry>
</properties>

POST localhost:9200/_analyze
{
    "analyzer":"ik_smart",
    "text":"魔兽世界"
}

// tokenization with the custom dictionary
{
    "tokens": [
        {
            "token": "魔兽世界",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}
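The change in output can be modelled as greedy longest-match segmentation: once 魔兽世界 is in the dictionary, it wins over the shorter entries. This is a toy model of dictionary lookup, not the actual ik algorithm:

```python
# Greedy longest-match segmentation over a word dictionary.
# Unknown characters fall back to single-character tokens.
def segment(text, dictionary):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):     # try the longest span first
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

base = {"魔兽", "世界"}
print(segment("魔兽世界", base))                  # ['魔兽', '世界']
print(segment("魔兽世界", base | {"魔兽世界"}))   # ['魔兽世界']
```

The second call reproduces the effect of the mydic.dic entry: the longer dictionary word absorbs the two shorter ones.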
