ELK Advanced Search, Part 4: Mapping and Analyzers



Getting started with mappings

What is a mapping?
A mapping is the data structure and associated configuration built, automatically or manually, for the _doc documents of an index.

Dynamic mapping
Insert a few documents and ES automatically creates an index and a corresponding mapping for us. The mapping records each field's data type and settings such as how the field is analyzed.
// create a document
PUT  localhost:9200/blog/_doc/1
{
    "title":"内蒙古科右中旗:沃野千里织锦绣---修改操作",
    "description":"内蒙古兴安盟科右中旗巴彦呼舒镇乌逊嘎查整洁的村容村貌。近年来,内蒙古自治区兴安盟科尔沁右翼中旗按照“产业兴旺、生态宜居、乡风文明、治理有效、生活富裕”的总要求,坚持科学规划、合理布...国际在线",
    "publish_time":"2020-07-08"
}
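The type-guessing that dynamic mapping performs can be sketched in a few lines of Python. This is a simplified model of the behaviour, not ES's actual implementation; `infer_mapping` and `DATE_RE` are illustrative names:

```python
import re

# Rough model of dynamic mapping: strings matching the default date
# format become "date", other strings become "text", integers "long".
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def infer_mapping(doc):
    properties = {}
    for field, value in doc.items():
        if isinstance(value, bool):          # check bool before int
            properties[field] = {"type": "boolean"}
        elif isinstance(value, int):
            properties[field] = {"type": "long"}
        elif isinstance(value, float):
            properties[field] = {"type": "float"}
        elif isinstance(value, str) and DATE_RE.match(value):
            properties[field] = {"type": "date"}
        else:
            properties[field] = {"type": "text"}
    return {"properties": properties}

doc = {"title": "内蒙古科右中旗", "publish_time": "2020-07-08", "id": 1}
print(infer_mapping(doc))
# title -> text, publish_time -> date, id -> long
```

This mirrors what the GET _mapping response further below shows for the blog index: ES inferred text for title, date for publish_time, and long for id.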
Creating a mapping manually
After creating an index, it is best to define the mapping manually.
PUT localhost:9200/book/_mapping
{
    "properties":{
        "name":{
            "type":"text"
        },
        "description":{
            "type":"text",
            "analyzer":"english",
            "search_analyzer":"english"
        },
        "pic":{
            "type":"text",
            "index":"false"
        },
        "publish_time":{
            "type":"date"
        }
    }
}
Query the mapping
GET   localhost:9200/blog/_mapping
{
    "blog": {
        "mappings": {
            "properties": {
                "description": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "id": {
                    "type": "long"
                },
                "publish_time": {
                    "type": "date"
                },
                "title": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}
Testing the mapping

Insert a document

PUT localhost:9200/book/_doc/1
{
  "name":"Java核心技术",
  "description":"本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
  "pic":"item.jd.com",
  "publish_time":"2022-04-19"
}

Test query: localhost:9200/book/_search?q=name:java

GET localhost:9200/book/_search?q=name:java

{
    "took": 1126,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "book",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "name": "Java核心技术",
                    "description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
                    "pic": "item.jd.com",
                    "publish_time": "2022-04-19"
                }
            }
        ]
    }
}

Test query: localhost:9200/book/_search?q=description:java

GET localhost:9200/book/_search?q=description:java

{
    "took": 15,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.45390707,
        "hits": [
            {
                "_index": "book",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.45390707,
                "_source": {
                    "name": "Java核心技术",
                    "description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
                    "pic": "item.jd.com",
                    "publish_time": "2022-04-19"
                }
            }
        ]
    }
}

Test query: localhost:9200/book/_search?q=pic:item.jd.com

GET localhost:9200/book/_search?q=pic:item.jd.com

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}
The tests show that name and description both support full-text search, while pic cannot be used as a search condition: its mapping sets index to false, so the field is stored in _source but not indexed.
Modifying a mapping

A mapping can only be defined manually when the index is created, and new field mappings can be added afterwards, but an existing field mapping cannot be updated.
The reason is that existing documents have already been analyzed and stored according to the current mapping; changing it would leave that data inconsistent. To change a field's mapping, you must create a new index with the desired mapping and reindex the data into it.
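The rule can be illustrated with a small sketch: adding a field to a properties map succeeds, while changing an existing field's type is rejected. This is a toy model; `add_field` is an illustrative helper, not an ES API:

```python
# Toy model of ES's mapping-update rule: new fields may be added,
# existing field mappings may not be changed.
def add_field(mapping, field, field_type):
    props = mapping.setdefault("properties", {})
    if field in props and props[field]["type"] != field_type:
        # ES rejects this kind of request with an error response
        raise ValueError(f"mapper for [{field}] cannot be changed")
    props[field] = {"type": field_type}
    return mapping

mapping = {"properties": {"name": {"type": "text"}}}
add_field(mapping, "ISBN", "keyword")        # adding a new field: OK
try:
    add_field(mapping, "name", "keyword")    # changing an existing field: rejected
except ValueError as e:
    print(e)                                 # mapper for [name] cannot be changed
```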
Add a new field mapping

PUT localhost:9200/book/_mapping
{
    "properties":{
        "ISBN":{
            "type":"text",
            "fields":{
                "raw":{
                    "type":"keyword"
                }
            }
        }
    }
}

Update the document

PUT localhost:9200/book/_doc/1
{
  "name":"Java核心技术",
  "description":"本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
  "pic":"item.jd.com",
  "publish_time":"2022-04-19",
  "ISBN":"12800420"
}

Search by ISBN

GET localhost:9200/book/_search?q=ISBN:12800420

{
    "took": 949,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "book",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "name": "Java核心技术",
                    "description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
                    "pic": "item.jd.com",
                    "publish_time": "2022-04-19",
                    "ISBN": "12800420"
                }
            }
        ]
    }
}
Analyzers: introduction and usage

What is an analyzer?

An analyzer takes a string as input, splits it into individual words, or tokens (possibly discarding characters such as punctuation), and outputs a token stream.

The interesting part is the algorithm used to identify words. The whitespace tokenizer, for example, simply splits on whitespace characters (spaces, tabs, newlines, and so on) and assumes that each run of consecutive non-whitespace characters forms one token.

In short, an analyzer is a tool that splits a piece of user input into words according to some logic. Commonly used built-in analyzers:
standard analyzer, simple analyzer, whitespace analyzer, stop analyzer, language analyzer, pattern analyzer

standard analyzer

The standard analyzer is the default; it is used whenever no analyzer is specified.

POST http://127.0.0.1:9200/_analyze
{
 "analyzer":"standard",
 "text":"我是程序员"
}


{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "",
            "position": 1
        },
        {
            "token": "程",
            "start_offset": 2,
            "end_offset": 3,
            "type": "",
            "position": 2
        },
        {
            "token": "序",
            "start_offset": 3,
            "end_offset": 4,
            "type": "",
            "position": 3
        },
        {
            "token": "员",
            "start_offset": 4,
            "end_offset": 5,
            "type": "",
            "position": 4
        }
    ]
}
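The per-character output above comes from the standard analyzer having no Chinese dictionary: CJK text falls back to one token per character. The shape of the response can be reproduced with a few lines of Python (a simulation of the output, not the Lucene tokenizer itself):

```python
# Simulate the standard analyzer's fallback for CJK text:
# one token per character, with matching offsets and positions.
def standard_cjk(text):
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]

tokens = standard_cjk("我是程序员")
print([t["token"] for t in tokens])   # ['我', '是', '程', '序', '员']
```

This is exactly why a dedicated Chinese analyzer such as ik, covered below, is needed for useful Chinese search.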
simple analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and it lowercases every term.

POST http://127.0.0.1:9200/_analyze
{
  "analyzer":"simple",
  "text":"this is a book"
}

{
    "tokens": [
        {
            "token": "this",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 8,
            "end_offset": 9,
            "type": "word",
            "position": 2
        },
        {
            "token": "book",
            "start_offset": 10,
            "end_offset": 14,
            "type": "word",
            "position": 3
        }
    ]
}
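The behaviour above can be approximated as "split on anything that is not a letter, then lowercase". This is an ASCII-only sketch; the real tokenizer handles Unicode letters as well:

```python
import re

# Approximate the simple analyzer: runs of letters become terms,
# everything else (digits, punctuation, whitespace) is a separator.
def simple_analyze(text):
    return [
        (m.group().lower(), m.start(), m.end())
        for m in re.finditer(r"[A-Za-z]+", text)
    ]

print(simple_analyze("this is a book"))
# [('this', 0, 4), ('is', 5, 7), ('a', 8, 9), ('book', 10, 14)]
```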

whitespace analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

POST  http://127.0.0.1:9200/_analyze
{
  "analyzer":"whitespace",
  "text":"this is a book"
}

{
    "tokens": [
        {
            "token": "this",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 8,
            "end_offset": 9,
            "type": "word",
            "position": 2
        },
        {
            "token": "book",
            "start_offset": 10,
            "end_offset": 14,
            "type": "word",
            "position": 3
        }
    ]
}
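Unlike the simple analyzer, the whitespace analyzer keeps case and punctuation; only whitespace separates tokens. A minimal sketch, using an input that makes the difference visible:

```python
# Approximate the whitespace analyzer: split on runs of whitespace,
# leaving case and punctuation untouched.
def whitespace_analyze(text):
    return text.split()

print(whitespace_analyze("This is a BOOK-2024"))
# ['This', 'is', 'a', 'BOOK-2024']  (case and the hyphenated token survive)
```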

stop analyzer

The stop analyzer is like the simple analyzer, except that it also removes stop words; by default it uses the english stop word list.

stopwords is the predefined stop word list, e.g. (the, a, an, this, of, at) and so on.

POST  http://127.0.0.1:9200/_analyze

{
  "analyzer":"stop",
  "text":"this is a book"
}

{
    "tokens": [
        {
            "token": "book",
            "start_offset": 10,
            "end_offset": 14,
            "type": "word",
            "position": 3
        }
    ]
}
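The stop analyzer's behaviour is the simple analyzer's output minus stop words, which is why only "book" survives above. A sketch with a small stop word list (the real english list is longer):

```python
import re

# Tiny subset of the english stop word list, for illustration only.
STOPWORDS = {"the", "a", "an", "this", "of", "at", "is"}

# Approximate the stop analyzer: simple-analyzer tokenization,
# then drop any token found in the stop word list.
def stop_analyze(text):
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [t for t in tokens if t not in STOPWORDS]

print(stop_analyze("this is a book"))   # ['book']
```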
Chinese analyzer (ik)
Installation
Download it from https://github.com/medcl/elasticsearch-analysis-ik/releases and unzip it into es/plugins/ik (the plugin version must match your ES version).

Using the ik analyzer

  • ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国, 国歌”, exhausting every possible combination;
  • ik_smart: splits at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 国歌”.

Configuration files

File                 Description
IKAnalyzer.cfg.xml   configuration for custom dictionaries
main.dic             ik's built-in Chinese dictionary, over 270,000 entries; any word in it is kept together as a single token
preposition.dic      prepositions
quantifier.dic       measure words and unit-related terms
suffix.dic           suffixes
surname.dic          Chinese surnames
stopword.dic         English stop words

IKAnalyzer.cfg.xml (default content):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<!-- users can configure their own extension dictionaries here -->
	<entry key="ext_dict"></entry>
	<!-- users can configure their own extension stop word dictionaries here -->
	<entry key="ext_stopwords"></entry>
</properties>

POST localhost:9200/_analyze
{
    "analyzer":"ik_smart",
    "text":"中华人民共和国国歌"
}


{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}
POST localhost:9200/_analyze
{
    "analyzer":"ik_max_word",
    "text":"中华人民共和国国歌"
}

{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 8
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 9
        }
    ]
}
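Note that ik_max_word emits overlapping tokens, but every token's offsets still index back into the original string. That invariant can be checked directly against the response above:

```python
# Verify that each ik_max_word token matches the slice of the input
# named by its start_offset/end_offset, even though tokens overlap.
text = "中华人民共和国国歌"
tokens = [
    ("中华人民共和国", 0, 7), ("中华人民", 0, 4), ("中华", 0, 2),
    ("华人", 1, 3), ("人民共和国", 2, 7), ("人民", 2, 4),
    ("共和国", 4, 7), ("共和", 4, 6), ("国", 6, 7), ("国歌", 7, 9),
]
for token, start, end in tokens:
    assert text[start:end] == token
print("all offsets consistent")
```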
Custom dictionaries
POST localhost:9200/_analyze
{
    "analyzer":"ik_smart",
    "text":"魔兽世界"
}


// tokenization without the custom dictionary
{
    "tokens": [
        {
            "token": "魔兽",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "世界",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

Create a file mydic.dic and add the word 魔兽世界 to it.

Edit IKAnalyzer.cfg.xml as below, restart ES, and then test the analyzer again.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<entry key="ext_dict">mydic.dic</entry>
	<entry key="ext_stopwords"></entry>
</properties>

POST localhost:9200/_analyze
{
    "analyzer":"ik_smart",
    "text":"魔兽世界"
}

// tokenization with the custom dictionary
{
    "tokens": [
        {
            "token": "魔兽世界",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}
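The change in output can be modelled as greedy longest-match segmentation: once 魔兽世界 is in the dictionary, it wins over the shorter entries. This is a toy model of dictionary lookup, not the actual ik algorithm:

```python
# Greedy longest-match segmentation over a word dictionary.
# Unknown characters fall back to single-character tokens.
def segment(text, dictionary):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):     # try the longest span first
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

base = {"魔兽", "世界"}
print(segment("魔兽世界", base))                  # ['魔兽', '世界']
print(segment("魔兽世界", base | {"魔兽世界"}))   # ['魔兽世界']
```

The second call reproduces the effect of the mydic.dic entry: the longer dictionary word absorbs the two shorter ones.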
