Elasticsearch Basics

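Data types

Numeric - integer, long
Text - text; the value is analyzed (tokenized)
Boolean - boolean
Keyword - keyword; the value is indexed as-is, without analysis
Arrays - there is no dedicated array type, but any field can hold an array of values
Autocomplete - completion; supports prefix suggestions, and contexts can be attached so that suggestions can be filtered by category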

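Dynamic mapping

When is a mapping built dynamically?
When you do not create the index explicitly and instead PUT a document straight into it.
By default a mapping has dynamic=true: when a document introduces a new field, the field is added to the mapping, becomes searchable, and appears in _source.
With dynamic=false, the new field is kept in _source but is not indexed, so it cannot be searched.
With dynamic=strict, indexing a document that contains a new field fails with an error.
When a string field is mapped dynamically as text, it also gets a sub-field by default; the sub-field is named keyword and its type is keyword.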
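
As a minimal sketch of the dynamic settings described above, the requests below create an index with dynamic set to strict and then try to index a document with an unmapped field. The index name dynamic_demo and the field names are made up for illustration.

```console
# Hypothetical index: only "name" is mapped, and unknown fields are rejected
PUT dynamic_demo
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": { "type": "keyword" }
    }
  }
}

# "age" is not in the mapping, so with dynamic=strict this request is expected to fail
PUT dynamic_demo/_doc/1
{
  "name": "Jack",
  "age": 30
}
```

Changing "dynamic" to false should let the second request succeed, with age kept in _source but not searchable.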

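Why create indices manually in production instead of relying on dynamic mapping?

Index creation needs tighter control: choosing a suitable number of shards, the field types, the analyzers, and so on.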
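
A sketch of what such an explicit index definition might look like. The index name users_demo, the shard and replica counts, and the field list are illustrative assumptions rather than recommendations.

```console
# Hypothetical index with explicit settings and field types
PUT users_demo
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "user":      { "type": "keyword" },
      "message":   { "type": "text", "analyzer": "english" },
      "age":       { "type": "integer" },
      "active":    { "type": "boolean" },
      "post_date": { "type": "date" }
    }
  }
}
```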

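Elasticsearch cannot search for null values directly. How can this be done?

When defining the mapping, set null_value on the field: a placeholder that is indexed in place of an explicit null. Queries then search for that placeholder value. The _source still contains the original null.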
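
A minimal sketch of the null_value approach. The index name null_demo, the field mobile, and the placeholder string "NULL" are assumptions chosen for the example; a keyword field is used because text fields do not accept null_value.

```console
# Map the field with a placeholder that is indexed in place of an explicit null
PUT null_demo
{
  "mappings": {
    "properties": {
      "mobile": { "type": "keyword", "null_value": "NULL" }
    }
  }
}

# refresh=true makes the document visible to the search that follows
PUT null_demo/_doc/1?refresh=true
{
  "mobile": null
}

# Searching for the placeholder is expected to return the document above
GET null_demo/_search
{
  "query": { "term": { "mobile": "NULL" } }
}
```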

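What does the copy_to field attribute provide when defining a mapping?

It copies the values of several fields into one combined field. The combined field does not appear in _source, but it can be searched.
A single-field query against the copy_to target can therefore match across multiple fields.
For example:
PUT users
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text",
        "copy_to": "fullName"
      },
      "lastName": {
        "type": "text",
        "copy_to": "fullName"
      }
    }
  }
}

POST users/_search
{
  "query": {
    "match": {
      "fullName": {
        "query": "Ruan Yiming",
        "operator": "and"
      }
    }
  }
}
In this example we want the document with firstName=Ruan and lastName=Yiming. A single match query on fullName with "operator": "and" finds it. Without the fullName field, the search would have to be written as a multi-field query.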

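Analyzers

Analysis workflow diagram (character filters, then a tokenizer, then token filters)

Reference:
https://blog.csdn.net/michaelwubo/article/details/82218335
Common tokenizer types: standard, keyword, whitespace, path_hierarchy (splits file paths by level)
Common token filter types: stop, snowball (stemming for English, reducing inflected verb forms to their stem)
Common char_filter types: mapping (direct string replacement), pattern_replace (regular-expression replacement)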
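
The three stages can be tried together with an ad hoc _analyze request. This is only a sketch: the sample text and the ":) => happy" replacement rule are invented for the example.

```console
# char_filter runs first, then the tokenizer, then the token filters in order
POST _analyze
{
  "char_filter": [
    { "type": "mapping", "mappings": [":) => happy"] }
  ],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Foxes are :)"
}
```

The response lists the resulting tokens with their positions and offsets. Stop words such as "the" should be dropped after lowercasing.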

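What is a Dynamic Template?

A template that assigns field types dynamically; it can be specified when the index is created. For example, fields whose names start with "is" can be mapped as boolean by default. Note that match_mapping_type refers to the data type detected in the JSON document, so it can be set to string, even though string is not a valid type in an actual field mapping (text or keyword is used there).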

PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_boolean": {
          "match_mapping_type":   "string",
          "match":"is*",
          "mapping": {
            "type": "boolean"
          }
        }
      }
    ]
  }
}

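When you add data to an index that does not exist yet, how are fields mapped by default?
Numeric detection is off: "111" is mapped as text, while 111 is mapped as long.
Strings are not detected as booleans: "true" is mapped as text, while true is mapped as boolean.
Date detection is on: "2019-01-01" is mapped as date.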
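
These defaults can be changed per index through the numeric_detection and date_detection mapping parameters. The sketch below flips both; the index name detect_demo and the field names are made up for illustration.

```console
# Turn numeric detection on and date detection off for this index
PUT detect_demo
{
  "mappings": {
    "numeric_detection": true,
    "date_detection": false
  }
}

PUT detect_demo/_doc/1
{
  "count": "111",
  "created": "2019-01-01"
}

# With these settings, "count" is expected to be mapped as long and "created" as text
GET detect_demo/_mapping
```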

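What is an Index Template?

An index template supplies default values for settings and mappings when an index is created. A template takes effect only when an index is newly created, so changing a template does not affect existing indices. You can define several index templates, and their settings are merged together. The "order" value controls the merging process.
How index templates are applied, in order

Apply the Elasticsearch default settings and mappings
Apply the settings from index templates with a lower order value
Apply the settings from index templates with a higher order value, overriding the earlier ones
Apply the settings and mappings that the user specified when creating the index, overriding the template settings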
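
A sketch of the merge order, assuming the legacy _template API of the 7.x line, which is where the order parameter applies. Newer composable templates use priority and behave differently. The template names and index patterns are hypothetical.

```console
# Lower order: applied first
PUT _template/demo_default
{
  "index_patterns": ["demo*"],
  "order": 0,
  "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
}

# Higher order: applied second, overriding number_of_replicas
PUT _template/demo_test
{
  "index_patterns": ["demo-test*"],
  "order": 1,
  "settings": { "number_of_replicas": 2 }
}

# Matches both patterns; expected result is 1 shard (from order 0) and 2 replicas (from order 1)
PUT demo-test-1
GET demo-test-1/_settings
```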

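Basic document CRUD and bulk operations

Insert and update
Option 1: do not specify create or update. The document is replaced if it exists and inserted if it does not. Internally, the old document is deleted and the new one is written.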

PUT users/_doc/1
{
    "user" : "Jack",
    "post_date" : "2019-05-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

Option 2: specify create or update explicitly

#Create a document with a given id. Fails with an error if the id already exists
PUT users/_create/2
{
    "user" : "Jack",
    "post_date" : "2019-05-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

#Partial update: fields that exist are updated and new fields are added. The document itself must already exist.
POST users/_update/1/
{
    "doc":{
        "post_date" : "2019-05-15T14:12:12",
        "message" : "trying out Elasticsearch"
    }
}

Delete

#Delete a document by id
DELETE users/_doc/1

Bulk operations
A single request can carry multiple actions of different types, for example:

POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test2", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

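mget: fetch multiple documents
The index name can be given in the URL or in the JSON body, and you can also choose which _source fields to return

#Index specified in the URI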
GET /test/_mget
{
    "docs" : [
        {

            "_id" : "1"
        },
        {

            "_id" : "2"
        }
    ]
}
#Select which _source fields to return
GET /_mget
{
    "docs" : [
        {
            "_index" : "test",
            "_id" : "1",
            "_source" : false
        },
        {
            "_index" : "test",
            "_id" : "2",
            "_source" : ["field3", "field4"]
        },
        {
            "_index" : "test",
            "_id" : "3",
            "_source" : {
                "include": ["user"],
                "exclude": ["user.location"]
            }
        }
    ]
}

msearch

Run several searches, against one or more indices, in a single request

POST kibana_sample_data_ecommerce/_msearch
{}
{"query" : {"match_all" : {}},"size":1}
{"index" : "kibana_sample_data_flights"}
{"query" : {"match_all" : {}},"size":2}

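Inverted index basics: testing analyzers

Standard Analyzer: the default analyzer; splits on word boundaries and lowercases. It handles English well and splits Chinese text into single characters
Simple Analyzer: splits on non-letters (symbols and digits are dropped), lowercases
Stop Analyzer: lowercases and removes stop words (the, a, is)
Whitespace Analyzer: splits on whitespace, does not lowercase
Keyword Analyzer: no tokenization; the input is emitted as a single token
Pattern Analyzer: splits by regular expression, \W+ by default (runs of non-word characters)
Language analyzers: more than 30 analyzers for common languages
Sample text: 2 running Quick brown-foxes leap over lazy dogs in the summer evening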

Comparing the output of different analyzers

#standard: for English; splits into words, lowercases, drops punctuation
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#simple: for English; splits on non-letters, lowercases, drops digits and punctuation
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#stop: for English; like simple, and additionally removes stop words such as articles
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#whitespace: for English; splits on whitespace only, with no other filtering
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#keyword: no tokenization
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#pattern: for English; splits on non-word characters by default, lowercases, drops punctuation
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#english: for English; splits into words, lowercases, drops punctuation and stop words, stems words to their root form
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#icu_analyzer: for Chinese word segmentation (requires the analysis-icu plugin)
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}

#standard: can be applied to Chinese, but splits the text into single characters
POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}

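How to inspect how a field of an indexed document was tokenized

GET test2/_termvectors/2?fields=test_text
Returns the terms of the field after analysis, together with their statistics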

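What is a term query?

Term-based queries
● Why terms matter
● A term is the smallest unit that carries meaning. Both search and natural-language processing with statistical language models operate on terms
● Characteristics
● Term-level queries: Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query
● In ES, a term query does not analyze its input. The input is treated as a whole and looked up as an exact term in the inverted index, and every document containing that term is scored with the relevance formula, for example "Apple Store"
● A query can be converted to a filter with Constant Score, which skips scoring and makes use of caching to improve performance

Compound query: Constant Score converts a query into a filter

Converting the query into a filter skips the TF-IDF calculation and avoids the cost of relevance scoring
Filters can make effective use of caching

You can load-test the two request styles to compare their performance cost.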
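
The two request styles being compared might look like the following sketch. The field productID.keyword and the value "A-100" are hypothetical and stand in for any exact-match field on the products index.

```console
# Scored version: every matching document gets a relevance score
POST products/_search
{
  "query": {
    "term": { "productID.keyword": { "value": "A-100" } }
  }
}

# Filter version: no relevance calculation, all hits share a constant score, and the filter can be cached
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "productID.keyword": { "value": "A-100" } }
      }
    }
  }
}
```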

Built-in autocomplete suggestions in Elasticsearch

PUT articles
{
  "mappings": {
    "properties": {
      "title_completion":{
        "type": "completion"
      }
    }
  }
}

POST articles/_bulk
{ "index" : { } }
{ "title_completion": "lucene is very cool"}
{ "index" : { } }
{ "title_completion": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "title_completion": "Elasticsearch rocks"}
{ "index" : { } }
{ "title_completion": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "title_completion": "Elk stack rocks"}


POST articles/_search?pretty
{
  "size": 0,
  "suggest": {
    "article-suggester": {
      "prefix": "elk ",
      "completion": {
        "field": "title_completion"
      }
    }
  }
}

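What are the Term and Phrase Suggesters?

Suggestion features provided by Elasticsearch, typically used to correct misspelled input
Three suggestion modes
Missing: suggest only for terms that are not already in the index

Popular: suggest terms that occur more frequently

Always: suggest whether or not the term exists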
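
A sketch of a term suggester request using one of these modes. It assumes the articles index has an analyzed text field named body, which is not part of the mapping shown earlier, and the misspelled input is invented for the example.

```console
# "suggest_mode" accepts missing, popular, or always
POST articles/_search
{
  "size": 0,
  "suggest": {
    "term-suggestion": {
      "text": "lucen rock",
      "term": {
        "field": "body",
        "suggest_mode": "missing"
      }
    }
  }
}
```

With missing, a suggestion is returned only for input terms that do not occur in the index. Switching to popular or always widens what gets suggested.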

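Using should

When should sits at the same level as must, the should clauses are not required to match; a document that does match them simply gets a higher relevance score.
Add "minimum_should_match": n to control how many should clauses must match.
If a bool query has only should clauses and no must (or filter) clauses, at least one should clause must match.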
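
A sketch combining must, should, and minimum_should_match. The field productID.keyword and its values are hypothetical.

```console
# must is required; at least one of the two should clauses must also match
POST products/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "price": 30 } }
      ],
      "should": [
        { "term": { "productID.keyword": "A-100" } },
        { "term": { "productID.keyword": "B-200" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```

Removing minimum_should_match would make the should clauses optional again, so they would only affect the score.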

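Exact matching on array contents

#Change the data model by adding a count field, so that an array field can be matched exactly rather than by containment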
POST /newmovies/_bulk
{ "index": { "_id": 1 }}
{ "title" : "Father of the Bridge Part II","year":1995, "genre":"Comedy","genre_count":1 }
{ "index": { "_id": 2 }}
{ "title" : "Dave","year":1993,"genre":["Comedy","Romance"],"genre_count":2 }

#must: the clauses contribute to the score
POST /newmovies/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"genre.keyword": {"value": "Comedy"}}},
        {"term": {"genre_count": {"value": 1}}}

      ]
    }
  }
}

How to get should_not behaviour

#A must_not nested inside should gives should-not logic
POST /products/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "price": "30"
        }
      },
      "should": [
        {
          "bool": {
            "must_not": {
              "term": {
                "avaliable": "false"
              }
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

What boost is for

A weight multiplier that influences scoring

POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {
          "title": {
            "query": "apple,ipad",
            "boost": 1.1
          }
        }},

        {"match": {
          "content": {
            "query": "apple,ipad",
            "boost": 1
          }
        }}
      ]
    }
  }
}

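Queries that let you control scoring

A boosting query controls whether a matched keyword raises or lowers the score: documents matching positive are scored normally, and those that also match negative have their score multiplied by negative_boost.
POST news/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": "apple"
        }
      },
      "negative": {
        "match": {
          "content": "pie"
        }
      },
      "negative_boost": 0.5
    }
  }
}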

Using the terms query

A terms query takes an array of values; a document matches as soon as its field matches any one of them
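
For instance, against the newmovies sample data indexed earlier, a terms query might look like this sketch:

```console
# Matches documents whose genre contains "Comedy" or "Romance" (or both)
POST newmovies/_search
{
  "query": {
    "terms": {
      "genre.keyword": ["Comedy", "Romance"]
    }
  }
}
```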

Using multiple range conditions in a filter

Wrap them in a bool query

POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
       "bool": {
         "must" : [
            {"range":{
              "price":{
                "gte":10
                }
              }
            },
             {"range":{
              "date":{
                "gte":"2018-01-01"
                }
              }
            }
          ]
       }
      },
      "boost": 1.2
    }
  }
}
