ES入门之CURD-2_大数据系统

ES入门之CURD-2
1. 为什么ES不将数据存储为二进制文件，而使用JSON格式存储
答：ES不仅要存储文档数据，还做对其创建索引

2. 三个基本概念
  > mapping: 映射，处理数据中每个字段如何存储
  > analysis：分析，全文如何被索引到
  > Query DSL：ES提供的灵活强大的检索语言
  > 提醒：倒排索引检索中间件，匹配规则都是：模糊匹配+source。与Mysql等结构化匹配都不同

3. select *
  > GET /_search ： 查询所有的文档
  > GET /_index1,index2/_search ：查询索引index1和index2下的所有文档
  > GET /_type1,type2,_type3*/_search ：查询类型type1、type2、type3下的所有文档
  > GET /_all/tweet*/_search ：查询所有索引下的tweet开头类型下的所有文档
  - 以此类推 GET /_index/_type/_search，支持正则匹配，多个_index、_type使用","隔开
  注意： url中每个位置不可以空缺，使用_all填充。例如：/_all/_type/_search

4. select * from table offset 1 limit 10
  > GET /_search?size=10&from=1
  重点： 结构化数据库和分布式系统查询，在分页中，我们都知道，随着分页深度的增加，排序的成本成指数增
  加，分布式系统和ElasticSearch的分片存储，类似Mysql的分库分表，排序的成本都是极高的。所以建议Web型
  检索对任何结果都不要超过1000。（查询结果不要超过1000，不是分页大小，可以理解为结果集不要超过1000，
  随着分页深度的增加，排序后，丢弃无用数据也在指数增加（目标分页前的数据都将丢弃））

5. 请求参数：q
  > GET /_search?q=+name:John
  - “+”：标识必须匹配
  - “-”：标识必须不可以匹配
  - “” ：不加标识，表示 或
  - 如果同时使用“+”“-”但符号一定要编码

6. 原理
  > 倒排索引
  > 分析：文档分词+标准化分词 （简单解释：字符过滤器-去除非字符、分词器-分 词、Token过滤器-大小写转换）
  > 映射：
      - 一般而言：只指定type属性就行
      - "type":"string":
          index: analyzed(全文检索)、no_analyzed(精确匹配)、no(不被索引到)
          analyzed：whitespace 、 simple 和 english

7. 请求体查询：
  > query:
  > match_all:
  > match:
    {
        "query": {
            "match": {
                "tweet": "elasticsearch"
            }
        }
    }
    {
        "query": {
            "match_all": {}
        }
    }
  > bool:
  > bool.must:
  > bool.should:
  > bool.must_not:
  > bool.filter:
    {
        "bool": {
            "must": { "match":   { "email": "business opportunity" }},
            "should": [
                { "match":       { "starred": true }},
                { "bool": {
                    "must":      { "match": { "folder": "inbox" }},
                    "must_not":  { "match": { "spam": true }}
                }}
            ],
            "minimum_should_match": 1
        }
    }
  > multi_match:  匹配多个字段同一查询
    {
        "multi_match": {
            "query":    "full text search",
            "fields":   [ "title", "body" ]
        }
    }
  > range: 范伟查询
  {
      "range": {
          "age": {
              "gte":  20,
              "lte":   30,
              "lt":   30,
              "gt":   30
          }
      }
  }
  > term: 精准匹配单个值（精准匹配）
  {
      "term": {
          "age": 26
      }
  }
  > terms: 精准匹配多个值（精准匹配）
  {
      "terms": {
          "tag": [
              "search",
              "full_text",
              "nosql"
          ]
      }
  }
  > exists：存在
  > missing：不存在
  {
      "exists":   {
          "field":    "title"
      }
  }
  > constant_score: 同filter，性能相同

8. 匹配和过滤
  - 匹配： 指的是相似性，携带_score （由于需要计算相关性，性能较差【都是基于同等量的数据】）
  - 过滤： 指的是完全比匹配（只获取yes、no的结果，性能较好）

9. 复合查询
  {
      "bool": {
          "must":     { "match": { "title": "how to make millions" }},
          "must_not": { "match": { "tag":   "spam" }},
          "should": [
              { "match": { "tag": "starred" }}
          ],
          "filter": {
            "bool": {
                "must": [
                    { "range": { "date": { "gte": "2014-01-01" }}},
                    { "range": { "price": { "lte": 29.99 }}}
                ],
                "must_not": [
                    { "term": { "category": "ebooks" }}
                ]
            }
          }
      }
  }

10. 查询分析
  > GET /gb/tweet/_validate/query  :  _validate-校验Api合法性
  > GET /gb/tweet/_validate/query?explain : explain-分析查询效率，等同于Mysql的explain
ES入门之CURD-2

大数据系统相关栏目本月热门文章