Elastic Stack Study Notes, Part 2

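Contents

Term-based and full-text search
Structured search
Relevance scoring
Query and filter contexts with multi-string queries
Single-string multi-field queries
Single-string multi-field queries with multi_match
Hands-on practice
Search templates and aliases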

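Term-based and full-text search

Term queries

The desc field is analyzed at index time, so "iPhone" is stored as the lowercase token iphone. A term query does not analyze its input, so it looks up the literal term iPhone and finds nothing.
To get a hit with a term query, either query the analyzed token (iphone) or query the field's keyword sub-field (desc.keyword).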

POST /products/_bulk
{ "index": { "_id": 1 }}
{ "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
{ "index": { "_id": 2 }}
{ "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
{ "index": { "_id": 3 }}
{ "productID" : "JODL-X-1937-#pV7","desc":"MBP" }


POST products/_search
{
  "query": {
    "term": {
      "desc": {
        "value": "iPhone"
      }
    }
  },"profile": "true"
}
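
The mismatch between the indexed token and the unanalyzed term can be reproduced without a cluster. The following is a minimal Python sketch, not Elasticsearch's implementation: `standard_like_analyze` is a hypothetical stand-in for the standard analyzer (split on non-alphanumerics, lowercase), and the two dictionaries stand in for the text field and its keyword sub-field.

```python
import re

def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer: split and lowercase (assumption)."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

docs = {1: "iPhone", 2: "iPad", 3: "MBP"}

# Inverted index for the analyzed text field.
inverted = {}
for doc_id, desc in docs.items():
    for token in standard_like_analyze(desc):
        inverted.setdefault(token, set()).add(doc_id)

# Exact, unanalyzed values for the keyword sub-field.
keyword = {desc: {doc_id} for doc_id, desc in docs.items()}

def term_query(index, value):
    # A term query looks the value up as-is, with no analysis.
    return index.get(value, set())

print(term_query(inverted, "iPhone"))  # set()  (the index holds "iphone")
print(term_query(inverted, "iphone"))  # {1}
print(term_query(keyword, "iPhone"))   # {1}
```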

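Analysis result

A term query still computes a score, even on a keyword field. Wrapping it in a constant_score filter skips scoring, which reduces overhead and lets the filter result be cached.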

POST /products/_search
{
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "productID.keyword": "XHDK-A-1293-#fJ3"
        }
      }

    }
  }
}

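Full-text queries

A match query analyzes the query string into terms, runs a query for each term, and merges the results.
A match_phrase query treats the terms as one phrase and takes their positions into account; the slop parameter sets how far the positions may deviate.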
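
The difference between the two can be sketched in plain Python. This is a simplified illustration under stated assumptions: tokens are produced by whitespace splitting and lowercasing, `match` uses OR semantics, and `match_phrase` only models slop 0 (terms adjacent and in order). The function names are hypothetical, not Elasticsearch APIs.

```python
def tokens(text):
    # Simplified analysis: lowercase and split on whitespace (assumption).
    return text.lower().split()

def match(doc, query):
    """match-like: hit if the document contains any query term (OR)."""
    doc_terms = set(tokens(doc))
    return any(t in doc_terms for t in tokens(query))

def match_phrase(doc, query):
    """match_phrase-like with slop 0: query terms adjacent and in order."""
    d, q = tokens(doc), tokens(query)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

doc = "My quick brown fox eats rabbits on a regular basis."

print(match(doc, "fox brown"))         # True: term order does not matter
print(match_phrase(doc, "brown fox"))  # True: adjacent and in order
print(match_phrase(doc, "fox brown"))  # False: wrong order
print(match_phrase(doc, "quick fox"))  # False: would need slop >= 1
```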

Structured search

Searching boolean values

POST products/_search
{
  "query": {
    "term": {
      "avaliable": {
        "value": "true"
      }
    }
  }
}

POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": true
        }
      }
    }
  }
}

Numeric ranges

"query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 10,
            "lte": 20
          }
        }
      }
    }
  }

Searching dates
now-4y is the current time minus four years, so this matches documents dated within the last four years.

"query": {
    "constant_score": {
      "filter": {
        "range": {
          "date": {
            "gte" : "now-4y"
          }
        }
      }
    }
  }

Checking that a field is not null

"query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      }
    }
  }

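Multi-value fields
A term query on a multi-value field matches documents whose genre contains Comedy, not only documents whose genre is exactly Comedy.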

POST movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "genre.keyword": "Comedy"
        }
      }
    }
  }
}

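Relevance scoring

tf (term frequency): how often a term occurs in a document. For example, in "我是中国人, 生在中国" the term 中国 occurs twice.
df (document frequency): how many documents contain the term.
idf (inverse document frequency): the inverse of df, which measures how rare, and therefore how distinguishing, the term is. For example, if 中国 occurs in 200 of 1,000 documents, idf = log(1000/200).

Lucene originally scored with tf-idf, weighting tf by idf, and later switched to BM25, which fixes the problem that the score grows without bound as tf grows. Elasticsearch lets you choose the scoring algorithm when creating an index.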
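
To make the saturation effect concrete, here is a small Python sketch using the textbook forms of tf-idf and of the BM25 term-frequency component. It is an illustration only: Lucene's actual formulas differ in detail, and the defaults `k1=1.2` and `b=0.75` are the commonly cited BM25 parameters, assumed here.

```python
import math

def idf(total_docs, docs_with_term):
    # Textbook inverse document frequency, as in log(1000/200) above.
    return math.log(total_docs / docs_with_term)

def tf_idf(tf, total_docs, docs_with_term):
    # Grows linearly with tf, so it has no upper bound.
    return tf * idf(total_docs, docs_with_term)

def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Textbook BM25 tf component: saturates below k1 + 1 as tf grows,
    # and shorter-than-average documents get a higher value.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)

for tf in (1, 10, 1000):
    print(tf, tf_idf(tf, 1000, 200), bm25_tf(tf, doc_len=10, avg_doc_len=10))
```

With an average-length document, the BM25 component starts at 1.0 for tf = 1 and approaches, but never reaches, k1 + 1 = 2.2, while the tf-idf value keeps growing.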

Using explain to inspect scoring details
Both documents contain the target term, but document 2 is shorter, so its tf component scores higher.

PUT testscore/_bulk
{ "index": { "_id": 1 }}
{ "content":"we use Elasticsearch to power the search" }
{ "index": { "_id": 2 }}
{ "content":"we like elasticsearch" }
{ "index": { "_id": 3 }}
{ "content":"The scoring of documents is caculated by the scoring formula" }
{ "index": { "_id": 4 }}
{ "content":"you know, for search" }

POST testscore/_search
{
  "query": {
    "match": {
      "content": "elasticsearch"
      
    }
  },"explain": true
}

Use a boosting query to control the scoring result. In this example, the negative clause demotes documents that contain like.

POST testscore/_search
{
    "query": {
        "boosting" : {
            "positive" : {
                "term" : {
                    "content" : "elasticsearch"
                }
            },
            "negative" : {
                 "term" : {
                     "content" : "like"
                }
            },
            "negative_boost" : 0.2
        }
    }
}
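
The effect of negative_boost can be sketched as follows. This is a simplified model, assuming the final score is the positive score multiplied by negative_boost whenever the negative clause also matches; the input scores are hypothetical placeholders, not values returned by Elasticsearch.

```python
def boosting_score(positive_score, matches_negative, negative_boost=0.2):
    """Demote, but do not exclude, documents that match the negative clause."""
    if positive_score is None:
        return None  # the positive clause must match for the document to be returned
    return positive_score * negative_boost if matches_negative else positive_score

# Hypothetical positive scores for the two documents containing "elasticsearch".
doc1 = boosting_score(0.5, matches_negative=False)  # "we use Elasticsearch to power the search"
doc2 = boosting_score(0.7, matches_negative=True)   # "we like elasticsearch"

print(doc1, doc2)
print("doc1 ranks first:", doc1 > doc2)
```

Unlike must_not, the demoted document still appears in the results, only lower in the ranking.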

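Query and filter contexts with multi-string queries

A bool query combines query clauses across multiple fields.
must and should contribute to the score; filter and must_not do not.

# Basic syntax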
POST /products/_search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "price" : "30" }
      },
      "filter": {
        "term" : { "avaliable" : "true" }
      },
      "must_not" : {
        "range" : {
          "price" : { "lte" : 10 }
        }
      },
      "should" : [
        { "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
        { "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
      ],
      "minimum_should_match" :1
    }
  }
}
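
How the four clause types interact can be modelled in a few lines of Python. This is a rough sketch, not how Elasticsearch evaluates the query: each clause is a hypothetical function returning a score or None, every matching clause scores 1.0, and the sample document is made up for illustration.

```python
def term(field, value, score=1.0):
    """Clause that yields `score` on an exact match, else None (no match)."""
    return lambda doc: score if doc.get(field) == value else None

def range_lte(field, bound, score=1.0):
    return lambda doc: score if field in doc and doc[field] <= bound else None

def bool_score(doc, must=(), filter_=(), must_not=(), should=(), minimum_should_match=0):
    must_scores = [clause(doc) for clause in must]
    if any(s is None for s in must_scores):
        return None                      # every must clause has to match
    if any(clause(doc) is None for clause in filter_):
        return None                      # filter has to match, but adds no score
    if any(clause(doc) is not None for clause in must_not):
        return None                      # must_not excludes, and adds no score
    should_scores = [s for s in (clause(doc) for clause in should) if s is not None]
    if len(should_scores) < minimum_should_match:
        return None
    return sum(must_scores) + sum(should_scores)

# Hypothetical document shaped like the products example.
doc = {"price": 30, "avaliable": True, "productID": "JODL-X-1937-#pV7"}
score = bool_score(
    doc,
    must=[term("price", 30)],
    filter_=[term("avaliable", True)],
    must_not=[range_lte("price", 10)],
    should=[term("productID", "JODL-X-1937-#pV7"), term("productID", "XHDK-A-1293-#fJ3")],
    minimum_should_match=1,
)
print(score)  # 2.0: one must clause plus one matching should clause
```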

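Single-string multi-field queries

Use a dis_max (disjunction max) query to search multiple fields: it compares the score of each field and takes the highest.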

PUT /blogs/_doc/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /blogs/_doc/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}
POST blogs/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {
          "title": "Brown fox"
        }},
        {
          "match": {
            "body": "Brown fox"
          }
        }
      ]
    }
  },
  "explain": true
}

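In this example, document 1 contains brown in both fields, but dis_max keeps only the larger of the two field scores. Document 2 matches both brown and fox in its body, so that field's score adds the two term scores together; fox appears in only one document, so it is rarer and scores higher. Document 2 therefore fits the query better, gets the higher score, and ranks first.

Searching only for Quick pets gives both documents the same score, because each document's best-scoring field matches exactly one of the query terms.
tie_breaker balances the scores: instead of taking only the best field, it multiplies the scores of the other matching fields by tie_breaker and adds them to the total.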

POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.2
        }
    }
}
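
The arithmetic behind tie_breaker can be written down directly: the best field's score plus tie_breaker times the sum of the other matching fields' scores. The sketch below uses hypothetical per-field scores for the Quick pets query, chosen so that both documents tie without a tie_breaker.

```python
def dis_max(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times the other matching fields' scores."""
    matched = [s for s in field_scores if s is not None]
    if not matched:
        return None
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)

# Hypothetical [title, body] scores; None means the field did not match.
doc1 = [0.6, None]  # title matches "quick", body matches nothing
doc2 = [0.6, 0.3]   # title matches "pets", body matches "quick"

print(dis_max(doc1), dis_max(doc2))            # tie: only the best field counts
print(dis_max(doc1, 0.2), dis_max(doc2, 0.2))  # doc2 now scores higher
```

With tie_breaker set to 0 this is the plain dis_max behaviour, and with tie_breaker set to 1 it becomes the sum of all matching fields.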

If the scores of several fields should all count, my own feeling is that a bool query can be used instead.

POST /blogs/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    },"explain": true
}

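Single-string multi-field queries with multi_match

For example, searching only title for barking dogs ranks document 1 higher because it is shorter, although document 2 is actually the better match. The english analyzer stems both titles to the same terms (bark, dog), so only length separates them. To raise document 2's score in this case, add a title.std sub-field that uses the standard analyzer and query both fields.
multi_match is simpler to write than dis_max. Its default type is best_fields, which behaves like dis_max.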

DELETE /titles
PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {"std": {"type": "text","analyzer": "standard"}}
      }
    }
  }
}

POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }

GET titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}


GET titles/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      //"type": "best_fields", 
      "query": "barking dogs",
      "fields": ["title","title.std"]
    }
  }
}
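
The two multi_match types differ only in how per-field scores are combined, which the following sketch illustrates. It assumes best_fields takes the maximum (default tie_breaker of 0) and most_fields adds the matching fields together; the per-field scores are hypothetical numbers chosen for illustration, not values from a real cluster.

```python
def best_fields(field_scores):
    """Score of the single best matching field (tie_breaker assumed to be 0)."""
    matched = [s for s in field_scores.values() if s is not None]
    return max(matched) if matched else None

def most_fields(field_scores):
    """Sum of the scores of all matching fields."""
    matched = [s for s in field_scores.values() if s is not None]
    return sum(matched) if matched else None

# Hypothetical scores for the query "barking dogs"; None means no match.
doc1 = {"title": 0.4, "title.std": None}  # "My dog barks": only the stemmed field matches
doc2 = {"title": 0.3, "title.std": 0.35}  # "...barking dogs...": both fields match

print("best_fields:", best_fields(doc1), best_fields(doc2))
print("most_fields:", most_fields(doc1), most_fields(doc2))
```

Under these assumed scores, document 1 wins with best_fields, while most_fields lets the exact-form match in title.std push document 2 to the top.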

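Hands-on practice

Import the TMDB dataset into Elasticsearch with the english analyzer on title in the mapping, then use multi_match with the query string "basketball with cartoon aliens" to find Space Jam.
If the data is indexed with the default standard analyzer, the search does not find it.
multi_match uses best_fields mode by default, which takes only the score of the highest-scoring field.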

"mappings": {
    "properties": {
      "overview": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "std": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      },
      "popularity": {
        "type": "float"
      },
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }

"query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title^10","overview"]
    }
  }

Installing pyenv on Windows

pip install pyenv-win --target %USERPROFILE%/.pyenv

How to install multiple Python versions on Windows 10 with pyenv
My environment variables are misconfigured, so I run cmd directly from the pyenv directory.

pyenv install 2.7.15
pyenv versions
python -V
pyenv global 2.7.15
pyenv global

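Default mapping, default query

Mapping with the english analyzer, most_fields mode

Default mapping, default query: for Space Jam only basketball matches. The document's alien is not matched, because the analyzer keeps the query term as aliens.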

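Search templates and aliases

A search template stores a predefined query script that later searches can reference, with variables filled in per request. The template can then be modified directly, which changes the query results without changing the callers.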

POST tmdb/_search
{
   "_source": ["title","overview"],
      "size":20,
      "query": {
          "multi_match": {
              "type": "most_fields", 
              "query": "basketball with cartoon aliens",
              "fields": ["title","overview"]
          }
      }
  ,"explain": true
}

POST _scripts/tmdb
{
  "script": {
    "lang": "mustache",
    "source": {
      "_source": [
        "title","overview"
      ],
      "size": 20,
      "query": {
        "multi_match": {
          "query": "{{q}}",
          "fields": ["title","overview"]
        }
      }
    }
  }
}

POST tmdb/_search/template
{
    "id":"tmdb",
    "params": {
        "q": "basketball with cartoon aliens"
    }
}
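
The variable substitution can be pictured with a tiny Python sketch. This is not a Mustache implementation: the hypothetical `render` helper only replaces `{{name}}` placeholders with plain values and does no escaping, so it is suitable for illustration only.

```python
import json
import re

def render(template, params):
    """Replace each {{name}} placeholder with the matching parameter (no escaping)."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), template)

# The stored template body, kept as a JSON string with a {{q}} placeholder.
template = json.dumps({
    "_source": ["title", "overview"],
    "size": 20,
    "query": {
        "multi_match": {"query": "{{q}}", "fields": ["title", "overview"]}
    },
})

rendered = json.loads(render(template, {"q": "basketball with cartoon aliens"}))
print(rendered["query"]["multi_match"]["query"])  # basketball with cartoon aliens
```

Because callers send only the template id and params, editing the stored template changes what every caller runs.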

Aliases
