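Contents: Term-based and full-text search; Structured search; Relevance scoring; Query context, filter context, and multi-string queries; Single-string multi-field queries; Single-string multi-field queries with multi_match; Hands-on; Search templates and aliases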
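Term-based and full-text search
Term-based queries
The desc field is analyzed at index time, so iPhone is stored as the lowercase term iphone. A term query does not analyze its input, so it looks up the literal term iPhone and finds nothing.
To get a hit with a term query, either search for the analyzed term (iphone) or query the field's keyword subfield (desc.keyword).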
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
{ "index": { "_id": 2 }}
{ "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
{ "index": { "_id": 3 }}
{ "productID" : "JODL-X-1937-#pV7","desc":"MBP" }
POST products/_search
{
"query": {
"term": {
"desc": {
"value": "iPhone"
}
}
},"profile": "true"
}
Analysis result: the standard analyzer emits the single lowercase token iphone for iPhone.
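To make the mismatch concrete, here is a minimal Python sketch of the two lookups. It is not how Elasticsearch is implemented: `analyze` is a rough stand-in for the standard analyzer, and `text_index`, `keyword_index`, `term_query` and `match_query` are hypothetical names used only for illustration.

```python
import re
from collections import defaultdict

def analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

docs = {1: "iPhone", 2: "iPad", 3: "MBP"}

text_index = defaultdict(set)     # analyzed field: token -> doc ids
keyword_index = defaultdict(set)  # keyword subfield: whole value, untouched
for doc_id, desc in docs.items():
    for token in analyze(desc):
        text_index[token].add(doc_id)
    keyword_index[desc].add(doc_id)

def term_query(index, value):
    """A term query looks the value up as-is, with no analysis."""
    return sorted(index.get(value, set()))

def match_query(index, text):
    """A match query analyzes the input first, then looks up each token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)

print(term_query(text_index, "iPhone"))     # misses: the index holds "iphone"
print(term_query(text_index, "iphone"))     # hits the analyzed term
print(term_query(keyword_index, "iPhone"))  # hits the untouched keyword value
print(match_query(text_index, "iPhone"))    # input is analyzed, so it hits
```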
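A term query still computes a relevance score, even on a keyword field. Wrapping it in a constant_score filter skips scoring, which reduces overhead and lets the filter result be cached.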
POST /products/_search
{
"explain": true,
"query": {
"constant_score": {
"filter": {
"term": {
"productID.keyword": "XHDK-A-1293-#fJ3"
}
}
}
}
}
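Full-text queries
match query: the query string is analyzed into terms, each term is searched separately, and the results are combined.
match_phrase query: treats the terms as one phrase and checks their relative positions; the slop parameter allows some positional deviation.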
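The difference between the two can be sketched in Python. This is a simplified model, not Lucene's implementation: it assumes pre-tokenized lowercase input, ignores scoring, and approximates slop as the spread of position offsets between the query terms. The function names are hypothetical.

```python
from itertools import product

def positions(tokens):
    """Map each token to the list of positions where it occurs."""
    pos = {}
    for i, tok in enumerate(tokens):
        pos.setdefault(tok, []).append(i)
    return pos

def match(doc_tokens, query_tokens):
    """match with the default OR operator: any query term is enough."""
    return any(t in doc_tokens for t in query_tokens)

def match_phrase(doc_tokens, query_tokens, slop=0):
    """Phrase match: all terms present, and their positions line up within slop."""
    if not query_tokens:
        return False
    pos = positions(doc_tokens)
    if any(t not in pos for t in query_tokens):
        return False
    for combo in product(*(pos[t] for t in query_tokens)):
        # Shift each position by the term's index in the query; a perfect
        # phrase gives identical shifted values, so the spread is 0.
        shifted = [p - i for i, p in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False

doc = "my quick brown fox eats rabbits".split()
print(match(doc, ["quick", "dog"]))               # one term is enough
print(match_phrase(doc, ["quick", "fox"]))        # "brown" sits in between
print(match_phrase(doc, ["quick", "fox"], slop=1))
```

With slop=0 the phrase "quick fox" fails because brown separates the two words; slop=1 tolerates that one-position gap.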
Structured search
Searching boolean values
POST products/_search
{
"query": {
"term": {
"avaliable": {
"value": "true"
}
}
}
}
POST products/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"avaliable": true
}
}
}
}
}
Searching numeric ranges
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 10,
"lte": 20
}
}
}
}
}
Searching dates
now-4y means the current time minus 4 years, so this matches documents dated within the last 4 years.
"query": {
"constant_score": {
"filter": {
"range": {
"date": {
"gte" : "now-4y"
}
}
}
}
}
Finding documents where a field is not null
"query": {
"constant_score": {
"filter": {
"exists": {
"field": "date"
}
}
}
}
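Querying multi-valued fields
A term query on a multi-valued field matches documents whose genre contains Comedy, not only documents whose genre is exactly Comedy.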
POST movies/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"genre.keyword": "Comedy"
}
}
}
}
}
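The "contains" versus "exactly equals" distinction can be sketched in a few lines of Python. The sample `movies` data and both helper names are hypothetical, chosen only to illustrate the semantics.

```python
# Hypothetical sample data: doc id -> values of the multi-valued genre field
movies = {
    1: ["Comedy"],
    2: ["Comedy", "Romance"],
    3: ["Drama"],
}

def term_contains(docs, value):
    """What a term query does: match if ANY value of the field equals the term."""
    return sorted(doc_id for doc_id, genres in docs.items() if value in genres)

def term_exactly(docs, value):
    """Stricter semantics: the field holds this value and nothing else."""
    return sorted(doc_id for doc_id, genres in docs.items() if genres == [value])

print(term_contains(movies, "Comedy"))
print(term_exactly(movies, "Comedy"))
```

A term query alone cannot express the stricter form. One common workaround, stated here as an assumption rather than something these notes cover, is to index an extra field holding the number of values and filter on it as well.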
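Relevance scoring
TF (term frequency): how often a term occurs in a document. For example, in 我是中国人, 生在中国 ("I am Chinese, born in China") the term 中国 occurs twice.
DF (document frequency): how many documents contain the term. IDF is its inverse: if 中国 appears in 200 out of 1000 documents, IDF = log(1000/200).
IDF (inverse document frequency): a measure of how rare the term is across the index; rarer terms carry more weight.
Lucene originally scored with TF-IDF (TF weighted by IDF) and later switched to BM25, which fixes the problem of the score growing without bound as TF grows. ES lets you choose the scoring (similarity) algorithm when you create an index.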
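The saturation effect is easier to see with numbers. The sketch below uses the textbook forms of both formulas, which differ in detail from what Lucene actually computes; k1=1.2 and b=0.75 are the usual BM25 defaults, and the function names are hypothetical.

```python
import math

def idf_textbook(total_docs, docs_with_term):
    """Textbook IDF, e.g. log(1000/200) for the example above."""
    return math.log(total_docs / docs_with_term)

def tfidf(tf, total_docs, docs_with_term):
    """Plain TF-IDF: grows linearly with tf, so it has no upper bound."""
    return tf * idf_textbook(total_docs, docs_with_term)

def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Classic BM25 term-frequency component: saturates toward k1 + 1."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)

for tf in (1, 2, 10, 100):
    print(tf,
          round(tfidf(tf, 1000, 200), 3),
          round(bm25_tf(tf, doc_len=10, avg_doc_len=10), 3))

# Length normalization: the same tf counts for more in a shorter document.
print(bm25_tf(1, doc_len=3, avg_doc_len=7) > bm25_tf(1, doc_len=7, avg_doc_len=7))
```

The TF-IDF column keeps growing with tf, while the BM25 column flattens out below k1 + 1 = 2.2.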
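Use explain to inspect the scoring details
Documents 1 and 2 both contain the target term, but document 2 is shorter, so its TF component is higher and it scores higher.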
PUT testscore/_bulk
{ "index": { "_id": 1 }}
{ "content":"we use Elasticsearch to power the search" }
{ "index": { "_id": 2 }}
{ "content":"we like elasticsearch" }
{ "index": { "_id": 3 }}
{ "content":"The scoring of documents is caculated by the scoring formula" }
{ "index": { "_id": 4 }}
{ "content":"you know, for search" }
POST testscore/_search
{
"query": {
"match": {
"content": "elasticsearch"
}
},"explain": true
}
Use a boosting query to control the scores. In this example the negative clause demotes documents that contain like.
POST testscore/_search
{
"query": {
"boosting" : {
"positive" : {
"term" : {
"content" : "elasticsearch"
}
},
"negative" : {
"term" : {
"content" : "like"
}
},
"negative_boost" : 0.2
}
}
}
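How negative_boost changes the ranking can be sketched as follows. The positive scores are made-up numbers, not real Elasticsearch output, and `boosting_score` is a hypothetical helper.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Documents matching the negative clause keep their hit but are scaled down."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score

# Hypothetical positive scores for the two documents containing "elasticsearch";
# only doc 2 ("we like elasticsearch") matches the negative term "like".
candidates = {1: (0.8, False), 2: (1.0, True)}

ranked = sorted(
    ((boosting_score(score, neg, 0.2), doc_id)
     for doc_id, (score, neg) in candidates.items()),
    reverse=True,
)
print(ranked)  # doc 1 now outranks doc 2
```

Unlike must_not, the negative clause does not exclude document 2; it only pushes it down the list.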
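Query context, filter context, and multi-string queries
A bool query combines several query clauses.
must and should contribute to the score; filter and must_not do not.
# Basic syntax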
POST /products/_search
{
"query": {
"bool" : {
"must" : {
"term" : { "price" : "30" }
},
"filter": {
"term" : { "avaliable" : "true" }
},
"must_not" : {
"range" : {
"price" : { "lte" : 10 }
}
},
"should" : [
{ "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
{ "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
],
"minimum_should_match" :1
}
}
}
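Single-string multi-field queries
Use a dis_max (disjunction max) query to search several fields: it compares the score of each field and takes the highest one.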
PUT /blogs/_doc/1
{
"title": "Quick brown rabbits",
"body": "Brown rabbits are commonly seen."
}
PUT /blogs/_doc/2
{
"title": "Keeping pets healthy",
"body": "My quick brown fox eats rabbits on a regular basis."
}
POST blogs/_search
{
"query": {
"dis_max": {
"queries": [
{"match": {
"title": "Brown fox"
}},
{
"match": {
"body": "Brown fox"
}
}
]
}
},
"explain": true
}
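In this example, document 1 contains brown in both fields, but dis_max keeps only the higher of the two field scores. Document 2 matches both brown and fox in its body, so that field's score adds the two terms together; fox appears in only one document, which makes it rarer and worth more. Document 2 is therefore the better match, scores higher, and ranks first.
If you search for Quick pets instead, both documents get the same score, because the best-scoring field in each document matches only one of the two words.
Setting tie_breaker evens the scores out. Without it only the best field counts; with it, the scores of the other matching fields are multiplied by tie_breaker and added to the best score.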
POST blogs/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
],
"tie_breaker": 0.2
}
}
}
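The arithmetic behind dis_max and tie_breaker can be sketched in Python. The per-field scores below are made up for illustration, and `dis_max` and `bool_should` are hypothetical helpers that model only how field scores are combined.

```python
def dis_max(field_scores, tie_breaker=0.0):
    """Best field score, plus tie_breaker times the other matching field scores."""
    matching = [s for s in field_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)

def bool_should(field_scores):
    """bool/should simply adds up the scores of the matching clauses."""
    return sum(field_scores)

# Hypothetical (title, body) scores for the query "Quick pets"
doc1 = [0.6, 0.0]  # title matches "quick", body matches nothing
doc2 = [0.6, 0.3]  # title matches "pets", body matches "quick"

print(dis_max(doc1), dis_max(doc2))            # tie: only the best field counts
print(dis_max(doc1, 0.2), dis_max(doc2, 0.2))  # doc2 pulls ahead
print(bool_should(doc1), bool_should(doc2))
```

With tie_breaker=0 the two documents tie; any positive tie_breaker lets the second matching field of document 2 break the tie, and tie_breaker=1 gives the same totals as adding the fields up.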
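If the scores of several fields should all count, my feeling is that a bool query with should clauses can be used instead, since it adds the scores of the matching clauses together.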
POST /blogs/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
]
}
},"explain": true
}
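Single-string multi-field queries with multi_match
For example, searching only title for barking dogs ranks document 1 higher because it is shorter, even though document 2 is the better match. To raise document 2's score in this situation, add a title.std subfield (standard analyzer) and query both fields.
multi_match is simpler to write than dis_max. Its default type is best_fields, which behaves like dis_max.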
DELETE /titles
PUT /titles
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "english",
"fields": {"std": {"type": "text","analyzer": "standard"}}
}
}
}
}
POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }
GET titles/_search
{
"query": {
"match": {
"title": "barking dogs"
}
}
}
GET titles/_search
{
"query": {
"multi_match": {
"type": "most_fields",
//"type": "best_fields",
"query": "barking dogs",
"fields": ["title","title.std"]
}
}
}
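Hands-on
Import the TMDB data into ES with the english analyzer on title in the mapping, then use multi_match with the query string "basketball with cartoon aliens" to find Space Jam.
If the data is indexed with the default standard analyzer, the search does not find it.
multi_match defaults to best_fields mode, which uses only the score of the highest-scoring field.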
"mappings": {
"properties": {
"overview": {
"type": "text",
"analyzer": "english",
"fields": {
"std": {
"type": "text",
"analyzer": "standard"
}
}
},
"popularity": {
"type": "float"
},
"title": {
"type": "text",
"analyzer": "english",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
"query": {
"multi_match": {
"query": "basketball with cartoon aliens",
"fields": ["title^10","overview"]
}
}
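Installing pyenv on Windows
pip install pyenv-win --target %USERPROFILE%/.pyenv
How to use pyenv to install multiple Python versions on Windows 10
My environment variables were misconfigured, so I ran cmd directly from the pyenv directory.
pyenv install 2.7.15
pyenv versions
python -V
pyenv global 2.7.15
pyenv global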
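Default mapping, default query
Mapping with the english analyzer, most_fields mode
Default mapping, default query: Space Jam matches only on basketball. The document contains alien, but the standard analyzer keeps the query token aliens unstemmed, so it does not match.
Search templates: store a predefined query script, then reference it by id in later searches and pass in variables. Editing the stored template later changes what those searches return.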
POST tmdb/_search
{
"_source": ["title","overview"],
"size":20,
"query": {
"multi_match": {
"type": "most_fields",
"query": "basketball with cartoon aliens",
"fields": ["title","overview"]
}
}
,"explain": true
}
POST _scripts/tmdb
{
"script": {
"lang": "mustache",
"source": {
"_source": [
"title","overview"
],
"size": 20,
"query": {
"multi_match": {
"query": "{{q}}",
"fields": ["title","overview"]
}
}
}
}
}
POST tmdb/_search/template
{
"id":"tmdb",
"params": {
"q": "basketball with cartoon aliens"
}
}
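The substitution step can be sketched in Python. This is only an illustration of the `{{q}}` placeholder idea, not a Mustache engine and not what Elasticsearch runs internally; `render` is a hypothetical helper.

```python
import json

# Same shape as the "source" of the stored template above
template_source = {
    "_source": ["title", "overview"],
    "size": 20,
    "query": {
        "multi_match": {
            "query": "{{q}}",
            "fields": ["title", "overview"],
        }
    },
}

def render(source, params):
    """Replace each {{name}} placeholder with its JSON-escaped parameter value."""
    text = json.dumps(source)
    for name, value in params.items():
        # json.dumps adds surrounding quotes; strip them, keep the escaping
        escaped = json.dumps(str(value))[1:-1]
        text = text.replace("{{" + name + "}}", escaped)
    return json.loads(text)

body = render(template_source, {"q": "basketball with cartoon aliens"})
print(json.dumps(body["query"], indent=2))
```

The caller supplies only the template id and params, so the fields or query type can be changed in one place by updating the stored template.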
Aliases



