The author is a third-year undergraduate from Heyuan. The notes below are modest lessons from my own self-study; please point out any mistakes. I will keep improving these notes to help more Java enthusiasts get started.
ElasticSearch 7.6.1 notes
ElasticSearch concepts
Elasticsearch is a real-time distributed full-text search engine built on Lucene. Rather than an ordinary forward index (similar to a MySQL index), it uses an inverted index, which makes fuzzy search extremely fast.
Elasticsearch is operated by sending requests in JSON format.
Elasticsearch's underlying index
We know MySQL's LIKE can do fuzzy search, but it is slow: a LIKE pattern does not use the index, because the underlying structure is a forward index, which looks records up by the complete keyword. Elasticsearch's inverted index instead lets us search with incomplete keywords: an analyzer tokenizes each document (every field gets its own inverted index, apart from the document id), and each resulting token is matched against the documents.
For example, an index named hello contains three documents:
documentid age name
1 18 张三
2 20 李四
3 18 李四
The inverted indexes built from this data:
First inverted index:
age
18 1 , 3
20 2
Second inverted index:
name
张三 1
李四 2 , 3
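The term-to-ids structure above can be sketched in a few lines of Java. This is a toy illustration, not Elasticsearch code: it maps each term to the sorted set of document ids containing it, so finding every document for a term is one map lookup instead of a scan over all rows.

```java
import java.util.*;

// Toy inverted index over the "name" column of the hello example above.
public class InvertedIndexSketch {
    // Map each term to the ids of the documents that contain it.
    static Map<String, Set<Integer>> build(Map<Integer, String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            index.computeIfAbsent(doc.getValue(), k -> new TreeSet<>())
                 .add(doc.getKey());
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new HashMap<>();
        docs.put(1, "张三");
        docs.put(2, "李四");
        docs.put(3, "李四");
        // One lookup returns every matching document id at once.
        System.out.println(build(docs).get("李四")); // [2, 3]
    }
}
```

A real inverted index also runs each field through an analyzer first, so one document contributes many terms, but the term-to-ids principle is the same.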
elasticsearch versus a relational database (MySQL)
For now we can draw the following comparison between es and MySQL:
MySQL database (database) ========== elasticsearch index (index)
MySQL table (table) ============== elasticsearch type ====== (deprecated, to be removed)
MySQL record =========== elasticsearch document
MySQL field ============= elasticsearch field
Some elasticsearch gotchas ***
Cross-origin (CORS) issues
Open elasticsearch's config file elasticsearch.yml
and append at the bottom:
http.cors.enabled: true
http.cors.allow-origin: "*"
Sluggishness from excessive memory use
Elasticsearch is very resource-hungry; its JVM config file shows that by default it asks the JVM for 1 GB of heap at startup. We can change that.
Open elasticsearch's JVM config file jvm.options
and find:
-Xms1g //minimum heap
-Xmx1g //maximum heap
Change it to, for example:
-Xms256m
-Xmx512m
elasticsearch and Kibana version mismatches
If startup fails, or something else goes wrong, check whether the es and Kibana versions match: if es is 7.6, Kibana must also be 7.6.
The ik analyzer
Using the ik analyzer: ik is a Chinese analyzer, but some words (personal names, for example) are not in its dictionary and will not be kept together, so we can extend it.
To use ik, download the ik analyzer plugin, put it in elasticsearch's plugins directory, and name the directory ik.
The ik analyzer offers two tokenization modes: ik_smart and ik_max_word.
ik_smart: coarsest split (as few tokens as possible)
ik_max_word: finest split (as many tokens as possible)
=============================
ik_smart :
GET _analyze // _analyze is a fixed endpoint
{
"text": ["中国共产党"],
"analyzer": "ik_smart"
}
Result:
{
"tokens" : [
{
"token" : "中国共产党",
"start_offset" : 0,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 0
}
]
}
ik_max_word :
GET _analyze
{
"text": ["中国共产党"],
"analyzer": "ik_max_word"
}
Result:
{
"tokens" : [
{
"token" : "中国共产党",
"start_offset" : 0,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "中国",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "国共",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "共产党",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "共产",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "党",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 5
}
]
}
Extending the ik dictionary
GET _analyze
{
"text": ["我是游政杰,very nice"],
"analyzer": "ik_max_word"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "游",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "政",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "杰",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "very",
"start_offset" : 6,
"end_offset" : 10,
"type" : "ENGLISH",
"position" : 5
},
{
"token" : "nice",
"start_offset" : 11,
"end_offset" : 15,
"type" : "ENGLISH",
"position" : 6
}
]
}
The name was not kept together as one token. We can create a dictionary file and add the words we need:
1. Find the IKAnalyzer.cfg.xml file in the ik plugin directory; it is the "IK Analyzer 扩展配置" file, and any custom .dic dictionary you create is registered there (as the ext_dict entry, e.g. my.dic).
2. Create my.dic and add the words you want treated as single tokens.
For example, to make "游政杰" a token, put it on its own line in my.dic.
3. Restart all services.
GET _analyze
{
"text": ["我是游政杰,very nice"],
"analyzer": "ik_max_word"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "游政杰",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "very",
"start_offset" : 6,
"end_offset" : 10,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "nice",
"start_offset" : 11,
"end_offset" : 15,
"type" : "ENGLISH",
"position" : 4
}
]
}
elasticsearch operations (REST style)
The operations below use Kibana as the visual console; Postman also works.
method   URL   description
PUT   localhost:9200/index/type/document-id   create a document (explicit id)
POST   localhost:9200/index/type   create a document (random id)
POST   localhost:9200/index/type/document-id/_update   update a document
DELETE   localhost:9200/index/type/document-id   delete a document
GET   localhost:9200/index/type/document-id   fetch a document by id
POST   localhost:9200/index/type/_search   query all documents
Notice this splits PUT and POST a little differently from vanilla RESTful conventions, where PUT modifies data and POST creates it: here PUT creates (or fully replaces) a document at a chosen id, while POST creates one under a generated id.
The difference between PUT and POST:
PUT is idempotent and POST is not: however many times the same PUT is submitted, the resulting state is the same. POST can be thought of as generating a uuid-style id, different on every request, which is why it is not idempotent.
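The idempotency difference is easy to see in the Kibana console. The index name below is made up for illustration: repeating the PUT only bumps _version on the same document, while repeating the POST creates a brand-new document with a fresh random _id each time.

```
PUT /idem_test/_doc/1      // run twice: same _id "1", result goes "created" then "updated"
{
  "name": "yzj"
}

POST /idem_test/_doc       // run twice: two documents, each with a different random _id
{
  "name": "yzj"
}
```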
Creating an index: PUT /index-name
Example 1:
Create an index named hello03. (When indexing documents, PUT must always specify a document id; POST may omit it and is assigned a random id, precisely because POST is not idempotent.)
PUT /hello03
{
//request body; an empty body means no settings or data
}
Response:
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "hello03"
}
Deleting an index
DELETE hello01
{
}
Inserting data (a document) into an index
PUT /hello03/_doc/1
{
"name": "yzj",
"age" : 18
}
Result:
{
"_index" : "hello03",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
Now look at the hello03 index information:
{
"state": "open",
"settings": {
"index": {
"creation_date": "1618408917052",
"number_of_shards": "1",
"number_of_replicas": "1",
"uuid": "OEVNL7cCQgG74KMPG5LjLA",
"version": {
"created": "7060199"
},
"provided_name": "hello03"
}
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword" //name的底层默认用了keyword(不可分词)
}
}
},
"age": {
"type": "long" //age用了long
}
}
}
},
"aliases": [ ],
"primary_terms": {
"0": 1
},
"in_sync_allocations": {
"0": [
"17d4jyS9RgGEVid4rIANQA"
]
}
}
As we can see, when we don't specify field types, Elasticsearch's defaults are used:
name, for instance, got a text type plus a default keyword sub-field (not analyzed).
So it is well worth specifying the types ourselves when the index is created.
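Until we do specify types, the default mapping is still usable: dynamic mapping gives every string field a keyword sub-field (as seen in the mapping above), which supports exact, un-analyzed matching. A sketch against the hello03 data:

```
GET hello03/_search
{
  "query": {
    "term": {
      "name.keyword": "yzj"    // matches the whole stored value, no analysis
    }
  }
}
```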
Deleting a specific document from an index (by id): DELETE hello01/_doc/004
{
}
Updating a specific document in an index
POST hello02/_update/001
{
"doc": {
"d2":"Java"
}
}
Deleting a specific document in an index
DELETE hello02/_doc/001
{
}
Creating mapped fields
PUT /hello05
{
"mappings": {
"properties": {
"name":{
"type": "text",
"analyzer": "ik_max_word"
},
"say":{
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
Look at the hello05 index information:
{
"state": "open",
"settings": {
"index": {
"creation_date": "1618410744334",
"number_of_shards": "1",
"number_of_replicas": "1",
"uuid": "isCuH2wTQ8S3Yw2MSspvGA",
"version": {
"created": "7060199"
},
"provided_name": "hello05"
}
},
"mappings": {
"_doc": {
"properties": {
"name": {
"analyzer": "ik_max_word", //说明指定字段类型成功了
"type": "text"
},
"say": {
"analyzer": "ik_max_word",
"type": "text"
}
}
}
},
"aliases": [ ],
"primary_terms": {
"0": 1
},
"in_sync_allocations": {
"0": [
"lh6O9N8KQNKtLqD3PSU-Fg"
]
}
}
An index's mapping fields can only be defined once ***
Let's try to add a mappings definition to the hello05 index again:
PUT /hello05
{
"mappings": {
"properties": {
"name":{
"type": "text",
"analyzer": "ik_max_word"
},
"say":{
"type": "text",
"analyzer": "ik_max_word"
},
"age":{
"type": "integer"
}
}
}
}
And it fails with an error!
{
"error" : {
"root_cause" : [
{
"type" : "resource_already_exists_exception",
"reason" : "index [hello05/isCuH2wTQ8S3Yw2MSspvGA] already exists",
"index_uuid" : "isCuH2wTQ8S3Yw2MSspvGA",
"index" : "hello05"
}
],
"type" : "resource_already_exists_exception",
"reason" : "index [hello05/isCuH2wTQ8S3Yw2MSspvGA] already exists",
"index_uuid" : "isCuH2wTQ8S3Yw2MSspvGA",
"index" : "hello05"
},
"status" : 400
}
**Note:**
The reason: once we create an index's mapping, es builds the inverted indexes underneath it, and the existing field mappings can no longer be modified. We can still add new fields, or create a brand-new index and move the old index's data into it with reindex.
So: think carefully when designing an index's mapping properties;
otherwise every unspecified field falls back to the es defaults.
Adding fields to an index with "_mapping"
We said above that mapping fields cannot be modified, but nothing stops us adding new ones; the request shape is slightly different.
PUT hello05/_mapping
{
"properties": {
"ls":{
"type": "keyword"
}
}
}
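To confirm the new field was added, we can ask for the index's current mapping:

```
GET hello05/_mapping
```

The response should now list ls with type keyword alongside name and say.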
Migrating data with _reindex
Use case: after the mapping is set, you find a few fields need to be "changed". Create a new index, define its fields the way you want, then import all the old index's data into it.
POST _reindex
{
"source": {
"index": "hello05",
"type": "_doc"
},
"dest": {
"index": "hello06"
}
}
#! Deprecation: [types removal] Specifying types in reindex requests is deprecated.
{
"took" : 36,
"timed_out" : false,
"total" : 5,
"updated" : 0,
"created" : 5,
"deleted" : 0,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0,
"failures" : [ ]
}
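One caveat worth noting: _reindex copies documents, not the source mapping. So in this scenario the destination index should be created with the corrected mapping before running the reindex; otherwise hello06 falls back to dynamic defaults again. A sketch, where the integer type for age is the "fix" we wanted:

```
PUT /hello06
{
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "ik_max_word" },
      "say":  { "type": "text", "analyzer": "ik_max_word" },
      "age":  { "type": "integer" }
    }
  }
}
```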
Getting index information
GET hello05
{
}
Getting every document in an index (_search)
GET hello05/_search
{
"query": {
"match_all": {}
}
}
Getting a specific document from an index
GET hello05/_doc/1
{
}
Getting all documents in an index (match_all)
GET hello05/_search
{
}
This is the same as the request above:
GET hello05/_search
{
"query": {
"match_all": {}
}
}
match query (a single query condition only)
A match query runs its query text through the analyzer.
GET hello05/_search
{
"query": {
"match": {
"name": "李" //查询条件
}
}
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.9395274,
"hits" : [
{
"_index" : "hello05",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.9395274,
"_source" : {
"name" : "李四",
"age" : 3
}
},
{
"_index" : "hello05",
"_type" : "_doc",
"_id" : "4",
"_score" : 0.79423964,
"_source" : {
"name" : "李小龙",
"age" : 45
}
}
]
}
}
If we add one more query condition:
GET hello05/_search
{
"query": {
"match": {
"name": "李"
, "age": 45
}
}
}
it errors out: match allows only one query condition; multiple conditions can be expressed with query bool must.
{
"error" : {
"root_cause" : [
{
"type" : "parsing_exception",
"reason" : "[match] query doesn't support multiple fields, found [name] and [age]",
"line" : 6,
"col" : 18
}
],
"type" : "parsing_exception",
"reason" : "[match] query doesn't support multiple fields, found [name] and [age]",
"line" : 6,
"col" : 18
},
"status" : 400
}
Exact queries (term) versus analyzed queries (match)
match:
GET hello05/_search
{
"query": {
"match": {
"name": "李龙"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.0519087,
"hits" : [
{
"_index" : "hello05",
"_type" : "_doc",
"_id" : "4",
"_score" : 2.0519087,
"_source" : {
"name" : "李小龙",
"age" : 45
}
},
{
"_index" : "hello05",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.9395274,
"_source" : {
"name" : "李四",
"age" : 3
}
}
]
}
}
**==================**
term :
GET hello05/_search
{
"query": {
"term": {
"name": "李龙"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
The differences:
1. match's query text is run through the analyzer first and then compared against the inverted index (slower than term).
2. term's query text is not analyzed; it is compared against the inverted index directly, which is faster.
3. Like match, term supports only one query condition.
multi_match: Baidu-style search over several fields
The difference between match and multi_match is that match searches the query text in a single field, while multi_match searches it across several fields.
For example, to search "李小龙" in both the title field and the content field, plain match won't do; we need multi_match.
Simulating JD.com product search
PUT /goods
{
"mappings": {
"properties": {
"title":{
"analyzer": "standard",
"type" : "text"
},
"content":{
"analyzer": "standard",
"type": "text"
}
}
}
}
GET goods/_search
{
"query": {
//"华为" below is analyzed, then searched in both the title and content fields
"multi_match": {
"query": "华为",
"fields": ["title","content"]
}
}
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.1568705,
"hits" : [
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.1568705,
"_source" : {
"title" : "华为Mate30",
"content" : "华为Mate30 8+128G,麒麟990Soc",
"price" : "3998"
}
},
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0173018,
"_source" : {
"title" : "华为P40",
"content" : "华为P40 8+256G,麒麟990Soc,贼牛逼",
"price" : "4999"
}
}
]
}
}
Phrase (exact) search (match_phrase)
GET goods/_search
{
"query": {
"match_phrase": {
"content": "华为P40手机"
}
}
}
The query finds no data, because match_phrase is a phrase search, i.e. an exact search:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
Choosing the fields to return (_source)
By default Elasticsearch returns every field, like MySQL's select * from xxx; we can narrow this down to the equivalent of select id, name from xxx.
GET goods/_search
{
"query": {
"multi_match": {
"query": "华为",
"fields": ["title","content"]
}
}
, "_source" : ["title","content"] //指定只显示title和content
}
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.1568705,
"hits" : [
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.1568705,
"_source" : {
"title" : "华为Mate30",
"content" : "华为Mate30 8+128G,麒麟990Soc"
}
},
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0173018,
"_source" : {
"title" : "华为P40",
"content" : "华为P40 8+256G,麒麟990Soc,贼牛逼"
}
}
]
}
}
Sorting (sort)
Because of the earlier mapping mistake, price was never given a type and defaulted to text, which cannot be sorted or range-filtered, so we add another field, od:
POST goods/_update/1
{
"doc": {
"od":1
}
}
(the updates for documents 2, 3 and 4 are omitted)
GET goods/_search
{
"query": {
"multi_match": {
"query": "华为",
"fields": ["title","content"]
}
}
, "sort": [
{
"od": {
"order": "desc" //asc升序,desc降序
}
}
]
}
Pagination (from/size)
GET goods/_search
{
"query": {
"match_all": {}
}
, "sort": [
{
"od": {
"order": "desc"
}
}
]
, "from" : 0
, "size": 2
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "4",
"_score" : null,
"_source" : {
"title" : "IQOONEO5",
"content" : "IQOONEO5 高通骁龙870Soc ,",
"price" : "2499",
"od" : 4
},
"sort" : [
4
]
},
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "3",
"_score" : null,
"_source" : {
"title" : "小米11",
"content" : "小米11 高通骁龙888Soc ,1亿像素",
"price" : "4500",
"od" : 3
},
"sort" : [
3
]
}
]
}
}
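from is a record offset, not a page number: for page n with page size s, from = (n - 1) * s. So the next page of the query above would be:

```
GET goods/_search
{
  "query": { "match_all": {} },
  "sort": [ { "od": { "order": "desc" } } ],
  "from": 2,
  "size": 2
}
```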
Field highlighting (highlight)
One or more fields can be chosen for highlighting; wherever those fields match the query conditions, the match is wrapped in em tags by default.
GET goods/_search
{
"query": {
"match": {
"title": "华为P40"
}
},
"highlight": {
"fields": {
"title": {}
}
}
}
Result:
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.7309713,
"hits" : [
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.7309713,
"_source" : {
"title" : "华为P40",
"content" : "华为P40 8+256G,麒麟990Soc,贼牛逼",
"price" : "4999",
"od" : 1
},
"highlight" : {
"title" : [
"华为P40"
]
}
},
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.5241971,
"_source" : {
"title" : "华为Mate30",
"content" : "华为Mate30 8+128G,麒麟990Soc",
"price" : "3998",
"od" : 2
},
"highlight" : {
"title" : [
"华为Mate30"
]
}
}
]
}
}
The default is the em tag; using a bit of front-end knowledge, we can change the prefix and suffix to our own markup:
GET goods/_search
{
"query": {
"match": {
"title": "华为P40"
}
},
"highlight": {
"pre_tags": "",
"post_tags": "" ,
"fields": {
"title": {}
}
}
}
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.7309713,
"hits" : [
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.7309713,
"_source" : {
"title" : "华为P40",
"content" : "华为P40 8+256G,麒麟990Soc,贼牛逼",
"price" : "4999",
"od" : 1
},
"highlight" : {
"title" : [
"华为P40"
]
}
},
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.5241971,
"_source" : {
"title" : "华为Mate30",
"content" : "华为Mate30 8+128G,麒麟990Soc",
"price" : "3998",
"od" : 2
},
"highlight" : {
"title" : [
"华为Mate30"
]
}
}
]
}
}
Imitating Baidu's search highlighting
Searching 华为P40 on Baidu highlights matches not just in the title but in the content too, which we can reproduce with multi_match plus highlight:
GET goods/_search
{
"query": {
"multi_match": {
"query": "华为P40",
"fields": ["title","content"]
}
}
, "highlight": {
"pre_tags": "",
"post_tags": "",
"fields": {
"title": {},
"content": {}
}
}
}
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.8157697,
"hits" : [
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.8157697,
"_source" : {
"title" : "华为P40",
"content" : "华为P40 8+256G,麒麟990Soc,贼牛逼",
"price" : "4999",
"od" : 1
},
"highlight" : {
"title" : [
"华为P40"
],
"content" : [
"华为P40 8+256G,麒麟990Soc,贼牛逼"
]
}
},
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.8023796,
"_source" : {
"title" : "华为Mate30",
"content" : "华为Mate30 8+128G,麒麟990Soc",
"price" : "3998",
"od" : 2
},
"highlight" : {
"title" : [
"华为Mate30"
],
"content" : [
"华为Mate30 8+128G,麒麟990Soc"
]
}
}
]
}
}
bool query (for multiple conditions)
Similar to MySQL's and / or.
Key point: must behaves like and, should behaves like or.
Using must (and):
Below we put two conditions inside must; with must, both have to be satisfied:
GET goods/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "华为"
}
},
{
"match": {
"content": "MATE30"
}
}
]
}
}
}
Result:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 2.9512205,
"hits" : [
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "2",
"_score" : 2.9512205,
"_source" : {
"title" : "华为Mate30",
"content" : "华为Mate30 8+128G,麒麟990Soc",
"price" : "3998",
"od" : 2
}
}
]
}
}
Using should (or):
should likewise holds two conditions here, but satisfying either one is enough:
GET goods/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"title": "华为"
}
},
{
"match": {
"content": "MATE30"
}
}
]
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.9512205,
"hits" : [
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "2",
"_score" : 2.9512205,
"_source" : {
"title" : "华为Mate30",
"content" : "华为Mate30 8+128G,麒麟990Soc",
"price" : "3998",
"od" : 2
}
},
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.5241971,
"_source" : {
"title" : "华为P40",
"content" : "华为P40 8+256G,麒麟990Soc,贼牛逼",
"price" : "4999",
"od" : 1
}
}
]
}
}
Filters and range conditions (filter range)
For example, with title=xx as the query, we can attach price>4000 as an extra condition. (Note that price was mapped as text, so the range comparison here runs over terms rather than numbers; it happens to behave numerically because these prices all have the same number of digits.)
GET goods/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "小米"
}
}
],"filter": {
"range": {
"price": {
"gt": 4000
}
}
}
}
}
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 2.4135482,
"hits" : [
{
"_index" : "goods",
"_type" : "_doc",
"_id" : "3",
"_score" : 2.4135482,
"_source" : {
"title" : "小米11",
"content" : "小米11 高通骁龙888Soc ,1亿像素",
"price" : "4500",
"od" : 3
}
}
]
}
}
Viewing all of es's index information
GET _cat/indices?v
elasticsearch's Java API: preparation
1. Import the elasticsearch high-level client and elasticsearch dependencies (note: the versions must match the local es, which here is 7.6.1), plus fastjson:
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.6.1</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>7.6.1</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.75</version>
</dependency>
2. Open the RestHighLevelClient constructor:
public RestHighLevelClient(RestClientBuilder restClientBuilder) {
this(restClientBuilder, Collections.emptyList());
}
We see it needs a RestClientBuilder, and that object is obtained through RestClient, not by constructing a RestClientBuilder ourselves.
3. Open RestClient:
public static RestClientBuilder builder(HttpHost... hosts) {
if (hosts == null || hosts.length == 0) {
throw new IllegalArgumentException("hosts must not be null nor empty");
}
List<Node> nodes = Arrays.stream(hosts).map(Node::new).collect(Collectors.toList());
return new RestClientBuilder(nodes);
}
RestClient's builder method yields the RestClientBuilder. Next, look into HttpHost:
public HttpHost(String hostname, int port, String scheme) { //es host name, es port, scheme (http by default)
this.hostname = (String)Args.containsNoBlanks(hostname, "Host name");
this.lcHostname = hostname.toLowerCase(Locale.ROOT);
if (scheme != null) {
this.schemeName = scheme.toLowerCase(Locale.ROOT);
} else {
this.schemeName = "http";
}
this.port = port;
this.address = null;
}
4. With that, the setup looks like this:
HttpHost httpHost = new HttpHost("localhost",9200,"http");
RestClientBuilder restClientBuilder = RestClient.builder(httpHost);
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(restClientBuilder);
5. For convenience, we can hand this RestHighLevelClient to the Spring IoC container and simply autowire it later:
@Configuration
public class esConfig {
@Bean
public RestHighLevelClient restHighLevelClient(){
HttpHost httpHost = new HttpHost("localhost",9200,"http");
RestClientBuilder builder = RestClient.builder(httpHost);
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
return restHighLevelClient;
}
}
Index operations
The Java elasticsearch API goes through restHighLevelClient.indices().xxxxx() for every index operation.
Creating an index
//create an index
@Test
public void createIndex() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
//new a create-index request, passing the name of the index to create
CreateIndexRequest createIndexRequest = new CreateIndexRequest("java01");
//send the create-index request to es.
CreateIndexResponse createIndexResponse = restHighLevelClient.indices().create(createIndexRequest, RequestOptions.DEFAULT);
restHighLevelClient.close();
}
Deleting an index
//delete an index
@Test
public void deleteIndex() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
//new a delete-index request, passing the name of the index to delete
DeleteIndexRequest deleteIndexRequest = new DeleteIndexRequest("java01");
//send the delete-index request through restHighLevelClient
restHighLevelClient.indices().delete(deleteIndexRequest,RequestOptions.DEFAULT);
restHighLevelClient.close();
}
Checking whether an index exists
//check whether an index exists
@Test
public void indexExists() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
GetIndexRequest getIndexRequest = new GetIndexRequest("goods");
boolean exists = restHighLevelClient.indices().exists(getIndexRequest, RequestOptions.DEFAULT);
System.out.println(exists);
}
Document operations
Creating a document with a specified id
//create a document
@Test
public void createIndexDoc() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
IndexRequest indexRequest = new IndexRequest("hello");
//specify the document id
indexRequest.id("1");
Map<String, Object> source = new HashMap<>();
source.put("a_age","50");
source.put("a_address","广州");
//in es everything is JSON, so convert the Map to a JSON string with fastjson and declare the content type as XContentType.JSON
indexRequest.source(JSON.toJSONString(source), XContentType.JSON);
IndexResponse response = restHighLevelClient.index(indexRequest, RequestOptions.DEFAULT);
System.out.println("response:"+response);
System.out.println("status:"+response.status());
}
Deleting the document with a given id
//delete a document
@Test
public void deleteDoc() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
DeleteRequest deleteRequest = new DeleteRequest("hello");
deleteRequest.id("1");
DeleteResponse delete = restHighLevelClient.delete(deleteRequest, RequestOptions.DEFAULT);
System.out.println(delete.status());
}
Updating the document with a given id
//update a document
@Test
public void updateDoc() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
UpdateRequest updateRequest = new UpdateRequest("hello","1");
Map<String, Object> source = new HashMap<>();
source.put("a_address","河源");
updateRequest.doc(JSON.toJSONString(source),XContentType.JSON);
UpdateResponse response = restHighLevelClient.update(updateRequest, RequestOptions.DEFAULT);
System.out.println(response.status());
}
Fetching the document with a given id
//fetch a document
@Test
public void getDoc() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
GetRequest getRequest = new GetRequest("hello");
getRequest.id("1");
GetResponse response = restHighLevelClient.get(getRequest, RequestOptions.DEFAULT);
String sourceAsString = response.getSourceAsString();
System.out.println(sourceAsString);
}
Search (match the whole index with match_all)
//search (match_all)
@Test
public void search_matchAll() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
SearchRequest searchRequest = new SearchRequest("hello");
//equivalent to the console request body
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
MatchAllQueryBuilder matchAllQueryBuilder = QueryBuilders.matchAllQuery();
searchSourceBuilder.query(matchAllQueryBuilder); //equivalent to the query section of _search
searchRequest.source(searchSourceBuilder);
SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
SearchHit[] hits = search.getHits().getHits();
for (SearchHit hit : hits) {
System.out.println(hit.getSourceAsString());
}
}
Search (analyzed match query)
//match fuzzy search
@Test
public void search_match() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
SearchRequest searchRequest = new SearchRequest();
//the query body
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
MatchQueryBuilder matchQueryBuilder = QueryBuilders.matchQuery("a_address", "广州");
searchSourceBuilder.query(matchQueryBuilder);
searchRequest.source(searchSourceBuilder);
SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
SearchHit[] hits = search.getHits().getHits();
for (SearchHit hit : hits) {
System.out.println(hit.getSourceAsString());
}
}
Search (multi-field multi_match)
//search (multi-field multi_match)
@Test
public void search_multiMatch() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
SearchRequest searchRequest = new SearchRequest("goods");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(QueryBuilders.multiMatchQuery("华为","title","content"));
searchRequest.source(searchSourceBuilder);
SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
SearchHit[] hits = search.getHits().getHits();
for (SearchHit hit : hits) {
System.out.println(hit.getSourceAsString());
}
}
Search (selecting returned fields with fetchSource)
The fetchSource method corresponds to _source in the console.
//fetchSource selects the returned fields (_source)
@Test
public void search_source() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
SearchRequest searchRequest = new SearchRequest("goods");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(QueryBuilders.matchAllQuery());
String[] includes={"title"}; //fields to include
String[] excludes={}; //fields to exclude
searchSourceBuilder.fetchSource(includes,excludes);
searchRequest.source(searchSourceBuilder);
SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
SearchHit[] hits = search.getHits().getHits();
for (SearchHit hit : hits) {
System.out.println(hit.getSourceAsString());
}
}
Pagination, sorting, field highlighting
We want to translate the following es console request into Java code:
GET goods/_search
{
"query": {
"match": {
"title": "华为"
}
},"sort": [
{
"od": {
"order": "desc"
}
}
]
,"from": 0,
"size": 1,
"highlight": {
"pre_tags": "",
"post_tags": "",
"fields": {
"title": {}
}
}
}
Java implementation:
//pagination, sorting, field highlighting
@Test
public void page_sort_HighLight() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
SearchRequest searchRequest = new SearchRequest("goods");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
MatchQueryBuilder matchQueryBuilder = QueryBuilders.matchQuery("title", "华为");
searchSourceBuilder.query(matchQueryBuilder);
//pagination ====
searchSourceBuilder.from(0);
searchSourceBuilder.size(1);
//=======
//sorting
searchSourceBuilder.sort("od", SortOrder.DESC);
//field highlighting
//=========highlighting begins==
HighlightBuilder highlightBuilder = new HighlightBuilder();
//build the highlight prefix and suffix tags (pre_tag and post_tag)
highlightBuilder.preTags("");
highlightBuilder.postTags("");
//highlightBuilder.field() takes the field name as a String
highlightBuilder.field("title");
//to highlight more fields, call field() once per field
// highlightBuilder.field(); //second highlighted field
// highlightBuilder.field(); //third highlighted field, and so on
searchSourceBuilder.highlighter(highlightBuilder);
//====================highlighting ends
searchRequest.source(searchSourceBuilder);
SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
SearchHit[] hits = search.getHits().getHits(); //hits wraps every matched document
for (SearchHit hit : hits) {
Map<String, HighlightField> highlightFields = hit.getHighlightFields();
System.out.println("highlightMap:"+highlightFields);
//fetch the fragments under the title key
//each fragment holds the highlighted field content (very useful for overwriting the plain, un-highlighted value), e.g. 华为Mate30
System.out.println("fragments:"+Arrays.toString(highlightFields.get("title").getFragments()));
}
restHighLevelClient.close();
}
Boolean search (bool)
Reproducing es console code like the following:
GET goods/_search
{
"query": {
"bool": {
"should": [
{
"term": {
"title": {
"value": "华"
}
}
},
{
"term": {
"title": {
"value": "米"
}
}
}
]
}
}
}
Java implementation:
//boolean search (bool)
@Test
public void search_bool() throws IOException {
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);
SearchRequest searchRequest = new SearchRequest("goods");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
//build a bool query object
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
//each should() call takes a single condition; for several conditions, call should several times
//the console example above has two should conditions, so we make two calls here
boolQueryBuilder.should(QueryBuilders.termQuery("title","华"));
boolQueryBuilder.should(QueryBuilders.termQuery("title","米"));
searchSourceBuilder.query(boolQueryBuilder);
searchRequest.source(searchSourceBuilder);
SearchResponse search = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
SearchHit[] hits = search.getHits().getHits();
for (SearchHit hit : hits) {
System.out.println(hit.getSourceAsString());
}
restHighLevelClient.close();
}
es hands-on (JD.com product search)
Scraping data from JD.com
1. Add the dependency:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>
2. Create the entity class:
public class goods {
    private String img;    //product image
    private String price;  //product price
    private String title;  //product title
    public goods() {
    }
    public goods(String img, String price, String title) {
        this.img = img;
        this.price = price;
        this.title = title;
    }
    public String getImg() {
        return img;
    }
    public void setImg(String img) {
        this.img = img;
    }
    public String getPrice() {
        return price;
    }
    public void setPrice(String price) {
        this.price = price;
    }
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    @Override
    public String toString() {
        return "goods{" +
                "img='" + img + '\'' +
                ", price='" + price + '\'' +
                ", title='" + title + '\'' +
                '}';
    }
}
3. Parse JD's search page with jsoup (the core step) and write a utility class:
@Component
public class jsoupUtils {
    private static RestHighLevelClient restHighLevelClient;
    @Autowired
    public void setRestHighLevelClient(RestHighLevelClient restHighLevelClient) {
        jsoupUtils.restHighLevelClient = restHighLevelClient;
    }
    public static void searchData_JD(String keyword) {
        BulkRequest bulkRequest = new BulkRequest();
        try {
            URL url = null;
            try {
                url = new URL("https://search.jd.com/Search?keyword=" + keyword);
            } catch (MalformedURLException e) {
                e.printStackTrace();
            }
            Document document = null; //parse the URL with jsoup
            try {
                document = Jsoup.parse(url, 30000);
            } catch (IOException e) {
                e.printStackTrace();
            }
            Element e1 = document.getElementById("J_goodsList");
            Elements e_lis = e1.getElementsByTag("li");
            for (Element e_li : e_lis) {
                //several prices may be present (some items have bundle prices); take the first one
                Elements e_price = e_li.getElementsByClass("p-price");
                String text = e_price.get(0).text();
                //the text may hold both the normal price and the JD PLUS member price, so cut it apart
                String realPrice = "¥";
                //index 0 is the ¥ sign itself, so scan from 1 and stop at the next ¥
                for (int i = 1; i < text.length(); i++) {
                    if (text.charAt(i) == '¥') {
                        break;
                    } else {
                        realPrice += text.charAt(i);
                    }
                }
                //product image
                Elements e_img = e_li.getElementsByClass("p-img");
                Elements img = e_img.get(0).getElementsByTag("img");
                //JD stores product images not in src but in the lazy-load attribute data-lazy-img
                String src = img.get(0).attr("data-lazy-img");
                System.out.println("http:" + src);
                //price
                System.out.println(realPrice);
                //product title
                Elements e_title = e_li.getElementsByClass("p-name");
                String title = e_title.get(0).getElementsByTag("em").text();
                System.out.println(title);
                IndexRequest indexRequest = new IndexRequest("jd_goods");
                //assemble the document
                Map<String, Object> good = new HashMap<>();
                good.put("img", "http:" + src);
                good.put("price", realPrice);
                good.put("title", title);
                IndexRequest source = indexRequest.source(JSON.toJSONString(good), XContentType.JSON);
                bulkRequest.add(source);
            }
            //bulk write, to cut down the number of round trips to the es server
            restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}
4. Use the utility class:
public static void main(String[] args) {
SpringApplication.run(DemoApplication.class, args);
jsoupUtils.searchData_JD("vivo");
}
With the data in place, we can render it on a page.



