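Preface: for real searching, Google and Baidu are already good enough. The search engine built here is just a small exercise to make some of my follow-up work easier.
To build a search engine, you first need to think through the complete architecture.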
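The architecture has five parts: page crawling, storage, analysis, search implementation, and presentation.
Page crawling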
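First, for page crawling I plan to take the simplest approach, HttpClient. Some might object that this will miss the large number of sites built with Web 2.0 techniques. That is true, but at the start I deliberately leave out the complicated parts in order to validate that the architecture works.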
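Storage
Next, storage. I plan to use the file system directly for persistent storage and, at search time, load all of the results into memory. Some might say this eats a lot of memory. True, but I can use a large amount of swap space and trade performance for memory.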
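Analysis
For analysis, I plan to run a word-segmentation algorithm directly, compute term frequencies, and build an inverted index over the articles. I do not store every word of every article in the inverted index, though, given the read/write performance of an unoptimized file system. My approach is to take the words whose frequency falls in the 20~50 range, plus the segmented tokens of the site title, as the site's keywords, and build the inverted index from those. To keep this description from being too bare and abstract, here is the final structure: [screenshot: image.png]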
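To make the keyword rule concrete, here is a minimal sketch of the selection step. It assumes segmentation has already produced a list of tokens; `KeywordSelectionSketch`, `termFrequencies`, and `selectKeywords` are hypothetical names for illustration and are not part of the project's `Nlp` class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not the author's Nlp implementation.
class KeywordSelectionSketch {

    /** Counts how often each token occurs in the page body. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Keeps body tokens with minFreq <= frequency <= maxFreq, plus every title token. */
    static Set<String> selectKeywords(List<String> bodyTokens, List<String> titleTokens,
                                      int minFreq, int maxFreq) {
        Set<String> keywords = new TreeSet<>(titleTokens);
        for (Map.Entry<String, Integer> e : termFrequencies(bodyTokens).entrySet()) {
            if (e.getValue() >= minFreq && e.getValue() <= maxFreq) {
                keywords.add(e.getKey());
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        List<String> body = new ArrayList<>();
        for (int i = 0; i < 25; i++) body.add("search"); // 25 occurrences: inside 20~50
        for (int i = 0; i < 60; i++) body.add("the");    // 60 occurrences: too frequent
        body.add("rare");                                 // 1 occurrence: too rare
        System.out.println(selectKeywords(body, Arrays.asList("engine", "demo"), 20, 50));
    }
}
```

The upper bound acts as a crude stop-word filter: very frequent tokens tend to be function words that say little about what a site is about.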
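Each file is named after a token, and the file stores the domains of all sites whose keywords include that token.
This is somewhat like the storage principle underlying Elasticsearch, though I have not done any optimization.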
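As a rough illustration of this one-file-per-token layout, the sketch below appends a domain to `<token>.md` and reads the domains back. The `#### domain` line format is an assumption, inferred from the loader in this post that collects lines starting with `####`; `PostingFileSketch` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a file-per-token posting list.
class PostingFileSketch {

    /** Appends one "#### domain" line to the file named after the token. */
    static void appendDomain(Path indexDir, String token, String domain) throws IOException {
        Files.createDirectories(indexDir);
        Path file = indexDir.resolve(token + ".md");
        String line = "#### " + domain + System.lineSeparator();
        Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Reads every domain recorded for the token; empty if the file does not exist. */
    static Set<String> readDomains(Path indexDir, String token) throws IOException {
        Set<String> domains = new TreeSet<>();
        Path file = indexDir.resolve(token + ".md");
        if (!Files.exists(file)) {
            return domains;
        }
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (trimmed.startsWith("#### ")) {
                domains.add(trimmed.substring(5).trim());
            }
        }
        return domains;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index-sketch");
        appendDomain(dir, "search", "example.com");
        appendDomain(dir, "search", "example.org");
        System.out.println(readDomains(dir, "search"));
    }
}
```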
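Search implementation
For the search implementation, I plan to load the files described above directly into memory and store them in a HashMap for easy lookup.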
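A lookup over such a map might look like the sketch below. Answering a multi-token query by intersecting the per-token domain sets is my assumption for illustration; the post itself only says that the index is held in a HashMap. `InMemorySearchSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of querying an in-memory inverted index.
class InMemorySearchSketch {
    // token -> domains whose keywords contain the token
    private final Map<String, Set<String>> index = new HashMap<>();

    void add(String token, String domain) {
        index.computeIfAbsent(token, k -> new TreeSet<>()).add(domain);
    }

    /** Returns the domains that match every query token (set intersection). */
    Set<String> search(List<String> queryTokens) {
        Set<String> result = null;
        for (String token : queryTokens) {
            Set<String> domains = index.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(domains); // copy so the index itself is never modified
            } else {
                result.retainAll(domains);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InMemorySearchSketch engine = new InMemorySearchSketch();
        engine.add("search", "a.example");
        engine.add("search", "b.example");
        engine.add("engine", "b.example");
        System.out.println(engine.search(Arrays.asList("search", "engine")));
    }
}
```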
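Presentation
So that it is available at a click, I plan to implement the presentation layer as a Google Chrome extension.
Now that the architecture is roughly settled in theory, let's start implementing it.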
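Hands-on implementation
Page crawling
As mentioned, page crawling here uses HttpClient directly; beyond that, it also involves parsing the outbound links of each page. Before getting to link parsing, let me first describe my crawling approach.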
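Picture the whole internet as one giant web in which sites are connected to one another by links. A large number of sites are isolated islands, but that does not prevent crawling the vast majority. So the approach here is a breadth-first traversal started from multiple seed nodes: for each site, fetch only the home page, extract all of its outbound links, and then treat those as the next crawl targets.
The page-crawling code is as follows: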
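Before the real crawler, the traversal order can be sketched on a small in-memory link graph with no network access. `BfsCrawlSketch`, `crawlOrder`, and the sample domains are hypothetical and exist only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration of multi-seed breadth-first traversal.
class BfsCrawlSketch {

    /** Visits sites breadth-first from several seeds, stopping after maxSites visits. */
    static List<String> crawlOrder(Map<String, List<String>> outLinks,
                                   List<String> seeds, int maxSites) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>(); // queued or visited, so no site is fetched twice
        for (String seed : seeds) {
            if (seen.add(seed)) {
                queue.add(seed);
            }
        }
        List<String> visited = new ArrayList<>();
        while (!queue.isEmpty() && visited.size() < maxSites) {
            String site = queue.poll();
            visited.add(site); // stands in for "fetch the home page"
            for (String next : outLinks.getOrDefault(site, Collections.emptyList())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a.example", Arrays.asList("b.example", "c.example"));
        links.put("b.example", Arrays.asList("c.example", "d.example"));
        links.put("x.example", Arrays.asList("a.example", "y.example"));
        // prints [a.example, x.example, b.example, c.example, y.example, d.example]
        System.out.println(crawlOrder(links, Arrays.asList("a.example", "x.example"), 100));
    }
}
```

Both seeds are visited before any discovered site, and the `seen` set keeps a site that is linked from several places from being queued more than once.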
import com.chaojilaji.auto.autocode.generatecode.GenerateFile;
import com.chaojilaji.auto.autocode.standartReq.SendReq;
import com.chaojilaji.auto.autocode.utils.Json;
import com.chaojilaji.moneyframework.model.OnePage;
import com.chaojilaji.moneyframework.model.Word;
import com.chaojilaji.moneyframework.service.Nlp;
import com.chaojilaji.moneyframework.utils.DomainUtils;
import com.chaojilaji.moneyframework.utils.HtmlUtil;
import com.chaojilaji.moneyframework.utils.MDUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.stereotype.Service;
import org.springframework.util.StringUtils;
import java.io.*;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;
public class HttpClientCrawl {
    private static final Log logger = LogFactory.getLog(HttpClientCrawl.class);

    // Domains that have already been crawled or indexed.
    public Set<String> oldDomains = new ConcurrentSkipListSet<>();
    // Domain -> page record, filled by loadDomains().
    public Map<String, OnePage> onePageMap = new ConcurrentHashMap<>(400000);
    // Domain suffixes that are never crawled.
    public Set<String> ignoreSet = new ConcurrentSkipListSet<>();
    // Inverted index: token -> domains whose keywords contain that token.
    public Map<String, Set<String>> siteMaps = new ConcurrentHashMap<>(50000);
    // Seed domain for this crawler instance.
    public String domain;

    public HttpClientCrawl(String domain) {
        this.domain = DomainUtils.getDomainWithCompleteDomain(domain);
        String[] ignores = {"gov.cn", "apac.cn", "org.cn", "twitter.com"
                , "baidu.com", "google.com", "sina.com", "weibo.com"
                , "github.com", "sina.com.cn", "sina.cn", "edu.cn", "wordpress.org", "sephora.com"};
        ignoreSet.addAll(Arrays.asList(ignores));
        loadIgnore();
        loadWord();
    }

    // Browser-like default request headers.
    private Map<String, String> defaultHeaders() {
        Map<String, String> ans = new HashMap<>();
        ans.put("Accept", "application/json, text/plain, */*");
        ans.put("Content-Type", "application/json");
        ans.put("Connection", "keep-alive");
        ans.put("Accept-Language", "zh-CN,zh;q=0.9");
        ans.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36");
        return ans;
    }

    // The generic type of params was lost in the original listing; String values are assumed here.
    public SendReq.ResBody doRequest(String url, String method, Map<String, String> params) {
        return SendReq.sendReq(url, method, params, defaultHeaders());
    }

    // Loads the persisted ignore list, which is stored as the toString() form of a set: "[a, b, c]".
    public void loadIgnore() {
        File directory = new File(".");
        try {
            String file = directory.getCanonicalPath() + "/moneyframework/generate/ignore/demo.txt";
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(file))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String x = line.replace("[", "").replace("]", "").replace(" ", "");
                    String[] y = x.split(",");
                    ignoreSet.addAll(Arrays.asList(y));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Loads previously saved pages (one per line) from a file relative to the working directory.
    public void loadDomains(String file) {
        File directory = new File(".");
        try {
            // Build the path with File so the separator is correct on every platform.
            File file1 = new File(directory.getCanonicalPath(), file);
            logger.info(file1.getPath());
            if (!file1.exists()) {
                file1.createNewFile();
            }
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file1)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    OnePage onePage = new OnePage(line);
                    if (!oldDomains.contains(onePage.getDomain())) {
                        onePageMap.put(onePage.getDomain(), onePage);
                        oldDomains.add(onePage.getDomain());
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Adds a site to the inverted index. Each entry of s has the form "word count".
    public void handleWord(List<String> s, String domain, String title) {
        for (String a : s) {
            String[] parts = a.split(" ");
            String x = parts[0];
            String y = parts[1];
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            // Only words that occur at least 10 times on the page are indexed here.
            if (Integer.parseInt(y) >= 10) {
                if (z.contains(domain)) continue;
                z.add(domain);
                siteMaps.put(x, z);
                GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md", MDUtils.getMdContent(domain, title, s.toString()));
            }
        }
        // Every token of the title counts as a keyword regardless of frequency.
        Set<Word> titleWords = Nlp.separateWordAndReturnUnit(title);
        for (Word word : titleWords) {
            String x = word.getWord();
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (z.contains(domain)) continue;
            z.add(domain);
            siteMaps.put(x, z);
            GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md", MDUtils.getMdContent(domain, title, s.toString()));
        }
    }

    // Rebuilds the in-memory inverted index from the per-token markdown files.
    public void loadWord() {
        File directory = new File(".");
        try {
            File file1 = new File(directory.getCanonicalPath() + "/moneyframework/domain/markdown");
            File[] files = file1.isDirectory() ? file1.listFiles() : null;
            if (files != null) {
                int fileCnt = 0;
                for (File file : files) {
                    fileCnt++;
                    String token = file.getName().replace(".md", "");
                    try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
                        String line;
                        siteMaps.put(token, new ConcurrentSkipListSet<>());
                        while ((line = reader.readLine()) != null) {
                            line = line.trim();
                            // Lines starting with "####" carry a domain.
                            if (line.startsWith("####")) {
                                siteMaps.get(token).add(line.replace("#### ", "").trim());
                            }
                        }
                    } catch (Exception e) {
                        // Skip unreadable files, but record them instead of failing silently.
                        logger.warn("failed to load " + file.getName(), e);
                    }
                    if ((fileCnt % 1000) == 0) {
                        logger.info((fileCnt * 100.0) / files.length + "%");
                    }
                }
            }
            for (Map.Entry<String, Set<String>> entry : siteMaps.entrySet()) {
                oldDomains.addAll(entry.getValue());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Breadth-first crawl starting from this.domain; only the home page of each site is fetched.
    public void doTask() {
        String root = "http://" + this.domain + "/";
        Queue<String> urls = new LinkedList<>();
        urls.add(root);
        // Domains already queued or crawled, so that each one is visited at most once.
        Set<String> tmpDomains = new HashSet<>(oldDomains);
        tmpDomains.add(DomainUtils.getDomainWithCompleteDomain(root));
        int cnt = 0;
        while (!urls.isEmpty()) {
            String url = urls.poll();
            SendReq.ResBody html = doRequest(url, "GET", new HashMap<>());
            cnt++;
            // A code of 0 puts the domain on the persisted ignore list.
            if (html.getCode().equals(0)) {
                ignoreSet.add(DomainUtils.getDomainWithCompleteDomain(url));
                try {
                    GenerateFile.createFile2("moneyframework/generate/ignore", "demo.txt", ignoreSet.toString());
                } catch (IOException e) {
                    e.printStackTrace();
                }
                continue;
            }
            OnePage onePage = new OnePage();
            onePage.setUrl(url);
            onePage.setDomain(DomainUtils.getDomainWithCompleteDomain(url));
            onePage.setCode(html.getCode());
            String title = HtmlUtil.getTitle(html.getResponce()).trim();
            // Reject empty, overlong, or mis-decoded titles (U+FFFD is the replacement character).
            if (!StringUtils.hasText(title) || title.length() > 100 || title.contains("\uFFFD")) {
                title = "没有"; // placeholder title, meaning "none"
            }
            onePage.setTitle(title);
            String content = HtmlUtil.getContent(html.getResponce());
            Set<Word> words = Nlp.separateWordAndReturnUnit(content);
            List<String> wordStr = Nlp.print2List(new ArrayList<>(words), 10);
            handleWord(wordStr, DomainUtils.getDomainWithCompleteDomain(url), title);
            onePage.setContent(wordStr.toString());
            if (html.getCode().equals(200)) {
                List<String> domains = HtmlUtil.getUrls(html.getResponce());
                for (String outDomain : domains) {
                    // split() takes a regex, so the dot must be escaped; skip hosts with 4 or more labels.
                    String[] labels = outDomain.split("\\.");
                    if (labels.length >= 4) {
                        continue;
                    }
                    boolean ignored = false;
                    for (String i : ignoreSet) {
                        if (outDomain.endsWith(i)) {
                            ignored = true;
                            break;
                        }
                    }
                    if (ignored) continue;
                    if (StringUtils.hasText(outDomain.trim())) {
                        if (!tmpDomains.contains(outDomain)) {
                            tmpDomains.add(outDomain);
                            urls.add("http://" + outDomain + "/");
                        }
                    }
                }
                logger.info(this.domain + " queue size: " + urls.size());
                // Stop after 2000 requests from this seed.
                if (cnt >= 2000) {
                    break;
                }
            } else {
                // Any other status: retry the same site once over HTTPS.
                if (url.startsWith("http:")) {
                    urls.add("https:" + url.substring("http:".length()));
                }
            }
        }
    }
}
Here, SendReq.sendReq is a page-download method I implemented myself, which calls HttpClient under the hood. If you want to crawl Web 2.0 sites as well, consider wrapping Playwright inside it.
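The source of SendReq is not shown in this post. As a rough stand-in, the sketch below downloads a page with the JDK's built-in `java.net.http.HttpClient` (Java 11+), which is an assumption on my part and may not be the HttpClient library the project actually uses. `FetchSketch` and `Result` are hypothetical names. Returning code 0 on failure mirrors how the crawler above treats a code of 0.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical stand-in for a page downloader; not the author's SendReq.
class FetchSketch {

    /** Status code plus body; code 0 means the request could not be completed. */
    static final class Result {
        final int code;
        final String body;

        Result(int code, String body) {
            this.code = code;
            this.body = body;
        }
    }

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    static Result fetch(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(10))
                    .header("Accept-Language", "zh-CN,zh;q=0.9")
                    .GET()
                    .build();
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            return new Result(response.statusCode(), response.body());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return new Result(0, "");
        } catch (Exception e) {
            // Malformed URL, DNS failure, timeout, and so on.
            return new Result(0, "");
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://example.com/";
        Result result = fetch(url);
        System.out.println(result.code + " " + result.body.length() + " chars");
    }
}
```

A plain HTTP client only sees the HTML the server sends, so pages rendered by JavaScript would still need a browser-driven tool such as Playwright.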
Next is HtmlUtil, a utility class that formats the HTML, strips tags, and removes the various garbled characters caused by special characters.
import org.apache.commons.lang3.StringEscapeUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HtmlUtil {
    public static String getContent(String html) {
        String ans = "";
        try {
            html = StringEscapeUtils.unescapeHtml4(html);
            html = delHTMLTag(html);
            html = htmlTextFormat(html);
            return html;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return ans;
    }

    public static String delHTMLTag(String htmlStr) {
String regEx_script = "


