Implementing an Asynchronous Crawler in Java and Inserting the Data into MySQL

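Contents

- Implementing an asynchronous crawler in Java and inserting the data into MySQL
- Maven dependencies
- Project structure
- Creating the bean class
- The crawler
- Getting the response
- Registering the database
- Database operations with QueryRunner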

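When a crawler fetches a site's raw source, the data is sometimes missing: it is loaded by JavaScript after the initial request, so it never appears in the static HTML. An asynchronous crawler, which renders the page and runs its JavaScript before reading it, solves this problem.

Maven dependencies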



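    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0">
        <modelVersion>4.0.0</modelVersion>

        <groupId>pesticate1</groupId>
        <artifactId>1</artifactId>
        <version>1.0-SNAPSHOT</version>

        <dependencies>
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.14.2</version>
            </dependency>
            <dependency>
                <groupId>net.sourceforge.htmlunit</groupId>
                <artifactId>htmlunit</artifactId>
                <version>2.43.0</version>
            </dependency>
            <dependency>
                <groupId>commons-dbutils</groupId>
                <artifactId>commons-dbutils</artifactId>
                <version>1.6</version>
            </dependency>
            <dependency>
                <groupId>commons-dbcp</groupId>
                <artifactId>commons-dbcp</artifactId>
                <version>1.4</version>
            </dependency>
            <dependency>
                <groupId>mysql</groupId>
                <artifactId>mysql-connector-java</artifactId>
                <version>8.0.25</version>
            </dependency>
        </dependencies>
    </project>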

Project structure

Creating the bean class

public class Modell {
    // The bean whose fields are inserted into the database
    private int id;
    private String title;
    private String add;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getAdd() {
        return add;
    }

    public void setAdd(String add) {
        this.add = add;
    }
}
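
As a rough sketch of how a bean like this is typically consumed later on, the helper below turns one object into the positional parameter array that a parameterized INSERT expects. The compact `Modell` copy, the `toParams` helper, and the column order (id, title, add) are assumptions made for illustration; the article itself does not show this step.

```java
import java.util.Arrays;

// Compact copy of the Modell bean so that the sketch compiles on its own.
class Modell {
    private int id;
    private String title;
    private String add;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getAdd() { return add; }
    public void setAdd(String add) { this.add = add; }
}

class ModellParams {
    // Hypothetical helper: returns the bean's fields in the assumed
    // column order (id, title, add) for use with "?" placeholders.
    static Object[] toParams(Modell m) {
        return new Object[] { m.getId(), m.getTitle(), m.getAdd() };
    }

    public static void main(String[] args) {
        Modell m = new Modell();
        m.setId(1);
        m.setTitle("example title");
        m.setAdd("example add value");
        // prints [1, example title, example add value]
        System.out.println(Arrays.toString(toParams(m)));
    }
}
```

Passing values as parameters rather than concatenating them into the SQL string keeps scraped text from being interpreted as SQL.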

The project uses HtmlUnit to fetch the rendered page and jsoup to parse the data out of it.

The crawler

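// Use the HtmlUnit framework to fetch the page after its JavaScript has run.
// jsoup, the other framework, is used to parse the data out of the page.
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.HttpClientBuilder;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class URLFecter {
    public static List URLparser(String url) throws Exception {
        // Create the client used for the plain HTTP request.
        // Pitfall: DefaultHttpClient did not work here; it is deprecated in newer
        // HttpClient versions, so HttpClientBuilder.create().build() is used instead.
        HttpClient client = HttpClientBuilder.create().build();
        // The list that will be returned
        List data = new ArrayList();
        // Send the request and check the status code
        HttpResponse response = HTTPUtils.getHtml(client, url);
        int statusCode = response.getStatusLine().getStatusCode();
        System.out.println(statusCode);
        if (statusCode == 200) {
            // The connection succeeded. try-with-resources closes the WebClient
            // only after the page has been rendered and read.
            try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) { // emulate Chrome
                webClient.getOptions().setThrowExceptionOnScriptError(false);        // do not throw on JavaScript errors
                webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);  // do not throw on failing status codes
                webClient.getOptions().setActiveXNative(false);                      // do not enable ActiveX
                webClient.getOptions().setCssEnabled(false);                         // do not process CSS
                webClient.getOptions().setJavaScriptEnabled(true);                   // enable JavaScript
                webClient.getOptions().setDownloadImages(false);                     // do not download images
                // Support JavaScript code that makes AJAX calls
                webClient.setAjaxController(new NicelyResynchronizingAjaxController());
                // Fetch the page
                HtmlPage page = webClient.getPage(url);
                // Wait up to 5 seconds for background JavaScript to finish
                webClient.waitForBackgroundJavaScript(5000);
                // Serialize the rendered page as XML
                String pageXml = page.asXml();
                // Parse the data out of the page
                data = Parse.getData(pageXml);
            } catch (Exception e) {
                e.printStackTrace();
            }
        } else {
            System.out.println("The server did not respond!");
        }
        return data;
    }
}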
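
The order of operations matters in this method: the browser object has to stay open until the background JavaScript has finished and the page has been read, and only then should it be closed. The sketch below illustrates that lifecycle with a hypothetical `FakeBrowser` stand-in (it is not an HtmlUnit class, and its method names are invented for the demonstration), so it runs without any dependency.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a headless browser; not part of HtmlUnit.
class FakeBrowser implements AutoCloseable {
    private boolean closed = false;
    final List<String> log = new ArrayList<>();

    String getPage(String url) {
        ensureOpen();
        log.add("getPage");
        return "<html><body><div id=\"data\"></div></body></html>";
    }

    void waitForBackgroundWork(long timeoutMillis) {
        ensureOpen();
        log.add("wait");
    }

    // Any use after close() fails, which is why close() must come last.
    private void ensureOpen() {
        if (closed) {
            throw new IllegalStateException("browser already closed");
        }
    }

    @Override
    public void close() {
        closed = true;
        log.add("close");
    }
}

class RenderOrderDemo {
    // Fetch, wait, read, and only then close: try-with-resources
    // runs close() after the body of the try block has finished.
    static List<String> render(String url) {
        FakeBrowser browser = new FakeBrowser();
        try (FakeBrowser b = browser) {
            String html = b.getPage(url);
            b.waitForBackgroundWork(5000);
            if (!html.isEmpty()) {
                b.log.add("read");
            }
        }
        return browser.log;
    }

    public static void main(String[] args) {
        // prints [getPage, wait, read, close]
        System.out.println(render("http://example.com/"));
    }
}
```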

Getting the response

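import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;

public abstract class HTTPUtils {
    public static HttpResponse getHtml(HttpClient client, String url) throws ClientProtocolException, IOException {
        // HttpGet is HttpClient's request class for GET requests
        HttpGet getMethod = new HttpGet(url);
        // Execute the GET request. A failed request propagates as an exception,
        // so the caller never sees a status code for a request that did not complete.
        HttpResponse response = client.execute(getMethod);
        System.out.println(response);
        // Return the response to the GET request
        return response;
    }
}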
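
For comparison, the same status check can also be sketched with the HTTP client built into the JDK (`java.net.http`, Java 11 and later) instead of Apache HttpClient. This is a swapped-in alternative, not the article's method; the class name `JdkStatusCheck` and both helper names are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class JdkStatusCheck {
    // Build a GET request for the given URL (no network access happens here).
    static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Send the request and return only the status code; the body is discarded
    // because the rendered page is fetched separately by the headless browser.
    static int fetchStatus(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<Void> response =
                client.send(buildGet(url), HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }
}
```

A caller would compare `JdkStatusCheck.fetchStatus(url)` with 200 before starting the slower rendering step.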

Registering the database

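import javax.sql.DataSource;

import org.apache.commons.dbcp.BasicDataSource;

public class MydataSource {
    // Register the database: build a pooled DataSource for the given JDBC URL
    public static DataSource getData(String conn) {
        BasicDataSource ds = new BasicDataSource();
        // Driver class name for MySQL Connector/J 8.x
        ds.setDriverClassName("com.mysql.cj.jdbc.Driver");
        ds.setUsername("root");
        ds.setPassword("mysql");
        ds.setUrl(conn);
        return ds;
    }
}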
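
`getData` receives the complete JDBC connection string. When several databases on the same server are used, a small helper can keep those strings consistent. The sketch below assumes that only host, port, and database name vary and that the `characterEncoding=utf-8` parameter is always wanted; `JdbcUrls` and `mysqlUrl` are hypothetical names.

```java
class JdbcUrls {
    // Hypothetical helper: assembles a MySQL JDBC URL from its parts.
    static String mysqlUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?characterEncoding=utf-8";
    }

    public static void main(String[] args) {
        // prints jdbc:mysql://localhost:3307/skd?characterEncoding=utf-8
        System.out.println(mysqlUrl("localhost", 3307, "skd"));
    }
}
```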

Database operations with QueryRunner

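import java.sql.SQLException;
import java.util.List;

import javax.sql.DataSource;

import org.apache.commons.dbutils.QueryRunner;
import org.apache.commons.dbutils.handlers.ScalarHandler;

public class Mysql {
    static DataSource ds = MydataSource.getData("jdbc:mysql://localhost:3307/skd?characterEncoding=utf-8");
    static DataSource dss = MydataSource.getData("jdbc:mysql://localhost:3307/qcf?characterEncoding=utf-8");
    // Create the QueryRunner objects that will execute the SQL statements
    static QueryRunner qr = new QueryRunner(ds);
    static QueryRunner qrr = new QueryRunner(dss);

    public static void executeInsert(List data) throws SQLException {
        // Find the end of the table (the largest id) so new rows are inserted after it
        Object m = qrr.query("SELECT MAX(id) FROM q", new ScalarHandler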
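
The comment in the listing describes finding the end of the table through `MAX(id)` before inserting. A minimal sketch of that step follows, assuming ids are consecutive integers and that an empty table should start at 1; both are assumptions, as is the helper name `nextId`. It relies on two facts: `ScalarHandler` hands back the single column value as an `Object`, and `MAX(id)` is NULL when the table has no rows.

```java
class NextId {
    // Hypothetical helper: derives the id for the next row from the
    // value returned by "SELECT MAX(id) ...".
    static int nextId(Object maxId) {
        if (maxId == null) {
            // Empty table: MAX(id) is NULL, so start at 1 (assumption).
            return 1;
        }
        // The driver may return Integer or Long depending on the column type.
        return ((Number) maxId).intValue() + 1;
    }

    public static void main(String[] args) {
        System.out.println(nextId(null));               // prints 1
        System.out.println(nextId(Integer.valueOf(41))); // prints 42
    }
}
```

With several writers inserting at once, two of them can read the same maximum; an AUTO_INCREMENT column avoids that race.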
