Java Crawler Series: Scraping Simple Websites (Spring Boot)


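<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>

<!-- https://mvnrepository.com/artifact/net.sourceforge.htmlunit/htmlunit -->
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.33</version>
</dependency>

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.44.0</version>
</dependency>

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.3</version>
</dependency>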

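Ways to send the request. Method 1: send the request directly with jsoup. The drawback is that it cannot fetch data that is loaded dynamically by JavaScript.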


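public static Elements getDoc(String HomeUrl, String divClassName) throws IOException {
    Document doc;
    try {
        // SslUtils is a helper that is not shown in this post;
        // judging by its name, it skips SSL certificate checks
        SslUtils.ignoreSsl();
    } catch (Exception e) {
        e.printStackTrace();
    }

    // Send a GET request with a browser-like User-Agent
    doc = Jsoup.connect(HomeUrl).userAgent(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36")
            .get();

    // divClassName is passed to select(), so it must be a CSS selector string
    Elements select = doc.select(divClassName);
    return select;
}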

Method 2: send the request with HtmlUnit. This addresses the drawback above, because HtmlUnit simulates a browser environment.

public static String htmlJsUtils(String url) {
    URL url1;
    System.out.println("Loading page now-----------------------------------------------: " + url);
    try {
        url1 = new URL(url);
    } catch (MalformedURLException e) {
        e.printStackTrace();
        return null;                                                // an invalid URL cannot be loaded
    }
    // HtmlUnit simulates a browser
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setJavaScriptEnabled(true);              // enable the JS interpreter (true by default)
    webClient.getOptions().setCssEnabled(false);                    // disable CSS support
    webClient.getOptions().setThrowExceptionOnScriptError(false);   // do not throw when a script fails
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setTimeout(10 * 1000);                   // connection timeout: 10 seconds
    WebRequest webRequest = new WebRequest(url1, HttpMethod.GET);
    webRequest.setCharset(Encoding.DEFAULT_CHARSET);
    HtmlPage page;
    try {
        page = webClient.getPage(webRequest);
    } catch (IOException e) {
        e.printStackTrace();
        webClient.close();
        return null;                                                // the page failed to load
    }
    webClient.waitForBackgroundJavaScript(1 * 1000);                // wait up to 1 second for background JS

    String pageAsXml = page.asXml();
    webClient.close();                                              // release the simulated browser
    return pageAsXml;
}


The method returns a string, so you still need jsoup to parse it. Jsoup.parse(String html) converts the string into a Document that you can query.

   Document parse = Jsoup.parse(s1);   // s1 is the page string obtained above
   parse.select("h2").toString();      // select elements by tag name

Method 3: simulate browser requests with ChromeDriver

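public static String BrowserVersion(String url, WebDriver webDriver) {

    // Pause for 2 seconds before each request
    try {
        Thread.sleep(1000 * 2);
    } catch (InterruptedException e) {
        e.printStackTrace();
        Thread.currentThread().interrupt();   // keep the interrupt flag set
    }
    webDriver.get(url);
    // Page source as the browser currently holds it
    String responseBody = webDriver.getPageSource();
    System.out.println(responseBody);

    return responseBody;
}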

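Pass a driver object in so that ChromeDriver objects are not created over and over. The driver is the browser itself, so one run of the code only needs to open one browser.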


To use this method you need chromedriver.exe. Put it in the Application directory of your Chrome installation, as in the path below.
 System.setProperty("webdriver.chrome.driver",
                "C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe");
                
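In the code above, the first argument is the property name that selects the ChromeDriver driver, and the second argument is the location of the driver.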
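Because a wrong driver path only shows up as a failure when the browser is launched, it can help to check the path before setting the property. The following is a minimal sketch, not part of the original code: the class name DriverPathSketch and the method configureChromeDriver are hypothetical, and it uses only the standard library.

```java
import java.io.File;

class DriverPathSketch {

    // Hypothetical helper: sets "webdriver.chrome.driver" only if the file exists.
    // Returns true when the property was set, false when the path is not a file.
    public static boolean configureChromeDriver(String driverPath) {
        File driver = new File(driverPath);
        if (!driver.isFile()) {
            return false;
        }
        System.setProperty("webdriver.chrome.driver", driver.getAbsolutePath());
        return true;
    }

    public static void main(String[] args) {
        // Same location as in the text above; adjust it to your own installation
        String path = "C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe";
        if (!configureChromeDriver(path)) {
            System.out.println("chromedriver not found at: " + path);
        }
    }
}
```

Checking first turns a confusing startup error into a clear message about the missing file.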

Usage:

public static void driver() {
    System.setProperty("webdriver.chrome.driver",
            "C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe");
    ChromeDriver chromeDriver = new ChromeDriver();
    String url = "";                    // put the target URL here
    BrowserVersion(url, chromeDriver);  // reuse the same driver for every page
    chromeDriver.quit();                // close the browser when done
}

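URL issues with the crawl target:

Open the browser, press F12, open the Network tab, and click a request.

The URL shown in the browser's address bar may differ from the one you see in the F12 tools, especially when it contains Chinese characters. That is because the URL has been encoded, so you have to handle this yourself for url1. Taking Chinese text as an example: the characters are encoded as UTF-8 and each byte is then written as %XX, so you may need to apply the same URL encoding (percent-encoding) before sending the request.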

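 Use URLEncoder.encode(String s, String enc), or
 String.getBytes(String charsetName), which encodes the string into a sequence of bytes using the named charset and stores the result in a new byte array. Build a String from the array if you need text.
 For the reverse direction, URLDecoder.decode(String s, String enc) turns a percent-encoded string back into text.
 Alternatively, encode by hand: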

public static String toUtf8String(String s) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0 && c <= 255) {
                sb.append(c);
            } else {
                byte[] b;
                try {
                    b = String.valueOf(c).getBytes("utf-8");
                } catch (Exception ex) {
                    System.out.println(ex);
                    b = new byte[0];
                }
                for (int j = 0; j < b.length; j++) {
                    int k = b[j];
                    if (k < 0)
                        k += 256;
                    sb.append("%" + Integer.toHexString(k).toUpperCase());
                }
            }
        }
        return sb.toString();
    }
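For comparison, here is a minimal sketch of the same round trip with the standard library's URLEncoder and URLDecoder. The class name UrlEncodingSketch is hypothetical. Note that URLEncoder targets form encoding, so it writes a space as "+", and it also escapes reserved characters that the hand-written loop above leaves alone.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlEncodingSketch {

    // Percent-encode a value using UTF-8 bytes
    public static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    // Reverse direction: turn %XX sequences back into characters
    public static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encode("\u4e2d\u6587");   // the two characters of "Chinese"
        System.out.println(encoded);               // prints %E4%B8%AD%E6%96%87
        System.out.println(decode(encoded).equals("\u4e2d\u6587")); // prints true
    }
}
```

Encode only the individual parameter values, not the whole URL, otherwise characters such as ":" and "/" are escaped as well.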

Now that the data has been scraped, how should it be stored?

One option is to write it to Excel, which makes the data easy to organize.


<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>

Create the Excel sheet

    public static HSSFWorkbook getHSSFWorkbook(Page page) {
        HSSFWorkbook workbook = new HSSFWorkbook();
        // 1. Define the header titles
        String[] titles = { "country_name",
                "province_name",
                "city_name",
                "countrysize_name",
                "post_id", "latitude", "longitude"};

        // 2. Create the sheet object
        HSSFSheet sheet = workbook.createSheet("users");
        // 3. Create the title row; the argument is the row index, which starts at 0
        HSSFRow titleRow = sheet.createRow(page.getDataRow());
        // 4. Write the titles into the title row
        setTitleName(titles, titleRow);
        // The title row is now in use, so move the counter to the next row
        page.setDataRow(page.getDataRow() + 1);
        return workbook;
    }

    public static void setTitleName(String[] titles, HSSFRow titleRow) {
        int j = 0;
        for (int i = 0; i < titles.length; i++) {
            // Create the cell; titles go into every second column (0, 2, 4, ...)
            HSSFCell cell = titleRow.createCell(j);
            j = j + 2;
            cell.setCellValue(titles[i]);
        }
    }

// Method that writes the data out

public static void WriteCourntryInExcel(Pid pid, Page page, HSSFWorkbook hssfWorkbook) {
    String filename = "日本";   // "Japan"
    File file = new File("E:\\国家\\" + filename + ".xls");
    Boolean aBoolean = WriteCountry(pid, page, file, hssfWorkbook);
    System.out.println(aBoolean);
}
// pid is the object to write; I parsed the scraped string data and converted it into an object.

To convert the string into an object you can use JSON.parseObject(result, MessageRoot.class);
Add the dependency:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.74</version>
</dependency>

private static Boolean WriteCountry(Pid pid, Page page, File file, HSSFWorkbook workbook) {
        // Get the sheet that was created together with the workbook
        HSSFSheet sheet = workbook.getSheet("users");

        // Create the row object at the current data row
        Integer dataRow = page.getDataRow();
        HSSFRow row = sheet.createRow(dataRow);
        // After creating the row, add 1 to the row counter: this row is now in use,
        // and without the +1 the data would be overwritten.
        page.setDataRow(dataRow + 1);

// "0:continent_id", "1:country_id", "2:province_id", "3:city_id", "4:countrysize_id", "5:continent_name", "6:country_name", "7:country_code",
//                "8:province_name", "9:province_code",
//                "10:is_child", "11:parent_id", "12:level",
//                "13:city_name", "14:city_code",
//                "15:countrysize_name", "16:countrysize_code",
//                "17:post_id", "18:latitude", "19:longitude"

        // createCell(0) creates cell 0 of the row; setCellValue(String) writes the data
        row.createCell(0).setCellValue(pid.getCountry());
        row.createCell(2).setCellValue(pid.getProvince());
        row.createCell(4).setCellValue(pid.getCity());
        row.createCell(6).setCellValue(pid.getCountrysize());
        row.createCell(8).setCellValue(pid.getPostId());
        row.createCell(10).setCellValue(pid.getLatitude());
        row.createCell(12).setCellValue(pid.getLongitude());

        writeInto(workbook, file);   // write the workbook to the file
        return true;
    }

// This method does not handle directories. If you need that, create the directory
// yourself before writing; a missing directory causes an error.
public static void writeInto(HSSFWorkbook workbook, File file) {
    try {
        if (!file.exists()) {
            file.createNewFile();
        }
        // Write on every call, including the one that creates the file
        workbook.write(file);
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}

For reference, a version that also handles the directory levels:

  
    public void writeInFile(String msg, String pathName, String MethodName, String fileType) {

        File file = new File(pathName + "/" + MethodName + "." + fileType);

        // Create every missing parent directory in one call
        File parent = file.getParentFile();
        if (parent != null && !parent.exists()) {
            parent.mkdirs();
        }

        // FileOutputStream creates the file if it does not exist yet;
        // try-with-resources flushes and closes the writer
        try (BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), "UTF-8"))) {
            bw.write(msg);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
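The same directory check can also be written with java.nio.file, where Files.createDirectories creates all missing levels and does nothing if they already exist. This is a sketch under assumptions, not the author's method: the class WriteFileSketch and the method writeText are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class WriteFileSketch {

    // Writes msg as UTF-8 to <dir>/<name>.<fileType>, creating missing directories first
    public static Path writeText(String msg, String dir, String name, String fileType) {
        Path target = Paths.get(dir, name + "." + fileType);
        try {
            Path parent = target.getParent();
            if (parent != null) {
                Files.createDirectories(parent);   // no error if it already exists
            }
            // Creates the file, or replaces its content if it is already there
            Files.write(target, msg.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) {
        String base = System.getProperty("java.io.tmpdir");
        Path written = writeText("hello", base + "/crawler/demo", "result", "txt");
        System.out.println("Wrote " + written);
    }
}
```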

Usage:

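public static void main(String[] args) {

    Page page = new Page();
    page.setDataRow(1);
    HSSFWorkbook hssfWorkbook = getHSSFWorkbook(page);
    // pid is the object to write; I parsed the scraped string data and converted it into an object.
    // Assumption: Pid has a no-arg constructor; fill it from your parsed crawl result.
    Pid pid = new Pid();
    WriteCourntryInExcel(pid, page, hssfWorkbook);
}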

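That basically completes the crawl workflow. Two things to watch for are rows in the file being overwritten and the file itself being overwritten, which is why I split the workflow into several methods. There is also the matter of human verification, which I will share when I get the chance; the most common kind is the numeric CAPTCHA.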

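HSSFWorkbook is the object that writes the data to Excel, with one object per file. You can define a class to track the changing row number; that is the page object passed in above, whose property is dataRow.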
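The Page class itself is not shown in this post, so the following is only a guess at its shape: a holder for the dataRow property with the getter and setter used above. The nextRow helper is a hypothetical addition that returns the current index and advances it in one step, so that a caller cannot forget the +1.

```java
class PageSketch {

    // Hypothetical stand-in for the Page class used in the Excel code
    static class Page {
        private Integer dataRow = 0;

        public Integer getDataRow() {
            return dataRow;
        }

        public void setDataRow(Integer dataRow) {
            this.dataRow = dataRow;
        }

        // Returns the row index to write to, then moves the counter to the next row
        public int nextRow() {
            int current = dataRow;
            dataRow = current + 1;
            return current;
        }
    }

    public static void main(String[] args) {
        Page page = new Page();
        page.setDataRow(1);
        System.out.println(page.nextRow());    // prints 1
        System.out.println(page.nextRow());    // prints 2
        System.out.println(page.getDataRow()); // prints 3
    }
}
```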
