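Set up the Maven environment
Download the browser driver and add it to the project.
Go to the Huawei Cloud mirror site to download the Chrome driver (chromedriver)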
https://mirrors.huaweicloud.com/home
Download the driver version that matches the version of Chrome installed on your computer.
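A quick way to check a downloaded driver against the installed browser is to compare version strings. The sketch below assumes ChromeDriver's versioning convention, under which a driver supports the Chrome builds that share its first three version components (major.minor.build). The class name, method names and the sample version strings are illustrative and not part of the original project.

```java
public class DriverVersionCheck {

    /** Returns the first three dot-separated components, e.g. "96.0.4664" for "96.0.4664.45". */
    public static String buildPrefix(String version) {
        String[] parts = version.trim().split("\\.");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected at least three components: " + version);
        }
        return parts[0] + "." + parts[1] + "." + parts[2];
    }

    /** True when the browser and the driver share the same major.minor.build prefix. */
    public static boolean matches(String chromeVersion, String driverVersion) {
        return buildPrefix(chromeVersion).equals(buildPrefix(driverVersion));
    }

    public static void main(String[] args) {
        // Illustrative version strings; read the real ones from chrome://version
        // and from the name of the directory on the mirror site.
        System.out.println(matches("96.0.4664.110", "96.0.4664.45"));
        System.out.println(matches("96.0.4664.110", "97.0.4692.71"));
    }
}
```

If the check fails, pick the driver directory on the mirror whose name starts with the same three components as the browser version.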
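Add the dependencies to pom.xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-api</artifactId>
    <version>3.141.59</version>
</dependency>
Start scraping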
Target URL
https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=1&s=1&click=0
The goal is to scrape the price, shop name, and product description of every item on all 47 result pages.
First page URL:
https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=1&s=1&click=0
Second page URL:
https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=3&s=61&click=0
Third page URL:
https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=5&s=121&click=0
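Comparing the three URLs shows the pattern for the i-th page (i starts at 1): page = 1, 3, 5 takes the odd values, that is 2i-1
s = 1, 61, 121 increases by 60 each time, that is ((i-1)*60)+1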
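The pattern can be sketched as a small URL builder. The helper names below are hypothetical, and the base query string is the one used in the scraping loop. Computing the two values in int-returning methods also avoids a concatenation pitfall: in an expression such as "&s=" + (i-1)*60 + 1, the trailing 1 is appended as a character rather than added to the number.

```java
public class JdSearchUrl {

    // Base query taken from the article's loop; page and s are appended per page.
    private static final String BASE =
            "https://search.jd.com/search?keyword=iphone%2013&wq=iphone%2013&cid3=655&psort=3";

    /** Value of the page parameter for the i-th result page (1-based): 1, 3, 5, ... */
    public static int pageParam(int i) {
        return 2 * i - 1;
    }

    /** Value of the s parameter for the i-th result page (1-based): 1, 61, 121, ... */
    public static int startParam(int i) {
        return (i - 1) * 60 + 1;
    }

    /** Builds the search URL for the i-th result page. */
    public static String buildUrl(int i) {
        // Both values are already ints here, so "+" only concatenates finished numbers.
        return BASE + "&page=" + pageParam(i) + "&s=" + startParam(i) + "&click=0";
    }

    public static void main(String[] args) {
        // Print the first three URLs to compare them with the ones observed in the browser.
        for (int i = 1; i <= 3; i++) {
            System.out.println(buildUrl(i));
        }
    }
}
```

The printed URLs should end in page=1&s=1, page=3&s=61 and page=5&s=121, matching the three pages listed above.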
Create a browser object
ChromeDriver webDriver = new ChromeDriver();
Use a loop to scrape 45 pages of data
for (int i = 1; i <= 45; i++) {
webDriver.get("https://search.jd.com/search?keyword=iphone%2013&wq=iphone%2013&cid3=655&psort=3&page=" + (2 * i - 1) + "&s=" + ((i - 1) * 60 + 1) + "&click=0");
Scroll to the very bottom of the page, because JD does not load some of the items until the page has been scrolled down
((JavascriptExecutor) webDriver).executeScript("window.scrollTo(0, document.body.scrollHeight)");
Get the page source
String pageSource = webDriver.getPageSource();
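Select one item on the page and copy its selector path
#J_goodsList > ul > li:nth-child(13) > div
Removing the :nth-child index gives a selector that matches every item
#J_goodsList > ul > li > div
Look at the selector path of the price
#J_goodsList > ul > li:nth-child(16) > div > div.p-price > strong > i
Inside the loop over the items, the path relative to each item is
div.p-price > strong > i
Likewise, the shop name path is div.p-shop > span > a
The product description path is div.p-name.p-name-type-2 > a > em
A Product class is also needed, with one object created per item: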
package com.zygxy.shop;

public class Product {
    private String price;        // price text taken from div.p-price
    private String shopname;     // shop name taken from div.p-shop
    private String shopcontext;  // product description taken from div.p-name

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    public String getShopname() {
        return shopname;
    }

    public void setShopname(String shopname) {
        this.shopname = shopname;
    }

    public String getShopcontext() {
        return shopcontext;
    }

    public void setShopcontext(String shopcontext) {
        this.shopcontext = shopcontext;
    }
}
Full code
package com.zygxy.shop;

import com.alibaba.fastjson.JSONObject;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.chrome.ChromeDriver;

import java.io.IOException;
import java.util.Properties;

@Slf4j
public class JD {
    public static void main(String[] args) throws IOException {
        // Read the chromedriver path from application.properties on the classpath
        Properties properties = new Properties();
        properties.load(JD.class.getClassLoader().getResourceAsStream("application.properties"));
        System.out.println(properties.getProperty("chromedriver"));
        System.setProperty("webdriver.chrome.driver", properties.getProperty("chromedriver"));
        ChromeDriver webDriver = new ChromeDriver();
        for (int i = 1; i <= 45; i++) {
            // page = 2i-1, s = (i-1)*60+1; the parentheses keep "+ 1" as arithmetic, not string concatenation
            webDriver.get("https://search.jd.com/search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=aa23e9f58a714e9087f316ab6aa993bd&cid3=655&cid2=653&page=" + (2 * i - 1) + "&s=" + ((i - 1) * 60 + 1) + "&click=0");
            // Scroll to the bottom so that the lazily loaded items are present in the page
            ((JavascriptExecutor) webDriver).executeScript("window.scrollTo(0, document.body.scrollHeight)");
            String pageSource = webDriver.getPageSource();
            Document parse = Jsoup.parse(pageSource);
            // One element per product card
            Elements select = parse.select("#J_goodsList > ul > li > div");
            for (Element e : select) {
                Element price = e.selectFirst("div.p-price > strong > i");
                Element shopName = e.selectFirst("div.p-shop > span > a");
                Element shopcontext = e.selectFirst("div.p-name.p-name-type-2 > a > em");
                // selectFirst returns null when nothing matches; skip cards that lack a field
                if (price == null || shopName == null || shopcontext == null) {
                    continue;
                }
                Product product = new Product();
                product.setPrice(price.text());
                product.setShopname(shopName.text());
                product.setShopcontext(shopcontext.text());
                // Log each product as one JSON line
                String json = JSONObject.toJSONString(product);
                log.info(json);
            }
        }
        // Close the browser and stop the driver process
        webDriver.quit();
    }
}
pom.xml
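<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>NovelParse</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.13.1</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>3.141.59</version>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-api</artifactId>
            <version>3.141.59</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.75</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>1.2.3</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.25</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.12</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- Fully qualified name of the class that contains main -->
                            <mainClass>com.zygxy.shop.JD</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Resources configuration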
application.properties
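Local driver path (a backslash is an escape character in a .properties file, so the Windows path is written with forward slashes)
chromedriver=E:/shixun/pachong/src/main/resources/chromedriver.exe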
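Windows paths deserve some care in application.properties, because java.util.Properties treats the backslash as an escape character and silently drops it before an ordinary letter. The sketch below loads three spellings of a made-up path (E:\demo\chromedriver.exe, which is an assumption and not the project's real path) to show which ones survive. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class DriverPathProperties {

    /** Parses properties text the same way Properties.load reads a file and returns the chromedriver value. */
    public static String readChromedriver(String propertiesText) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // A StringReader does not fail in practice; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return properties.getProperty("chromedriver");
    }

    public static void main(String[] args) {
        // In Java source "\\" is ONE backslash, so this is the file line
        // chromedriver=E:\demo\chromedriver.exe and the backslashes get dropped on load.
        System.out.println(readChromedriver("chromedriver=E:\\demo\\chromedriver.exe"));
        // File line with doubled backslashes: chromedriver=E:\\demo\\chromedriver.exe
        System.out.println(readChromedriver("chromedriver=E:\\\\demo\\\\chromedriver.exe"));
        // File line with forward slashes, which Windows also accepts in file paths.
        System.out.println(readChromedriver("chromedriver=E:/demo/chromedriver.exe"));
    }
}
```

Only the second and third spellings keep the directory separators, so either doubled backslashes or forward slashes are a safer choice for the driver path.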
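logback.xml defines where the log is written and in what format
Output pattern: %msg%n
Daily rolling log file name pattern: ${LOG_HOME}/shop.%d{yyyy-MM-dd}.log
Log file output pattern: %msg%n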



