
Java Crawler Techniques (Part 2): Scraping JD iPhone Product Information and Generating JSON Logs


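Preparation

Configure the Maven environment.
Download the browser driver and add it to the project.

Download the browser driver

Go to the Huawei Cloud mirror site and download the Chrome browser driver (ChromeDriver):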

https://mirrors.huaweicloud.com/home 


Download the driver version that matches the version of Chrome installed on your machine.

Add the dependencies to pom.xml

        
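        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>3.141.59</version>
        </dependency>

        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-api</artifactId>
            <version>3.141.59</version>
        </dependency>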
        

Start crawling

Target URL

https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=1&s=1&click=0


The goal is to crawl the price, shop name, and product description of the products on all 47 result pages.

URL of the first page:

https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=1&s=1&click=0

URL of the second page:

https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=3&s=61&click=0

URL of the third page:

https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=5&s=121&click=0

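Comparing the three URLs reveals the pattern: page = 1, 3, 5, ... takes odd values, so the i-th page (counting from 1) uses 2i - 1.
s = 1, 61, 121, ... increases by 60 each time, so the i-th page uses ((i-1)*60)+1.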
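
The pattern can be checked with a small sketch. For the i-th page counted from 1, page = 2i - 1 and s = (i - 1) * 60 + 1, which reproduces the pairs 1/1, 3/61, and 5/121 seen in the three URLs above. The class `JdSearchUrl` and its method names are hypothetical, and the base URL is shortened to the keyword and sort parameters as an assumption. Both values are computed as integers before being concatenated, so the `+ 1` is arithmetic rather than the character `1` appended to the string.

```java
// Hypothetical helper illustrating the page/s pattern; not part of the original project.
final class JdSearchUrl {
    // Shortened base URL (assumption): only the keyword and sort parameters are kept.
    private static final String BASE =
            "https://search.jd.com/Search?keyword=iphone%2013&psort=3";

    // The i-th result page (i starts at 1) uses the odd number 2i - 1.
    static int pageParam(int i) {
        return 2 * i - 1;
    }

    // The s offset starts at 1 and grows by 60 per result page.
    static int startParam(int i) {
        return (i - 1) * 60 + 1;
    }

    // Both parameters are already ints here, so "+" only joins finished values.
    static String urlFor(int i) {
        return BASE + "&page=" + pageParam(i) + "&s=" + startParam(i) + "&click=0";
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 3; i++) {
            System.out.println(urlFor(i));
        }
    }
}
```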

Main code and steps

Define a browser object

ChromeDriver webDriver = new ChromeDriver();

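Use a loop to crawl 45 pages of data. The s expression is wrapped in parentheses so that the + 1 is added as a number rather than appended to the string as the character 1.

for (int i = 1; i <= 45; i++) {
            webDriver.get("https://search.jd.com/search?keyword=iphone%2013&wq=iphone%2013&cid3=655&psort=3&page=" + ((2 * i) - 1) + "&s=" + (((i - 1) * 60) + 1) + "&click=0");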

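Scroll to the bottom of the page, because on JD some of the data is not loaded unless the page is scrolled all the way down

((JavascriptExecutor) webDriver).executeScript("window.scrollTo(0,document.body.scrollHeight)");

Get the page source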

 String pageSource = webDriver.getPageSource();

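Select one item on the page and copy its selector path; dropping the nth-child index gives a selector that matches every item

#J_goodsList > ul > li:nth-child(13) > div
#J_goodsList > ul > li > div

Look at the price path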

#J_goodsList > ul > li:nth-child(16) > div > div.p-price > strong > i

Inside the loop over the items, the path relative to each item is

 div.p-price > strong > i

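Similarly, the shop name path is div.p-shop > span > a
and the product description path is div.p-name.p-name-type-2 > a > em

You also need a Product class; one object is created for each item: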

package com.zygxy.shop;

public class Product {
    private String price;
    private String shopname;
    private String shopcontext;

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    public String getShopname() {
        return shopname;
    }

    public void setShopname(String shopname) {
        this.shopname = shopname;
    }

    public String getShopcontext() {
        return shopcontext;
    }

    public void setShopcontext(String shopcontext) {
        this.shopcontext = shopcontext;
    }
}
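
Each Product ends up as one JSON object per log line. As a rough illustration of what such a line contains, the sketch below formats the three fields by hand using only the JDK. This is not how fastjson works internally, and the key order of real fastjson output may differ; the class `ProductJsonSketch` and the sample values are made up for the example.

```java
// Dependency-free illustration of a one-line JSON record; the article itself uses fastjson.
final class ProductJsonSketch {
    // Wrap a value in quotes and escape the characters JSON strings cannot contain raw.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int k = 0; k < s.length(); k++) {
            char c = s.charAt(k);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                // Control characters are written as \u00XX escapes.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Keys mirror the field names of the Product class.
    static String toJsonLine(String price, String shopname, String shopcontext) {
        return "{\"price\":" + quote(price)
                + ",\"shopname\":" + quote(shopname)
                + ",\"shopcontext\":" + quote(shopcontext) + "}";
    }

    public static void main(String[] args) {
        // Sample values are invented for illustration.
        System.out.println(toJsonLine("5999.00", "Example Store", "iPhone 13"));
    }
}
```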

Full code

package com.zygxy.shop;

import com.alibaba.fastjson.JSONObject;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.chrome.ChromeDriver;

import java.io.IOException;
import java.util.Properties;

@Slf4j
public class JD {
    public static void main(String[] args) throws IOException {
        // Read the local ChromeDriver path from application.properties on the classpath
        Properties properties = new Properties();
        properties.load(JD.class.getClassLoader().getResourceAsStream("application.properties"));
        System.out.println(properties.getProperty("chromedriver"));
        System.setProperty("webdriver.chrome.driver", properties.getProperty("chromedriver"));
        ChromeDriver webDriver = new ChromeDriver();
        try {
            for (int i = 1; i <= 45; i++) {
                // page = 2i - 1 and s = (i - 1) * 60 + 1; the outer parentheses keep "+ 1" numeric
                webDriver.get("https://search.jd.com/search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=aa23e9f58a714e9087f316ab6aa993bd&cid3=655&cid2=653&page=" + ((2 * i) - 1) + "&s=" + (((i - 1) * 60) + 1) + "&click=0");
                // Scroll to the bottom so that the remaining items are loaded
                ((JavascriptExecutor) webDriver).executeScript("window.scrollTo(0,document.body.scrollHeight)");
                String pageSource = webDriver.getPageSource();
                Document parse = Jsoup.parse(pageSource);
                Elements select = parse.select("#J_goodsList > ul > li > div");
                for (Element e : select) {
                    Element price = e.selectFirst("div.p-price > strong > i");
                    Element shopName = e.selectFirst("div.p-shop > span > a");
                    Element shopcontext = e.selectFirst("div.p-name.p-name-type-2 > a > em");
                    // selectFirst returns null when nothing matches; skip incomplete items
                    if (price == null || shopName == null || shopcontext == null) {
                        continue;
                    }
                    Product product = new Product();
                    product.setPrice(price.text());
                    product.setShopname(shopName.text());
                    product.setShopcontext(shopcontext.text());
                    // One JSON object per log line
                    String json = JSONObject.toJSONString(product);
                    log.info(json);
                }
            }
        } finally {
            // Close the browser and stop the driver process
            webDriver.quit();
        }
    }
}

pom.xml


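<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>NovelParse</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <!-- HTML parsing -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.13.1</version>
        </dependency>

        <!-- IO utilities -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>

        <!-- Selenium browser automation -->
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>3.141.59</version>
        </dependency>

        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-api</artifactId>
            <version>3.141.59</version>
        </dependency>

        <!-- JSON serialization -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.75</version>
        </dependency>

        <!-- Logging -->
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>1.2.3</version>
        </dependency>

        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.25</version>
            <scope>compile</scope>
        </dependency>

        <!-- Lombok, used for the @Slf4j annotation -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.12</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Build a single runnable jar that bundles all dependencies -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- Must be the fully qualified name of the JD class (package com.zygxy.shop) -->
                            <mainClass>com.zygxy.shop.JD</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>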

Resources configuration

application.properties
Path to the local driver
chromedriver =E:shixunpachongsrcmainresourceschromedriver.exe
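
One detail worth checking when a Windows path goes into a .properties file: `java.util.Properties` treats the backslash as an escape character, so a path written with single backslashes loses its separators when loaded. The sketch below loads two in-memory property texts to show the difference. The class `DriverPathProperties` and the path `C:\drivers\chromedriver.exe` are made up for the example and are not the author's setup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo of backslash handling in .properties files.
final class DriverPathProperties {
    // Parse the given .properties text and return the "chromedriver" value.
    static String load(String fileContent) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("chromedriver");
    }

    public static void main(String[] args) {
        // File text with single backslashes: chromedriver=C:\drivers\chromedriver.exe
        String single = "chromedriver=C:\\drivers\\chromedriver.exe";
        // File text with doubled backslashes: chromedriver=C:\\drivers\\chromedriver.exe
        String escaped = "chromedriver=C:\\\\drivers\\\\chromedriver.exe";
        System.out.println(load(single));   // separators are dropped
        System.out.println(load(escaped));  // separators are preserved
    }
}
```

In the file itself, the second form means writing every backslash twice.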

logback.xml defines where the log is written and its format



    
    
        
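<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Log directory. The original value did not survive; "./logs" is a placeholder. -->
    <property name="LOG_HOME" value="./logs"/>

    <!-- Console output. The appender names STDOUT and FILE are assumed. -->
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
    </appender>

    <!-- One log file per day, containing only the message (the JSON line) -->
    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOG_HOME}/shop.%d{yyyy-MM-dd}.log</fileNamePattern>
        </rollingPolicy>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
    </appender>

    <!-- Root logger. The level INFO is assumed; log.info(json) needs INFO or lower. -->
    <root level="INFO">
        <appender-ref ref="STDOUT"/>
        <appender-ref ref="FILE"/>
    </root>
</configuration>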

    


When reposting, please credit the source: this article is reposted from www.mshxw.com
Article URL: https://www.mshxw.com/it/664346.html