Dynamically scraping product names, review counts, and prices from Suning

Scraping Suning product information

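- Importing packages
- Basic DataFrame display settings
- Setting the page scroll offset (in pixels)
- Processing review counts
- Initializing the browser
- The crawling process
- Driving the browser
- Writing to a database or saving to a CSV file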

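·To scrape Suning's product information we use the Chrome browser. Download the ChromeDriver build whose version matches your installed browser, then place the driver in the root directory of your Python interpreter. The driver can be downloaded from:
http://npm.taobao.org/mirrors/chromedriver/
The crawling code follows, with a detailed explanation of each step: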

Importing packages

from time import sleep

from selenium import webdriver
import pandas as pd
import re
from datetime import datetime

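Basic DataFrame display settings

·The first line sets the maximum number of columns to display. Beyond that limit pandas prints an ellipsis; passing None shows every column.
·The second line sets the display width, so each row shows at most 150 characters.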

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 150)

Setting the page scroll offset (in pixels)

js = "var q=document.documentElement.scrollTop=100000"
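The fixed offset of 100000 pixels works as long as the page is shorter than that. As a hedged alternative, the sketch below keeps scrolling until the page height stops growing. The helper name scroll_to_bottom and its pause and max_rounds parameters are assumptions made for illustration and are not part of the original script; it only relies on the driver's execute_script method.

```python
from time import sleep


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing (hypothetical helper)."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the current bottom of the document
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give newly loaded items time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the page did not grow, so nothing new was loaded
        last_height = new_height
    return last_height
```

With a real browser session this would be called as scroll_to_bottom(driver) after driver.get(url). The max_rounds cap keeps the loop from running forever on pages that grow without end.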

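Processing review counts

If the review text contains "万" (ten thousand), split on that character, convert the leading number, and multiply by 10000 to get an integer review count. If it does not contain "万", the digits can be extracted directly.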

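def com_count(text):
    # Counts of ten thousand or more are written with '万', e.g. '1.2万'
    if '万' in text:
        num = float(text.split('万')[0])
        return int(round(num * 10000))
    else:
        # Take the leading digits; the pattern must be r'\d+' to match digits
        a = re.match(r'\d+', text)
        if a:
            return int(a.group())
        else:
            return 0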
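To make the conversion rule concrete, here is a small standalone sketch of the same idea written with a single regular expression. The function name parse_count and the sample labels used to exercise it (such as '1.2万+评价') are assumptions about what the page text looks like, not values taken from the original article.

```python
import re


def parse_count(text):
    """Convert a review-count label into an int (hypothetical helper)."""
    # An integer or decimal number, optionally followed by '万' (ten thousand)
    match = re.search(r'(\d+(?:\.\d+)?)(万)?', text)
    if not match:
        return 0  # no digits at all, e.g. a "no reviews yet" label
    number = float(match.group(1))
    if match.group(2):
        number *= 10000
    # round() guards against floating-point results such as 11999.999...
    return int(round(number))


for label in ['1.2万+评价', '356评价', '暂无评价']:
    print(label, parse_count(label))
```

Unlike re.match, re.search also finds the number when the label starts with other text, which may or may not be needed depending on the real page content.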

Initializing the browser

Initialize the browser by launching Chrome through webdriver from the Python selenium module.

driver = webdriver.Chrome()  # open the browser
driver.maximize_window()  # maximize the window

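The crawling process

Next comes the main function that scrapes the data. It has three parts.

Part 1: Use xpath to find the product list on the page currently open in the browser. product_list is a list of elements, and product_text is the text of its first element as a str. Split that text on a special symbol (the price sign ¥) to get a list in which each element holds all the data of one product, delete the first element (the text before the first price), and print the length of the list.

Part 2: Analysis shows that under normal conditions each page yields 120 entries. Products with a super-member offer carry two prices, which pushes the count above 120 and makes the data unreliable. In that case the page is discarded and the crawler moves on to the next page. If the count is exactly 120, treat each product as a list: its first item is the price, its second is the name, and its third is the review count, so the fields can be processed directly.

Part 3: Store each product's name, price, and review count in a two-dimensional DataFrame, adding the rows of the current page to the accumulated results.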

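from selenium.webdriver.common.by import By

results = pd.DataFrame()

# Main crawling function
def page_crawl(results):
    # The whole product list is one element; work on its text
    product_list = driver.find_elements(By.XPATH, '//*[@id="product-list"]')
    product_text = product_list[0].text
    # Every product starts with a price, so split on the currency symbol
    text_list = product_text.split('¥')
    del text_list[0]  # drop the text before the first price
    print('len(text_list):', len(text_list))
    if len(text_list) != 120:
        print('Super-member anomaly, skipping this page!')
    else:
        print('No super-member anomaly, extracting data!')
        text_list_split = [ii.split('\n') for ii in text_list]
        prices = [ii[0] for ii in text_list_split]
        goods = [ii[1] for ii in text_list_split]
        counts = [com_count(ii[2]) for ii in text_list_split]

        # If something goes wrong, enable this snippet to locate the problem
        # for ii in text_list_split:
        #     print(ii[0], ii[1], ii[2])

        # One row per product, indexed by product name
        page = pd.DataFrame(
            {'good_name': goods, 'good_price': prices, 'com_count': counts},
            index=goods,
        )
        # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
        results = pd.concat([results, page])

    return results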
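The splitting logic can be tried without a browser. The sketch below is a variant, not the method used in the article: instead of discarding a whole page when the count is off, it skips any chunk that has fewer than three lines. The function name parse_products and the sample text are made up for illustration, and whether a super-member price really appears as a short chunk of its own depends on the actual page layout, which is assumed here.

```python
def parse_products(product_text):
    """Split the text of a product list into price/name/review records."""
    records = []
    # Everything before the first price symbol is not a product, so skip it
    for chunk in product_text.split('¥')[1:]:
        lines = [line.strip() for line in chunk.split('\n') if line.strip()]
        if len(lines) < 3:
            continue  # incomplete chunk (assumed: an extra member price)
        records.append({
            'good_price': lines[0],
            'good_name': lines[1],
            'com_count': lines[2],
        })
    return records


# Made-up sample text in the assumed layout: price, name, review label
sample = "header\n¥12.90\nMask A\n1.2万+评价\n¥9.90\n¥15.00\nMask B\n300评价\n"
for record in parse_products(sample):
    print(record)
```

The trade-off is that this variant keeps more data per page but trusts the three-line layout, while the all-or-nothing check on 120 entries is stricter and simpler to reason about.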

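Driving the browser

Work out the pattern of the URL yourself, then use format to insert the page number into the URL string. Open each page in the browser, scroll down, scrape it, and finally close the browser.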

for ii in range(10):
    print(datetime.now(), ii)
    url = 'https://search.suning.com/%E5%8F%A3%E7%BD%A9/&iy=0&isNoResult=0&cp={}'.format(ii)
    driver.get(url)  # open the URL
    sleep(5)

    # scroll to the bottom of the page
    driver.execute_script(js)
    sleep(3)
    driver.execute_script(js)
    sleep(2)

    results = page_crawl(results)

print(results.shape)

driver.close()
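The percent-encoded segment in the URL above is simply the search keyword 口罩 (face mask) encoded as UTF-8. The following sketch shows how such URLs might be built for any keyword; the helper name build_search_url is an assumption, and the query parameters are copied unchanged from the URL in the loop rather than taken from any documented Suning API.

```python
from urllib.parse import quote


def build_search_url(keyword, page):
    """Build a search URL following the pattern used in the loop above."""
    # quote() percent-encodes the keyword as UTF-8 by default
    return 'https://search.suning.com/{}/&iy=0&isNoResult=0&cp={}'.format(
        quote(keyword), page
    )


for page in range(3):
    print(build_search_url('口罩', page))
```

Keeping the keyword as readable text in the script makes it easier to switch to another product category later.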

Writing to a database or saving to a CSV file

from sqlalchemy import create_engine

conn = create_engine('mysql+pymysql://root:dpb238031@localhost:3306/data?charset=utf8')
results.to_sql('data', conn, index=False, if_exists='replace')
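For the CSV option named in the heading, a minimal sketch could look like the following. The helper name save_results, the file name, and the sample rows are assumptions for illustration; in the real script the DataFrame passed in would be the results built by the crawl.

```python
import pandas as pd


def save_results(results, path):
    """Write the scraped table to a CSV file (hypothetical helper)."""
    # utf-8-sig writes a byte order mark, which helps spreadsheet programs
    # detect the encoding when the product names contain Chinese text
    results.to_csv(path, index=False, encoding='utf-8-sig')


# Made-up rows standing in for the crawled data
sample = pd.DataFrame([
    {'good_name': 'Mask A', 'good_price': '12.90', 'com_count': 12000},
    {'good_name': 'Mask B', 'good_price': '15.00', 'com_count': 300},
])
save_results(sample, 'suning_sample.csv')
```

As with the to_sql call, index=False leaves the row index out of the output.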
