Scrapy Tutorial (3): How to Paginate and Scrape More Data

- Introduction
- Inspecting the pages
- How to paginate
- Full code
- How do you paginate an infinite-scroll site?

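Introduction

The previous tutorial (part 2) built a simple single-page crawler. In real use, though, we rarely want the data from just one page; we want it from many pages. So before crawling multiple pages, we first look at how the site's URLs change from page to page, which tells us how to reach the next page reliably.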

Inspecting the pages

First, look at the URL of the first page:

// URL of the first page
https://books.toscrape.com/catalogue/page-1.html

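Then look at the bottom of the first page: there are 50 pages in total, and clicking next in the bottom-right corner takes you to the next page.

Next, look at the second page:

// URL of the second page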
https://books.toscrape.com/catalogue/page-2.html

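Like the first page, it has a next link to move forward, plus a previous link to go back one page.

To sum up: as you move from page to page, the URL changes with the page number.

// Page 1
...page-1.html
// Page 2
...page-2.html
// And so on, up to page 50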
...page-50.html
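
Because the URL follows a fixed pattern, one option is to build every page URL up front. The snippet below is a minimal sketch of that idea, assuming the page count of 50 seen above does not change; `page_urls` and `BASE` are hypothetical names used only for illustration, not part of Scrapy.

```python
BASE = "https://books.toscrape.com/catalogue/page-{}.html"


def page_urls(last_page, first_page=1):
    """Build the listing-page URLs from first_page to last_page, inclusive."""
    return [BASE.format(n) for n in range(first_page, last_page + 1)]


urls = page_urls(50)
print(urls[0])    # first listing page
print(urls[-1])   # last listing page
print(len(urls))  # number of generated URLs
```

A list like this could be assigned to a spider's start_urls. The drawback is the hard-coded page count, so following the next link is usually the more robust choice.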

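How to paginate

Remember the parse function from part 2? Inside that function we add logic that extracts the link behind the next button. If the link exists, the crawler keeps going; if it doesn't, the crawl stops.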

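# Requires: from scrapy.http import Request
def parse(self, response):
    # Collect the URL of every book on the current page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # Join the relative link with the page URL to get an absolute URL
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

    # Extract the URL behind the "next" button; it is None on the last page
    next_page_url = response.xpath('//a[text()="next"]/@href').extract_first()
    if next_page_url:
        # Request the next page; with no callback given, Scrapy calls parse() again
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url)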
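
The href behind the next button is a relative link, which is why the code joins it with the current page URL before requesting it. The standard library's `urllib.parse.urljoin` shows the effect; Scrapy's `response.urljoin` resolves relative links in essentially the same way against the response's own URL. The href values below are assumptions about what the site returns, and `resolve_next` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_next(page_url, next_href):
    """Turn the relative href of the next button into an absolute URL."""
    return urljoin(page_url, next_href)


# Assumed href values; the real ones come from the XPath query in parse()
from_listing = resolve_next(
    "https://books.toscrape.com/catalogue/page-1.html", "page-2.html"
)
from_home = resolve_next("https://books.toscrape.com/", "catalogue/page-2.html")
print(from_listing)
print(from_home)
```

Both calls resolve to the same page-2 URL, so the spider does not need to care which page the relative link was found on.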

Full code

Parts that should already be familiar are not explained in detail. If you have questions, feel free to leave a comment.

from scrapy import Spider
from scrapy.http import Request


class BookWithScrapySpider(Spider):
    name = 'book_with_scrapy'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            url = response.urljoin(book)
            yield response.follow(url = url,
                                  callback = self.parse_book)

        # process next page; extract_first() returns None on the last page
        next_page_url = response.xpath('//a[text()="next"]/@href').extract_first()
        if next_page_url:
            absolute_next_page_url = response.urljoin(next_page_url)
            yield Request(url=absolute_next_page_url)

    def parse_book(self, response):
        title = response.xpath('//h1/text()').extract_first()
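        # NOTE: the attribute test of this XPath was lost in the source ('//*[@]/text()');
        # class="price_color" is an assumption based on the site's markup
        price = response.xpath('//*[@class="price_color"]/text()').extract_first()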

        image_url = response.xpath('//img/@src').extract_first()
        # Resolve the relative '../../' path against the page URL
        image_url = response.urljoin(image_url)

        rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
        # Drop the 'star-rating' prefix and the leftover space, e.g. 'Three'
        rating = rating.replace('star-rating', '').strip()

        description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()

        # product information
        upc = product_info(response, 'UPC')
        product_type = product_info(response, 'Product Type')
        price_without_tax = product_info(response, 'Price (excl. tax)')
        price_with_tax = product_info(response, 'Price (incl. tax)')
        tax = product_info(response, 'Tax')
        availability = product_info(response, 'Availability')
        number_of_reviews = product_info(response, 'Number of reviews')


        yield {'title': title,
               'price': price,
               'image_url': image_url,
               'rating': rating,
               'description': description,
               'upc': upc,
               'product_type': product_type,
               'price_without_tax': price_without_tax,
               'price_with_tax': price_with_tax,
               'tax': tax,
               'availability': availability,
               'number_of_reviews': number_of_reviews
            }
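
The spider calls `product_info`, which this listing does not define. Below is a minimal sketch of such a helper, assuming the product information sits in a table whose rows pair a `<th>` label with a `<td>` value. The XPath is an assumption about the page markup, not the author's confirmed code.

```python
def product_info(response, label):
    """Return the text of the <td> that follows the <th> whose text is `label`.

    Hypothetical helper: assumes rows shaped like <tr><th>UPC</th><td>...</td></tr>.
    """
    xpath = '//th[text()="{}"]/following-sibling::td/text()'.format(label)
    return response.xpath(xpath).extract_first()
```

It would need to be defined at module level, next to the spider class, so that the calls inside parse_book can find it.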

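How do you paginate an infinite-scroll site?

The example site in this post is a sandbox built for scraping practice, but most real-world websites are not this easy to crawl. You may run into pages whose URL never changes, JavaScript-rendered pages, pages that use infinite scroll, and so on. For infinite-scroll sites in particular, if you don't reverse-engineer the site or use its API, the best option at the moment is probably a browser-automation tool such as selenium, playwright, or puppeteer.

The following uses vlive as an example: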

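import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
# Uncomment to run headless, i.e. without opening a browser window
# options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument("--incognito")
options.add_argument("--user-agent=your_user_agent")

# Selenium 4 takes the chromedriver path through a Service object
service = Service("C:/Users/user/chromedriver_win32/chromedriver.exe")
vlive_driver = webdriver.Chrome(service=service, options=options)
vlive_driver.get("https://www.vlive.tv/channel/DCF447/board/1113")
vlive_driver.maximize_window()

# Login is not handled here.
# Keep scrolling to the bottom until the page height stops growing,
# which means no more content is being loaded.
last_height = vlive_driver.execute_script("return document.body.scrollHeight")
while True:
    try:
        vlive_driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give the page time to load the next batch
        new_height = vlive_driver.execute_script("return document.body.scrollHeight")
    except TimeoutException as ex:
        print("Exception has been thrown. " + str(ex))
        break
    if new_height == last_height:
        break
    last_height = new_height

vlive_driver.quit()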
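
For the reverse-engineering route mentioned earlier: infinite-scroll pages often load each batch of content from a paged endpoint, so a crawler can request the batches directly instead of scrolling. The sketch below shows only the loop. `fetch_page` is a hypothetical callable standing in for whatever request the site really makes, and stopping on an empty batch is an assumption that will not hold for every site.

```python
def iter_items(fetch_page, start=1, max_pages=1000):
    """Yield items batch by batch until the endpoint returns an empty batch.

    fetch_page(page_number) is a hypothetical callable returning a list of items.
    max_pages is a safety cap so the loop cannot run forever.
    """
    for page in range(start, start + max_pages):
        batch = fetch_page(page)
        if not batch:
            # An empty batch is taken to mean there is nothing left to load
            return
        yield from batch


# Stand-in for a real endpoint: three batches, then nothing
fake_data = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}
items = list(iter_items(lambda page: fake_data.get(page, [])))
print(items)
```

In practice `fetch_page` would wrap an HTTP request discovered in the browser's network tab, and the page parameter might be an offset or a cursor instead of a page number.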

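Learning to scrape means learning anti-scraping at the same time. Only by understanding how sites block crawlers can you keep improving your own.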
