1. Web scraping
2. Web scraping tools
3. Case study
1. Web scraping
The goal is to extract data from websites.
(1) Noisy, with weak labels, and can be spammy
(2) Available at scale
(3) Example use: price comparison/tracking websites
Many ML datasets are obtained by web scraping.
(1) ImageNet, Kinetics
Web crawling vs. web scraping
(1) Crawling: indexing whole pages across the internet
(2) Scraping: extracting particular data from the web pages of a website
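The difference between crawling and scraping can be sketched in a few lines. This is a minimal illustration, not a production crawler: the three-page `SITE` dictionary, the `price` class name, and the helper names `crawl` and `scrape_price` are all hypothetical, and the pages are held in memory so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page site held in memory so the sketch runs offline.
SITE = {
    "/": '<a href="/item/1">Item 1</a> <a href="/item/2">Item 2</a>',
    "/item/1": '<h1>Lamp</h1><span class="price">$20</span> <a href="/">Home</a>',
    "/item/2": '<h1>Desk</h1><span class="price">$150</span> <a href="/">Home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag (what a crawler needs)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceParser(HTMLParser):
    """Collects the text inside <span class="price"> (what a scraper needs)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def crawl(start):
    """Breadth-first traversal: discover every page reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(SITE[url])
        for link in parser.links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)


def scrape_price(url):
    """Extract one specific field from one known page."""
    parser = PriceParser()
    parser.feed(SITE[url])
    return parser.prices[0] if parser.prices else None


pages = crawl("/")                                   # crawling: index all pages
prices = {url: scrape_price(url) for url in pages}   # scraping: pull one field
```

Crawling only cares about which pages exist and how they link together. Scraping targets a particular element on pages whose layout is already known.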
2. Web scraping tools
"curl" often doesn't work: website owners use various ways to stop bots.
Use a headless browser: a web browser without a GUI.
You need a lot of new IPs, which are easy to get through public clouds.
Of all IPv4 addresses, AWS owns 1.75%, Azure 0.55%, and GCP 0.25%.
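    from selenium import webdriver

    # Run Chrome without a GUI (headless mode)
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--headless=new")
    chrome = webdriver.Chrome(options=chrome_options)

    # url is the address of the page to scrape; get() itself returns None
    chrome.get(url)
    page = chrome.page_source  # rendered HTML as a string

3. Case study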
Query houses sold near Stanford.
You can replace the city and state in the URL to query other places.
Crawl individual pages.
Identify the HTML elements through the browser's Inspect tool.
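The steps above might be sketched as follows. Everything specific here is an assumption made for illustration: the notes do not give the real URL, so `SEARCH_URL` uses a placeholder `example.com` pattern, and the class names `sold-price` and `address` stand in for whatever Inspect shows on the actual site. `SAMPLE_PAGE` is made-up HTML so the example runs without downloading anything.

```python
from html.parser import HTMLParser
from urllib.parse import quote

# Hypothetical URL pattern; substitute the pattern used by the real site.
SEARCH_URL = "https://example.com/{city}-{state}/sold/"


def search_url(city, state):
    """Fill the city and state into the search URL."""
    slug = quote(city.strip().lower().replace(" ", "-"))
    return SEARCH_URL.format(city=slug, state=quote(state.strip().lower()))


class FieldParser(HTMLParser):
    """Collects the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class name of the element being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current] = data.strip()


# Made-up HTML standing in for one downloaded page.
SAMPLE_PAGE = """
<div class="listing">
  <span class="sold-price">$1,500,000</span>
  <span class="address">123 Example St, Stanford, CA</span>
</div>
"""

url = search_url("Stanford", "CA")
parser = FieldParser(["sold-price", "address"])  # names found via Inspect
parser.feed(SAMPLE_PAGE)
record = parser.fields
```

In practice the HTML would come from the headless browser's `page_source` rather than a string literal, and a dedicated HTML parsing library may be more convenient than the standard-library parser used here.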



