Once you have installed PhantomJS, make sure the phantomjs binary is available in the current path:

    phantomjs --version
    # result:
    2.1.1
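The same check can be done from Python before starting a scrape. This is only a minimal sketch using the standard library; `find_phantomjs` and `phantomjs_version` are hypothetical helper names, not part of PhantomJS or Selenium.

```python
import shutil
import subprocess


def find_phantomjs(binary_name="phantomjs"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary_name)


def phantomjs_version(binary_path):
    """Run `<binary> --version` and return its output without whitespace."""
    completed = subprocess.run(
        [binary_path, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    path = find_phantomjs()
    if path is None:
        print("phantomjs not found on PATH")
    else:
        print(path, phantomjs_version(path))
```

If `find_phantomjs()` returns `None`, the binary is not on the path and `webdriver.PhantomJS()` would most likely fail to start.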
Example

To give an example, I created a sample page with the following HTML code.
    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>Javascript scraping test</title>
    </head>
    <body>
      <p id='intro-text'>No javascript support</p>
      <script>
         document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
      </script>
    </body>
    </html>

Without javascript it says: No javascript support, and with javascript: Yay! Supports javascript
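The snippets that follow use a `my_url` variable that is never defined here. One possible way to get such a URL, sketched with the standard library only, is to serve the sample page from a local HTTP server. `SamplePageHandler` and `start_sample_server` are hypothetical names, and this is an assumption about the setup, not necessarily how the page was originally hosted.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>
"""


class SamplePageHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the sample page."""

    def do_GET(self):
        body = SAMPLE_HTML.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request logging


def start_sample_server():
    """Start the server on a free local port and return (server, url)."""
    server = HTTPServer(("127.0.0.1", 0), SamplePageHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}/"


server, my_url = start_sample_server()
# A plain HTTP fetch returns the markup as written; the script is not executed.
raw_html = urllib.request.urlopen(my_url).read().decode("utf-8")
print("No javascript support" in raw_html)
server.shutdown()
server.server_close()
```

Keep the server running (skip the last two lines) for as long as a scraper needs to visit `my_url`.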
Scraping without JS support:

    import requests
    from bs4 import BeautifulSoup

    # my_url is the address of the sample page
    response = requests.get(my_url)
    soup = BeautifulSoup(response.text, "html.parser")
    soup.find(id="intro-text")
    # Result:
    # <p id="intro-text">No javascript support</p>
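The reason for this result is that an HTML parser treats the contents of the script tag as inert text and never runs it. A small sketch with the standard library's `html.parser` shows the same behaviour without any third-party package; `TextById` is a hypothetical helper that only handles elements without nested tags, which is enough for the sample page.

```python
from html.parser import HTMLParser

SAMPLE_HTML = (
    "<!DOCTYPE html><html><head><meta charset=\"utf-8\">"
    "<title>Javascript scraping test</title></head><body>"
    "<p id='intro-text'>No javascript support</p>"
    "<script>document.getElementById('intro-text').innerHTML = "
    "'Yay! Supports javascript';</script></body></html>"
)


class TextById(HTMLParser):
    """Collect the text of the first element with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self._inside = True

    def handle_endtag(self, tag):
        # The target element has no nested tags, so the next end tag closes it.
        self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


parser = TextById("intro-text")
parser.feed(SAMPLE_HTML)
print(parser.text)  # the script body was parsed as text, never executed
```

To see the text that the script writes, the page has to be loaded in something that actually executes javascript, such as a headless browser.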
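Scraping with JS support:

    from selenium import webdriver

    # PhantomJS support was removed in Selenium 4, so this needs an older Selenium release
    driver = webdriver.PhantomJS()
    driver.get(my_url)
    p_element = driver.find_element_by_id(id_='intro-text')
    print(p_element.text)
    # result:
    # Yay! Supports javascript
    driver.quit()  # shut down the PhantomJS process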
You can also use the Python library dryscrape to scrape javascript driven websites.

Scraping with JS support:

    import dryscrape
    from bs4 import BeautifulSoup

    session = dryscrape.Session()
    session.visit(my_url)
    response = session.body()
    soup = BeautifulSoup(response, "html.parser")
    soup.find(id="intro-text")
    # Result:
    # <p id="intro-text">Yay! Supports javascript</p>



