Parsing the HTML of an entire web page

I would still stick with the Twitter API.

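Alternatively, here is how the problem can be approached with selenium:

  • Use Explicit Waits and define a custom Expected Condition that waits for more tweets to load after scrolling
  • Scroll to the last loaded tweet with scrollIntoView()

Implementation: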

    from selenium import webdriver
    from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC


    class wait_for_more_than_n_elements_to_be_present(object):
        """Custom expected condition: true once more than `count` elements match."""

        def __init__(self, locator, count):
            self.locator = locator
            self.count = count

        def __call__(self, driver):
            try:
                # count every element currently matching the locator
                elements = driver.find_elements(*self.locator)
                return len(elements) > self.count
            except StaleElementReferenceException:
                return False


    url = "https://twitter.com/ndtv"
    driver = webdriver.Firefox()
    driver.maximize_window()
    driver.get(url)

    # initial wait for the tweets to load
    wait = WebDriverWait(driver, 10)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li[data-item-id]")))

    # scroll down to the last tweet until no more tweets are loaded
    while True:
        tweets = driver.find_elements(By.CSS_SELECTOR, "li[data-item-id]")
        number_of_tweets = len(tweets)

        driver.execute_script("arguments[0].scrollIntoView();", tweets[-1])

        try:
            wait.until(wait_for_more_than_n_elements_to_be_present((By.CSS_SELECTOR, "li[data-item-id]"), number_of_tweets))
        except TimeoutException:
            break

This keeps scrolling down for as long as needed to load all of the existing tweets in this channel.
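
To make the wait-and-scroll logic easier to follow without a browser, here is a rough pure-Python model of it. This is a simplified sketch, not Selenium's actual implementation: `FakeDriver`, `more_than_n_items`, `poll_until` and `load_all` are hypothetical names invented for illustration, and the fake driver simply pretends that one more item appears on each poll.

```python
import time


class FakeDriver:
    """Hypothetical stand-in for a browser: each poll reveals one more item."""

    def __init__(self, start, maximum):
        self.loaded = start
        self.maximum = maximum

    def count_items(self):
        # Simulate lazy loading: one more item appears per poll, up to a cap.
        if self.loaded < self.maximum:
            self.loaded += 1
        return self.loaded


class more_than_n_items:
    """Condition object: truthy once the driver reports more than `count` items."""

    def __init__(self, count):
        self.count = count

    def __call__(self, driver):
        return driver.count_items() > self.count


def poll_until(driver, condition, timeout=1.0, poll_interval=0.01):
    """Call condition(driver) repeatedly until it is truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met within the timeout")
        time.sleep(poll_interval)


def load_all(driver, timeout=0.05):
    """Mirror the scroll loop: wait for growth, stop on the first timeout."""
    seen = driver.loaded
    while True:
        try:
            poll_until(driver, more_than_n_items(seen), timeout=timeout)
        except TimeoutError:
            return seen
        seen = driver.loaded


print(load_all(FakeDriver(start=3, maximum=6)))
```

The point of the design is that the timeout doubles as the stop signal: as long as the count keeps growing the loop continues, and the first wait that expires without new items is taken to mean the end of the timeline has been reached.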


Here is the HTML-parsing snippet that extracts the tweets:

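    from bs4 import BeautifulSoup

    page_source = driver.page_source
    driver.close()

    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(page_source, "html.parser")
    for tweet in soup.select("div.tweet div.content"):
        print(tweet.p.text)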

It prints:

    Father's Day Facebook post by arrested cop Suhas Gokhale's son got nearly 10,000 likes http://goo.gl/aPqlxf  pic.twitter.com/JUqmdWNQ3c
    #HWL2015 End of third quarter! Breathtaking stuff. India 2-2 Pakistan - http://sports.ndtv.com/hockey/news/244463-hockey-world-league-semifinal-india-vs-pakistan-antwerp …
    Why these Kashmiri boys may miss their IIT dream http://goo.gl/9LVKfK  pic.twitter.com/gohX21Gibi
    ...
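
To try the selector without launching a browser, the parsing step can be run against a small hand-written document. The markup below is hypothetical, only loosely modeled on the structure that "div.tweet div.content" expects, and `extract_tweet_texts` is a helper name made up for this sketch. It assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the structure the selector expects.
SAMPLE_HTML = """
<ol>
  <li data-item-id="1">
    <div class="tweet"><div class="content"><p>First tweet</p></div></div>
  </li>
  <li data-item-id="2">
    <div class="tweet"><div class="content"><p>Second tweet</p></div></div>
  </li>
  <li data-item-id="3">
    <div class="promo"><div class="content"><p>Not a tweet</p></div></div>
  </li>
</ol>
"""


def extract_tweet_texts(html):
    """Return the text of the first <p> in each div.content under a div.tweet."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for content in soup.select("div.tweet div.content"):
        # content.p is the first <p> descendant, or None if there is none.
        if content.p is not None:
            texts.append(content.p.text)
    return texts


print(extract_tweet_texts(SAMPLE_HTML))  # ['First tweet', 'Second tweet']
```

The third list item is skipped because its content block does not sit under a div with the tweet class, which is how the descendant selector keeps unrelated blocks out of the result.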

