
How to use Python to extract HTML links with a matching word from a website


You need to search for the word india in the displayed text of each link. To do this, you need a custom function:

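    from bs4 import BeautifulSoup
    import requests

    url = "http://www.bbc.com/news/world/asia/"
    r = requests.get(url)
    # Pass the raw bytes; naming the parser explicitly avoids bs4's
    # "no parser was explicitly specified" warning.
    soup = BeautifulSoup(r.content, "html.parser")

    # Match <a> tags that have an href and mention "india" in their text.
    india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                               'href' in tag.attrs and
                               'india' in tag.get_text().lower())
    results = soup.find_all(india_links)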

The india_links lambda finds all <a> tags that are links with an href attribute and that contain india (case-insensitively) somewhere in their displayed text.
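The same filter can be written as a named function that takes the search word as a parameter, which may be easier to reuse and test. The following is only a sketch: SAMPLE_HTML and links_mentioning are hypothetical names introduced here, and the inline markup stands in for a downloaded page so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><a href="/news/india/">India</a></li>
  <li><a href="/news/china/">China</a></li>
  <li><a name="india-anchor">India (no href)</a></li>
  <li><a href="/news/fog"><span>INDIA fog continues</span></a></li>
</ul>
"""


def links_mentioning(word):
    """Build a find_all() filter for <a href> tags whose text contains word."""
    needle = word.lower()

    def matches(tag):
        return (
            tag.name == "a"
            and tag.has_attr("href")
            and needle in tag.get_text().lower()
        )

    return matches


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
found = soup.find_all(links_mentioning("india"))
hrefs = [tag["href"] for tag in found]
print(hrefs)  # ['/news/india/', '/news/fog']
```

The anchor without an href and the link about China are skipped, while the link whose text sits inside a nested <span> is still found, because get_text() collects text from all descendants.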

Note that I used the .content attribute of the requests response object. Leave the decoding to BeautifulSoup!
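One reason to hand over bytes is that BeautifulSoup can take encoding declarations inside the markup into account when decoding. The sketch below assumes a made-up page body (the raw variable is hypothetical) encoded as ISO-8859-1 that declares its charset in a <meta> tag:

```python
from bs4 import BeautifulSoup

# Hypothetical page body: Latin-1 bytes that declare their own charset.
raw = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body><p>Caf\u00e9 culture in India</p></body></html>"
).encode("iso-8859-1")

# Passing bytes lets BeautifulSoup pick the encoding itself.
soup = BeautifulSoup(raw, "html.parser")
paragraph = soup.p.get_text()

print(paragraph)
print(soup.original_encoding)  # the encoding BeautifulSoup settled on
```

With a real response, r.content plays the role of raw here.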

Demo:

    >>> from bs4 import BeautifulSoup
    >>> import requests
    >>> url = "http://www.bbc.com/news/world/asia/"
    >>> r = requests.get(url)
    >>> soup = BeautifulSoup(r.content)
    >>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
    >>> results = soup.find_all(india_links)
    >>> from pprint import pprint
    >>> pprint(results)
    [<a href="/news/world/asia/india/">India</a>,
     <a  href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
     <a  href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
     <a  href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
     <a href="/news/world/asia/india/">India</a>,
     <a  href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span >India fog continues to cause chaos</span></a>,
     <a  href="/news/world-asia-india-30632852" rel="published-1419940599384"><span >Court boost to India BJP chief</span></a>,
     <a  href="/sport/0/cricket/30632182" rel="published-1419930930045"><span >India captain Dhoni quits Tests</span></a>,
     <a  href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
     <a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]

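Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here; we have to use the lambda search because a search with a text regular expression would not have found that element. The contained text (Special report: India Direct) is not the only element in the tag, so a text search would not match it.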

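A similar problem applies to the /news/world-asia-india-30632852 link; the nested <span> element means that the Court boost to India BJP chief headline text is not a direct child element of the link tag.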
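The difference between the two search styles can be reproduced on a small, made-up snippet. This is a sketch under assumptions: SAMPLE_HTML and mentions_india are hypothetical names, the markup only imitates the link shapes from the demo output, and string= is the current name of the argument that older Beautiful Soup releases called text=.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modelled on three link shapes from the demo output.
SAMPLE_HTML = (
    '<a href="/plain">India</a>'
    '<a href="/img-text"><img alt="" src="x.jpg"/>Special report: India Direct</a>'
    '<a href="/img-span"><img alt="" src="y.jpg"/><span>India fog continues</span></a>'
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Regex search on the tag's string: only matches when the tag has a
# single usable .string, so links with an <img> sibling are missed.
pattern = re.compile("india", re.IGNORECASE)
by_string = soup.find_all("a", string=pattern)


def mentions_india(tag):
    # Function filter: get_text() gathers text from every descendant.
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and "india" in tag.get_text().lower()
    )


by_function = soup.find_all(mentions_india)

print([tag["href"] for tag in by_string])    # ['/plain']
print([tag["href"] for tag in by_function])  # ['/plain', '/img-text', '/img-span']
```

Links that contain more than one child node have no single .string, which is why the regex search skips them while the function filter does not.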

You can extract just the links with:

    from urllib.parse import urljoin

    result_links = [urljoin(url, tag['href']) for tag in results]

This resolves every relative URL against the original URL:

    >>> from urllib.parse import urljoin
    >>> result_links = [urljoin(url, tag['href']) for tag in results]
    >>> pprint(result_links)
    ['http://www.bbc.com/news/world/asia/india/',
     'http://www.bbc.com/news/world-asia-india-30647504',
     'http://www.bbc.com/news/world-asia-india-30640444',
     'http://www.bbc.com/news/world-asia-india-30640436',
     'http://www.bbc.com/news/world/asia/india/',
     'http://www.bbc.com/news/world-asia-india-30630274',
     'http://www.bbc.com/news/world-asia-india-30632852',
     'http://www.bbc.com/sport/0/cricket/30632182',
     'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
     'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
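How urljoin treats the different kinds of href can be checked without fetching anything. In this sketch the hrefs list is a hand-picked subset of the demo values plus one hypothetical path-relative entry ("india/"), and the de-duplication step is an optional extra that is not part of the original answer.

```python
from urllib.parse import urljoin

base = "http://www.bbc.com/news/world/asia/"
hrefs = [
    "/news/world/asia/india/",                                # root-relative
    "http://www.bbc.co.uk/news/world-radio-and-tv-15386555",  # already absolute
    "/news/world/asia/india/",                                # duplicate link
    "india/",                                                 # hypothetical path-relative href
]

# Root-relative and path-relative hrefs are resolved against base;
# absolute URLs are returned unchanged.
absolute = [urljoin(base, href) for href in hrefs]

# Optional: drop duplicates while keeping first-seen order.
unique = list(dict.fromkeys(absolute))

for link in unique:
    print(link)
```

Here three of the four hrefs resolve to the same address, so unique ends up with two entries.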

