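One possible way to approach this is to add special handling for `a` elements when the text of an element is printed. You can do that by overriding the `_all_strings()` method so that it yields the string representation of each `a` descendant and skips the navigable strings inside `a` elements. Something along these lines: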

    from bs4 import BeautifulSoup, NavigableString, CData, Tag


    class MyBeautifulSoup(BeautifulSoup):
        # Note: _all_strings() is a private method, so its signature and
        # defaults may differ between Beautiful Soup releases.
        def _all_strings(self, strip=False, types=(NavigableString, CData)):
            for descendant in self.descendants:
                # return "a" string representation if we encounter it
                if isinstance(descendant, Tag) and descendant.name == 'a':
                    yield str(descendant)

                # skip an inner text node inside "a"
                if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                    continue

                # default behavior
                if (
                    (types is None and not isinstance(descendant, NavigableString))
                    or
                    (types is not None and type(descendant) not in types)):
                    continue

                if strip:
                    descendant = descendant.strip()
                    if len(descendant) == 0:
                        continue
                yield descendant

Demo:

    In [1]: data = """
       ...: <td>
       ...: <font><span>Hello</span><span>World</span></font><br>
       ...: <span>Foo Bar <span>Baz</span></span><br>
       ...: <span>Example link: <a href="https://google.com" target="_blank" >Google</a></span>
       ...: </td>
       ...: """

    In [2]: soup = MyBeautifulSoup(data, "lxml")

    In [3]: print(soup.get_text())
    HelloWorld
    Foo Bar Baz
    Example link: <a href="https://google.com" target="_blank">Google</a>

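
Since `_all_strings()` is a private method, a variant that does not subclass `BeautifulSoup` may be easier to keep working across versions. What follows is only a minimal sketch and not part of the approach above: `text_keeping_links` is a hypothetical helper name, and the example uses the built-in `html.parser` backend instead of `lxml` so that it needs no extra dependency. It differs from the override in two ways: it skips strings nested at any depth inside an `a` element (not just direct children), and it drops comments.

```python
from bs4 import BeautifulSoup, Comment, Tag


def text_keeping_links(root):
    """Return the text of `root`, keeping <a> elements as HTML markup.

    Hypothetical helper for illustration; it relies only on the public
    `.descendants` and `find_parent()` API.
    """
    parts = []
    for node in root.descendants:
        if isinstance(node, Tag):
            # Emit the whole <a> element as markup, once, at the outermost level.
            if node.name == "a" and node.find_parent("a") is None:
                parts.append(str(node))
            continue

        # From here on, node is a string-like node (NavigableString or subclass).
        if isinstance(node, Comment):
            continue
        # Text anywhere inside an <a> was already emitted with the element.
        if node.find_parent("a") is not None:
            continue
        parts.append(str(node))
    return "".join(parts)


html = '<span>Example link: <a href="https://google.com" target="_blank" >Google</a></span>'
soup = BeautifulSoup(html, "html.parser")
print(text_keeping_links(soup))
```

The trade-off is that `soup.get_text()` itself is left unchanged, so callers have to use the helper explicitly.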



