For some reason this JavaScript-rendered page can't be fetched with curl, so I copied the HTML node out of the browser instead and scraped it with the following code:
```python
from lxml import etree

# XPath samples copied from the browser's dev tools:
#   //*[@id="catalog-undefined"]/span/a
#   /body/div[1]/div[5]/div/div/div[1]/div[2]/div[1]/div[1]/nav/div/div[2]/div/div/div/div[1]/div[3]/div[1]/div/div/div[8]/span[3]/span/div/span/span/a
#   /body/div[1]/div[5]/div/div/div[1]/div[2]/div[1]/div[1]/nav/div/div[2]/div/div/div/div[1]/div[3]/div[1]/div/div/div[11]/span[3]/span/div/span/span/a
# Only the div index varies between entries, so `//` generalizes over it:
xpath = ('/html/body/div[1]/div[5]/div/div/div[1]/div[2]/div[1]/div[1]'
         '/nav/div/div[2]/div/div/div/div[1]/div[3]/div[1]/div/div'
         '//span[3]/span/div/span/span/a')
file = 'yashu_body.html'
base = 'https://www.yuque.com'

html = etree.parse(file, etree.HTMLParser())
for e in html.xpath(xpath):
    # note: the separator must be '\n', not the literal 'n'
    print(e.attrib['title'] + '\n' + base + e.attrib['href'])
```
I'm not sure whether there's a better way to do this.
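One way that might be more robust than the long positional path is to select the catalog links by their attributes instead of their position, since the deep `div[n]` chain breaks as soon as Yuque changes its page layout. Below is a minimal self-contained sketch of that idea; the inline `sample` HTML and the assumption that every catalog entry is an `<a>` carrying both `title` and `href` are mine, inferred from the code above, not taken from the actual saved page.

```python
from lxml import etree

base = 'https://www.yuque.com'

# Hypothetical stand-in for the saved yashu_body.html; the real page's
# structure may differ, but the attribute-based XPath below does not
# depend on the surrounding div nesting at all.
sample = '''
<div id="catalog-undefined">
  <span><a title="Chapter 1" href="/book/ch1">Chapter 1</a></span>
  <span><a title="Chapter 2" href="/book/ch2">Chapter 2</a></span>
</div>
'''

tree = etree.HTML(sample)
# Match any link that has both attributes, wherever it sits in the tree.
links = [(a.get('title'), base + a.get('href'))
         for a in tree.xpath('//a[@title and @href]')]
for title, url in links:
    print(title + '\n' + url)
```

If the page contains unrelated links that also carry `title`, the predicate can be narrowed, e.g. `//div[@id="catalog-undefined"]//a[@title]`, anchoring on an `id` rather than on positional indices.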



