
How do I solve the problem of parsing an HTML file that contains Cyrillic characters?

lxml

(As observed, this is a bit murky across system encodings: it apparently does not work properly on Windows XP, although it did work on Linux.)

I got it working by decoding the source string:

tree = html.fromstring(source.decode('utf-8'))

# -*- coding: cp1251 -*-
import lxml
from lxml import html

filename = "t.html"
fread = open(filename, 'r')
source = fread.read()
# Decode the raw bytes so that lxml builds the tree from unicode text
tree = html.fromstring(source.decode('utf-8'))
fread.close()

tags = tree.xpath('//span[@class="one" and text()="Text"]')  # This is OK
print "name: ", tags[0].text
print "value: ", tags[0].tail

tags = tree.xpath('//span[@class="two" and text()="Привет"]')  # This is now OK too
print "name: ", tags[0].text
print "value: ", tags[0].tail

This means the actual tree consists entirely of unicode objects. If you only make the XPath argument a unicode object, it finds 0 matches.
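To illustrate why the decoding step matters, here is a minimal sketch in Python 3, where str plays the role that unicode plays in the Python 2 code above. The sample markup and the variable names are assumptions made for the illustration, not taken from the original file.

```python
# Python 3 sketch: raw cp1251 bytes versus properly decoded text.
# "raw" stands in for what open(filename, "rb").read() would return.
raw = "<span>Привет</span>".encode("cp1251")

needle = "Привет"

# If the bytes are misread (here as Latin-1), the Cyrillic text turns
# into unrelated characters and the search string is never found.
misread = raw.decode("latin-1")
misread_match = needle in misread

# Decoding with the encoding the file was actually saved in gives
# real text, and the comparison succeeds.
text = raw.decode("cp1251")
text_match = needle in text

print(misread_match, text_match)
```

The same reasoning applies to the XPath comparison: both sides of text()="..." have to be the same kind of text before they can match.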

BeautifulSoup

In any case, I prefer BeautifulSoup for this sort of thing. Here is my interactive session; I saved the file in cp1251.

>>> from BeautifulSoup import BeautifulSoup
>>> filename = '/tmp/cyrillic'
>>> fread = open(filename, 'r')
>>> source = fread.read()
>>> source  # Scary
'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\xcf\xf0\xe8\xe2\xe5\xf2</span>\xd2\xe5\xea\xf1\xf2 \xed\xe0 \xf0\xf3\xf1\xf1\xea\xee\xec</br>\n</body>\n</html>\n'
>>> source = source.decode('cp1251')  # Let's try getting this right.
>>> source
u'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\u041f\u0440\u0438\u0432\u0435\u0442</span>\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c</br>\n</body>\n</html>\n'
>>> soup = BeautifulSoup(source)
>>> soup  # OK, that's looking right now. Note the </br> was dropped as that's bad HTML with no meaning.
<html><body><span class="one">Text</span>some text<span class="two">Привет</span>Текст на русском</body></html>
>>> soup.find('span', 'one').findNextSibling(text=True)
u'some text'
>>> soup.find('span', 'two').findNextSibling(text=True)  # This looks a bit daunting ...
u'\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c'
>>> print _  # ... but it's not, really. Just Unicode chars.
Текст на русском
>>> # Then you may also wish to get things by text:
>>> print soup.find(text=u'Привет').findParent().findNextSibling(text=True)
Текст на русском
>>> # You can't get things by attributes and the contained NavigableString at the same time, though. That may be a limitation.

Finally, when the file comes from the filesystem, it may be worth trying source.decode('cp1251') instead of source.decode('utf-8'). With that change, lxml may actually work.
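When the encoding of a file is not known in advance, one option is to try a short list of candidates in order. The following is a sketch in Python 3 under stated assumptions: decode_html is a hypothetical helper, not part of lxml or BeautifulSoup, and the candidate list and sample markup are illustrative only.

```python
def decode_html(raw, encodings=("utf-8", "cp1251")):
    """Hypothetical helper: decode raw bytes with the first encoding that fits.

    Returns a (text, encoding_name) tuple.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings matched")


# Sample bytes as they might be read from a file saved in cp1251.
sample = '<span class="two">Привет</span>Текст на русском'.encode("cp1251")

text, used = decode_html(sample)
print(used)
print(text)
```

The strict encoding (UTF-8) goes first on purpose: cp1251 accepts almost any byte sequence, so putting it first would hide the fact that a file is really UTF-8. The decoded text can then be passed to html.fromstring or BeautifulSoup as in the examples above.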


