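Short answer:
- Scrapy/Parsel selectors' `.re()` and `.re_first()` methods replace HTML entities (except `&lt;` and `&amp;`).
- Instead, use `.extract()` or `.extract_first()` to get the raw HTML (or raw JavaScript instructions), and use Python's `re` module on the extracted string.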
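The second point can be sketched with the standard library alone. This is a minimal illustration, assuming `raw_script` holds the string that `.extract_first()` returned for the script's text node; the variable names and the regex pattern are hypothetical and only fit this sample input.

```python
import html
import re

# Assumption: this is what
# selector.xpath('//div/script/text()').extract_first() returned.
raw_script = "\n        var i = {a:['O&#39;Connor Park']}\n    "

# Hypothetical pattern for this sample: capture the value between a:[' and '].
match = re.search(r"a:\['(.*?)'\]", raw_script)
raw_value = match.group(1) if match else None

# The entity is still intact here; decode it only if and when you want to.
decoded_value = html.unescape(raw_value) if raw_value is not None else None

print(raw_value)
print(decoded_value)
```

Because the regex runs on the raw string, the entity survives the match, and decoding becomes an explicit, separate step.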
Long answer:
Let's look at a sample input and the various ways of extracting JavaScript data from HTML.
Sample HTML:
    <html lang="en">
    <body>
    <div>
        <script type="text/javascript">
            var i = {a:['O&#39;Connor Park']}
        </script>

    </div>
    </body>
    </html>

Using a Scrapy `Selector` (which uses the parsel library underneath), you can extract the JavaScript snippet in several ways:
    >>> import scrapy
    >>> t = """<html lang="en">
    ... <body>
    ... <div>
    ...     <script type="text/javascript">
    ...         var i = {a:['O&#39;Connor Park']}
    ...     </script>
    ...
    ... </div>
    ... </body>
    ... </html>
    ... """
    >>> selector = scrapy.Selector(text=t, type="html")
    >>>
    >>> # extracting the <script> element as raw HTML
    >>> selector.xpath('//div/script').extract_first()
    u'<script type="text/javascript">\n        var i = {a:[\'O&#39;Connor Park\']}\n    </script>'
    >>>
    >>> # only getting the text node inside the <script> element
    >>> selector.xpath('//div/script/text()').extract_first()
    u"\n        var i = {a:['O&#39;Connor Park']}\n    "

Now, using `.re()` (or `.re_first()`), you get a different result:
    >>> # I'm using a very simple "catch-all" regex
    >>> # you are probably using a regex to extract
    >>> # that specific "O'Connor Park" string
    >>> selector.xpath('//div/script/text()').re_first('.+')
    u"        var i = {a:['O'Connor Park']}"
    >>>
    >>> # .re() on the element itself, one needs to handle newlines
    >>> selector.xpath('//div/script').re_first('.+')
    u'<script type="text/javascript">'    # only first line extracted
    >>> import re
    >>> selector.xpath('//div/script').re_first(re.compile('.+', re.DOTALL))
    u'<script type="text/javascript">\n        var i = {a:[\'O\'Connor Park\']}\n    </script>'

The HTML entity `&#39;` has been replaced by an apostrophe. This is due to a `w3lib.html.replace_entities()` call in the `.re()`/`.re_first()` implementation (see the `extract_regex` function in the parsel source code), which is not used when simply calling `extract()` or `extract_first()`.
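To make the mechanism concrete, here is a rough standard-library stand-in for what happens after the regex matches. It is not parsel's or w3lib's actual code: `replace_entities_sketch` and `re_first_sketch` are hypothetical names, and the only behavior modeled is "decode entities in the match, but leave `&lt;` and `&amp;` alone".

```python
import html
import re

# Matches a named or numeric entity such as &amp; or &#39;
ENTITY_RE = re.compile(r"&(#?\w+);")


def replace_entities_sketch(text, keep=("lt", "amp")):
    """Decode HTML entities in text, leaving the ones named in keep untouched."""
    def _decode(m):
        if m.group(1).lower() in keep:
            return m.group(0)  # leave &lt; and &amp; as they are
        return html.unescape(m.group(0))
    return ENTITY_RE.sub(_decode, text)


def re_first_sketch(pattern, text):
    """Return the first regex match with entities decoded, or None."""
    m = re.search(pattern, text)
    return replace_entities_sketch(m.group(0)) if m else None


raw = "var i = {a:['O&#39;Connor Park']}; var ok = 1 &lt; 2 &amp;&amp; true"
result = re_first_sketch(r".+", raw)
print(result)
```

The apostrophe entity is decoded while `&lt;` and `&amp;` pass through unchanged, which matches the behavior described above. Newer parsel releases also appear to accept a `replace_entities=False` argument on `.re()` and `.re_first()`; check the documentation for your installed version before relying on it.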



