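The data is inside a script tag. You can get the script tag using bs4 and a regex. You could also extract the data itself with a regex, but I like using js2xml to parse the js function into an xml tree: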
    from bs4 import BeautifulSoup
    import requests
    import re
    import js2xml

    soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser")
    # Find the script tag whose text contains the Highcharts chart definition.
    script = soup.find("script", text=re.compile("Highcharts.Chart")).text
    # For precipitation data, use this instead:
    # script = soup.find("script", text=re.compile("precipchartcontainer")).text
    # Parse the JavaScript source into an xml tree and print it.
    parsed = js2xml.parse(script)
    print(js2xml.pretty_print(parsed))

That gives you:
    <program>
      <functioncall>
        <function>
          <identifier name="$"/>
        </function>
        <arguments>
          <funcexpr>
            <identifier/>
            <parameters/>
            <body>
              <var name="chart"/>
              <functioncall>
                <function>
                  <dotaccessor>
                    <object>
                      <functioncall>
                        <function>
                          <identifier name="$"/>
                        </function>
                        <arguments>
                          <identifier name="document"/>
                        </arguments>
                      </functioncall>
                    </object>
                    <property>
                      <identifier name="ready"/>
                    </property>
                  </dotaccessor>
                </function>
                <arguments>
                  <funcexpr>
                    <identifier/>
                    <parameters/>
                    <body>
                      <assign operator="=">
                        <left>
                          <identifier name="chart"/>
                        </left>
                        <right>
                          <new>
                            <dotaccessor>
                              <object>
                                <identifier name="Highcharts"/>
                              </object>
                              <property>
                                <identifier name="Chart"/>
                              </property>
                            </dotaccessor>
                            <arguments>
                              <object>
                                <property name="chart">
                                  <object>
                                    <property name="renderTo">
                                      <string>tempchartcontainer</string>
                                    </property>
                                    <property name="type">
                                      <string>spline</string>
                                    </property>
                                  </object>
                                </property>
                                <property name="credits">
                                  <object>
                                    <property name="enabled">
                                      <boolean>false</boolean>
                                    </property>
                                  </object>
                                </property>
                                <property name="colors">
                                  <array>
                                    <string>#FF8533</string>
                                    <string>#4572A7</string>
                                  </array>
                                </property>
                                <property name="title">
                                  <object>
                                    <property name="text">
                                      <string>Average Temperature (°c) Graph for Brussels</string>
                                    </property>
                                  </object>
                                </property>
                                <property name="xAxis">
                                  <object>
                                    <property name="categories">
                                      <array>
                                        <string>January</string>
                                        <string>February</string>
                                        <string>March</string>
                                        <string>April</string>
                                        <string>May</string>
                                        <string>June</string>
                                        <string>July</string>
                                        <string>August</string>
                                        <string>September</string>
                                        <string>October</string>
                                        <string>November</string>
                                        <string>December</string>
                                      </array>
                                    </property>
                                    <property name="labels">
                                      <object>
                                        <property name="rotation">
                                          <number value="270"/>
                                        </property>
                                        <property name="y">
                                          <number value="40"/>
                                        </property>
                                      </object>
                                    </property>
                                  </object>
                                </property>
                                <property name="yAxis">
                                  <object>
                                    <property name="title">
                                      <object>
                                        <property name="text">
                                          <string>Temperature (°c)</string>
                                        </property>
                                      </object>
                                    </property>
                                  </object>
                                </property>
                                <property name="tooltip">
                                  <object>
                                    <property name="enabled">
                                      <boolean>true</boolean>
                                    </property>
                                  </object>
                                </property>
                                <property name="plotOptions">
                                  <object>
                                    <property name="spline">
                                      <object>
                                        <property name="dataLabels">
                                          <object>
                                            <property name="enabled">
                                              <boolean>true</boolean>
                                            </property>
                                          </object>
                                        </property>
                                        <property name="enableMouseTracking">
                                          <boolean>false</boolean>
                                        </property>
                                      </object>
                                    </property>
                                  </object>
                                </property>
                                <property name="series">
                                  <array>
                                    <object>
                                      <property name="name">
                                        <string>Average High Temp (°c)</string>
                                      </property>
                                      <property name="color">
                                        <string>#FF8533</string>
                                      </property>
                                      <property name="data">
                                        <array>
                                          <number value="6"/>
                                          <number value="8"/>
                                          <number value="11"/>
                                          <number value="14"/>
                                          <number value="19"/>
                                          <number value="21"/>
                                          <number value="23"/>
                                          <number value="23"/>
                                          <number value="19"/>
                                          <number value="15"/>
                                          <number value="9"/>
                                          <number value="6"/>
                                        </array>
                                      </property>
                                    </object>
                                    <object>
                                      <property name="name">
                                        <string>Average Low Temp (°c)</string>
                                      </property>
                                      <property name="color">
                                        <string>#4572A7</string>
                                      </property>
                                      <property name="data">
                                        <array>
                                          <number value="2"/>
                                          <number value="2"/>
                                          <number value="4"/>
                                          <number value="6"/>
                                          <number value="10"/>
                                          <number value="12"/>
                                          <number value="14"/>
                                          <number value="14"/>
                                          <number value="11"/>
                                          <number value="8"/>
                                          <number value="5"/>
                                          <number value="2"/>
                                        </array>
                                      </property>
                                    </object>
                                  </array>
                                </property>
                              </object>
                            </arguments>
                          </new>
                        </right>
                      </assign>
                    </body>
                  </funcexpr>
                </arguments>
              </functioncall>
            </body>
          </funcexpr>
        </arguments>
      </functioncall>
    </program>

So, to get all the data:
    In [28]: from bs4 import BeautifulSoup

    In [29]: import requests

    In [30]: import re

    In [31]: import js2xml

    In [32]: from itertools import repeat

    In [33]: from pprint import pprint as pp

    In [34]: soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser")

    In [35]: script = soup.find("script", text=re.compile("Highcharts.Chart")).text

    In [36]: parsed = js2xml.parse(script)

    In [37]: data = [d.xpath(".//array/number/@value") for d in parsed.xpath("//property[@name='data']")]

    In [38]: categories = parsed.xpath("//property[@name='categories']//string/text()")

    In [39]: output = list(zip(repeat(categories), data))

    In [40]: pp(output)
    [(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'],
      ['6', '8', '11', '14', '19', '21', '23', '23', '19', '15', '9', '6']),
     (['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'],
      ['2', '2', '4', '6', '10', '12', '14', '14', '11', '8', '5', '2'])]

Like I said, you could just use a regex, but I find js2xml more reliable, since things like stray whitespace won't break it.
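The values above come back as strings, and each series is paired with the full month list rather than labelled by name. As a rough follow-up sketch (not part of the original approach), you could turn them into one month-to-number dict per series. The `categories` and `data` literals below are copied from the output above so the snippet runs without a network call. The names `names` and `table` are made up for illustration, and the series names are taken from the `name` properties in the xml tree. It also assumes every value is a whole number, as it is here.

```python
# Values copied from the session output above; in a real run these would
# come from the xpath calls on the parsed js2xml tree.
categories = ['January', 'February', 'March', 'April', 'May', 'June',
              'July', 'August', 'September', 'October', 'November', 'December']
data = [
    ['6', '8', '11', '14', '19', '21', '23', '23', '19', '15', '9', '6'],
    ['2', '2', '4', '6', '10', '12', '14', '14', '11', '8', '5', '2'],
]

# Series names as they appear in the xml tree (hypothetical variable name).
names = ['Average High Temp (°c)', 'Average Low Temp (°c)']

# Build {series name: {month: value}}; int() assumes whole-number values,
# swap in float() if a page reports decimals.
table = {
    name: dict(zip(categories, (int(v) for v in values)))
    for name, values in zip(names, data)
}

print(table['Average High Temp (°c)']['July'])
print(table['Average Low Temp (°c)']['January'])
```

This relies on the series names and the data arrays appearing in the same order in the tree, which holds for the output shown above but is worth checking on other pages.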



