
Python Web Scraping: Study Notes

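Scraping capital-flow data from Eastmoney (东方财富网)

Fetching the data

Figure 1

We only need the table data inside the red box.
Press F12 to open the browser's developer tools.
Figure 2

Then press Ctrl+R to reload the page so that its network requests are captured.
Figure 3

This file contains the data we want.
Click Headers.
Figure 4

Copy the request URL: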
http://push2.eastmoney.com/api/qt/clist/get?cb=jQuery112304931261550818704_1635070090942&fid=f62&po=1&pz=50&pn=1&np=1&fltt=2&invt=2&ut=b2884a393a59ad64002292a3e90d46a5&fs=m%3A0%2Bt%3A6%2Bf%3A!2%2Cm%3A0%2Bt%3A13%2Bf%3A!2%2Cm%3A0%2Bt%3A80%2Bf%3A!2%2Cm%3A1%2Bt%3A2%2Bf%3A!2%2Cm%3A1%2Bt%3A23%2Bf%3A!2%2Cm%3A0%2Bt%3A7%2Bf%3A!2%2Cm%3A1%2Bt%3A3%2Bf%3A!2&fields=f12%2Cf14%2Cf2%2Cf3%2Cf62%2Cf184%2Cf66%2Cf69%2Cf72%2Cf75%2Cf78%2Cf81%2Cf84%2Cf87%2Cf204%2Cf205%2Cf124%2Cf1%2Cf13
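The copied URL is easier to read once its query string is decoded. The following is a minimal sketch using only the standard library. What each parameter means is not documented in this post, so the comments only note what the values suggest; the variable names `params` and `fields` are chosen here for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# The URL copied from the Headers tab, split across lines for readability.
url = (
    "http://push2.eastmoney.com/api/qt/clist/get"
    "?cb=jQuery112304931261550818704_1635070090942"
    "&fid=f62&po=1&pz=50&pn=1&np=1&fltt=2&invt=2"
    "&ut=b2884a393a59ad64002292a3e90d46a5"
    "&fs=m%3A0%2Bt%3A6%2Bf%3A!2%2Cm%3A0%2Bt%3A13%2Bf%3A!2%2Cm%3A0%2Bt%3A80%2Bf%3A!2"
    "%2Cm%3A1%2Bt%3A2%2Bf%3A!2%2Cm%3A1%2Bt%3A23%2Bf%3A!2%2Cm%3A0%2Bt%3A7%2Bf%3A!2"
    "%2Cm%3A1%2Bt%3A3%2Bf%3A!2"
    "&fields=f12%2Cf14%2Cf2%2Cf3%2Cf62%2Cf184%2Cf66%2Cf69%2Cf72%2Cf75%2Cf78%2Cf81"
    "%2Cf84%2Cf87%2Cf204%2Cf205%2Cf124%2Cf1%2Cf13"
)

# parse_qs returns a list per key; every key occurs once here, so keep the first value.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

# The percent-encoded "fields" value is a comma-separated list of column keys.
fields = params["fields"].split(",")

for key, value in params.items():
    print(f"{key} = {value}")
print(len(fields), "fields requested")
```

Seen this way, `pz=50` lines up with the 50 rows processed later in these notes, and `fields` lists the `f...` keys that the regular expressions search for.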
In Python, the requests module can fetch this URL with get:

import requests
url = 'http://push2.eastmoney.com/api/qt/clist/get?cb=jQuery112301936958418195105_1634991380727&fid=f62&po=1&pz=50&pn=1&np=1&fltt=2&invt=2&ut=b2884a393a59ad64002292a3e90d46a5&fs=m%3A0%2Bt%3A6%2Bf%3A!2%2Cm%3A0%2Bt%3A13%2Bf%3A!2%2Cm%3A0%2Bt%3A80%2Bf%3A!2%2Cm%3A1%2Bt%3A2%2Bf%3A!2%2Cm%3A1%2Bt%3A23%2Bf%3A!2%2Cm%3A0%2Bt%3A7%2Bf%3A!2%2Cm%3A1%2Bt%3A3%2Bf%3A!2&fields=f12%2Cf14%2Cf2%2Cf3%2Cf62%2Cf184%2Cf66%2Cf69%2Cf72%2Cf75%2Cf78%2Cf81%2Cf84%2Cf87%2Cf204%2Cf205%2Cf124%2Cf1%2Cf13'
# .text returns the response body decoded to a str; timeout avoids hanging forever
r = requests.get(url, timeout=10).text
print(r)

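Figure 5

r contains the data we need; .text returns the response body as a string.
The last line in Figure 4 gives the variable names (f12, f14, ...) for the columns of the table on the original page.
r holds the corresponding values.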
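Because the URL carries a `cb=jQuery...` callback parameter, the response text appears to be JSONP: a JSON object wrapped in a function call. As an alternative to pattern matching, the wrapper can be stripped and the rest decoded with the `json` module. This is only a sketch under assumptions: `parse_jsonp` is a hypothetical helper, the sample string is made up, and the `data` / `diff` nesting is assumed rather than taken from this post, so check it against the real output in Figure 5.

```python
import json
import re


def parse_jsonp(text):
    """Strip a wrapper such as callback({...}); and decode the JSON inside it."""
    match = re.match(r"^[^(]*\((.*)\)\s*;?\s*$", text, re.S)
    if match is None:
        raise ValueError("text does not look like a JSONP response")
    return json.loads(match.group(1))


# Hypothetical, heavily shortened response; the real one has more keys and rows.
sample = (
    'jQuery1123_1({"data":{"diff":['
    '{"f12":"000001","f14":"Sample","f2":10.5,"f3":-1.23}]}});'
)

payload = parse_jsonp(sample)
rows = payload["data"]["diff"]  # assumed location of the row list
print(rows[0]["f12"], rows[0]["f3"])
```

With the rows available as dictionaries, numeric fields arrive as numbers already and negative values need no special handling.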

Data cleaning

Extract each column of data.
This requires the re module.

import requests
import re
url = 'http://push2.eastmoney.com/api/qt/clist/get?cb=jQuery112301936958418195105_1634991380727&fid=f62&po=1&pz=50&pn=1&np=1&fltt=2&invt=2&ut=b2884a393a59ad64002292a3e90d46a5&fs=m%3A0%2Bt%3A6%2Bf%3A!2%2Cm%3A0%2Bt%3A13%2Bf%3A!2%2Cm%3A0%2Bt%3A80%2Bf%3A!2%2Cm%3A1%2Bt%3A2%2Bf%3A!2%2Cm%3A1%2Bt%3A23%2Bf%3A!2%2Cm%3A0%2Bt%3A7%2Bf%3A!2%2Cm%3A1%2Bt%3A3%2Bf%3A!2&fields=f12%2Cf14%2Cf2%2Cf3%2Cf62%2Cf184%2Cf66%2Cf69%2Cf72%2Cf75%2Cf78%2Cf81%2Cf84%2Cf87%2Cf204%2Cf205%2Cf124%2Cf1%2Cf13'
r = requests.get(url).text
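# Raw strings keep the backslashes of \d and \. intact; each match looks like '"key":value'.
f12=re.findall(r'"f12":"\d*\.{0,1}\d*"',r)
# Match everything up to the closing quote so names of any length are kept whole.
f14=re.findall(r'"f14":"[^"]*"',r)
f2=re.findall(r'"f2":\d*\.{0,1}\d*',r)
# A leading minus is allowed because changes and net flows can be negative.
f3=re.findall(r'"f3":-{0,1}\d*\.{0,1}\d*',r)
f62=re.findall(r'"f62":-{0,1}\d*\.{0,1}\d*',r)
f184=re.findall(r'"f184":-{0,1}\d*\.{0,1}\d*',r)
f66=re.findall(r'"f66":-{0,1}\d*\.{0,1}\d*',r)
f69=re.findall(r'"f69":-{0,1}\d*\.{0,1}\d*',r)
f72=re.findall(r'"f72":-{0,1}\d*\.{0,1}\d*',r)
f75=re.findall(r'"f75":-{0,1}\d*\.{0,1}\d*',r)
f78=re.findall(r'"f78":-{0,1}\d*\.{0,1}\d*',r)
f81=re.findall(r'"f81":-{0,1}\d*\.{0,1}\d*',r)
f84=re.findall(r'"f84":-{0,1}\d*\.{0,1}\d*',r)
f87=re.findall(r'"f87":-{0,1}\d*\.{0,1}\d*',r)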

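The result looks like this:
Figure 6

Every column is handled in much the same way: each one is matched with a regular expression.
A reference for learning regular expressions:
https://www.cnblogs.com/magicking/p/8986869.html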
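As a small illustration of the column matching described above, the sketch below runs on a made-up fragment shaped like the response text. It uses capture groups, a variation on the patterns in these notes rather than the author's exact approach, so that `re.findall` returns only the value and not the `"key":` prefix. The sample data and variable names are hypothetical.

```python
import re

# Hypothetical fragment in the same shape as the response text.
sample = (
    '{"f12":"000001","f14":"Sample A","f2":10.5,"f3":-1.23},'
    '{"f12":"600000","f14":"Sample B","f2":8,"f3":0.5}'
)

# With one capture group, findall returns just the captured part of each match.
codes = re.findall(r'"f12":"(\d+)"', sample)
names = re.findall(r'"f14":"([^"]*)"', sample)
# -? allows a negative sign; (?:\.\d+)? is an optional, non-capturing decimal part.
changes = [float(x) for x in re.findall(r'"f3":(-?\d+(?:\.\d+)?)', sample)]

print(codes)
print(names)
print(changes)
```

Keeping the opening quote of the key inside the pattern (`"f3":`) prevents an accidental match inside a longer key such as `"f13":`.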

Extract the values for each row and store each row as one element of a list.
This requires the pandas and re modules.

import re
import pandas as pd
info=[]
for i in range(len(f12)):
    # findall also returns empty matches, so the value sits at index 6 (code) or 5 (other fields)
    dm=re.findall(r"\d*",f12[i])[6]
    name=re.findall(r"\w*\s{0,1}\w*",f14[i])[5]
    zxj=float(re.findall(r"\d*\.{0,1}\d*", f2[i])[5])
    tdzdf=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f3[i])[5])
    zlje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f62[i])[5])
    zljzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f184[i])[5])
    sblrje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f66[i])[5])
    sblrjzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f69[i])[5])
    blrje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f72[i])[5])
    blrjzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f75[i])[5])
    mlrje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f78[i])[5])
    mlrjzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f81[i])[5])
    llrje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f84[i])[5])
    llrjzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f87[i])[5])
    # One single-row DataFrame per stock; blrje/blrjzb were computed but missing from the dict
    info.append(pd.DataFrame({'dm':dm,'name':name,'zxj':zxj,
                    'tdzdf':tdzdf,'zlje':zlje,'zljzb':zljzb,
                    'sblrje':sblrje,'sblrjzb':sblrjzb,
                    'blrje':blrje,'blrjzb':blrjzb,
                    'mlrje':mlrje,'mlrjzb':mlrjzb,
                    'llrje':llrje,'llrjzb':llrjzb},index=[i]))
sj=pd.concat(info)

The last line concatenates the rows into a single table.
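The fixed indexes `[5]` and `[6]` in the loop above can look arbitrary. They work because a pattern made entirely of optional parts also matches the empty string, so `re.findall` returns an empty match at every position where no number starts. The sketch below shows this on two made-up strings of the same shape as the matched fields (behaviour as in Python 3.7 and later); the variable names are chosen here for illustration.

```python
import re

number = r"\d*\.{0,1}\d*"

# '"f3":10.01' -> empty matches at the quotes and letters, '3' from the key, then the value.
parts = re.findall(number, '"f3":10.01')
print(parts)
print(parts[5])

# The stock code is wrapped in quotes, which adds one more empty match before it.
code_parts = re.findall(r"\d*", '"f12":"300750"')
print(code_parts)
print(code_parts[6])

# The position does not depend on how many digits the key has.
other = re.findall(number, '"f184":5.2')
print(other[5])
```

The index only holds while the text before the value keeps this exact shape, which is one reason a capture group or a JSON parser is less fragile.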

Data conversion

Data held in Python variables is lost when the program exits, so export it to an Excel file for storage.
Reference: https://www.cnblogs.com/wtmb/p/13501463.html

# Raw string so that the backslashes in the Windows path are kept literally
sj.to_excel(r'D:\myfile\sj.xlsx',sheet_name="p1",index=False)

Figure 7

Open the result:
Figure 8
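Windows paths are easy to get wrong in string literals, because a backslash starts an escape sequence. The sketch below assumes the intended output file is `D:\myfile\sj.xlsx` (the folder name is an assumption) and shows two ways to write it safely, using only the standard library. `PureWindowsPath` only manipulates the text of the path, so this runs on any operating system.

```python
from pathlib import PureWindowsPath

# A raw string keeps backslashes as they are; a normal string needs them doubled.
raw = r"D:\myfile\sj.xlsx"
escaped = "D:\\myfile\\sj.xlsx"
print(raw == escaped)

# In a normal string "\t" is a tab, so a path like this one is silently corrupted.
print("\t" in "D:\table.xlsx")

# Building the path from parts avoids typing separators by hand.
folder = PureWindowsPath(r"D:\myfile")  # assumed folder
target = folder / "sj.xlsx"
print(target)
```

The resulting `str(target)` can be passed to `to_excel` or `read_excel` in place of a hand-written path.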

Read the Excel data, load it into MySQL, and print it.
This requires the pymysql module.

import pymysql

Read the data
Reference: https://www.cnblogs.com/lj821022/p/8232764.html

data=pd.read_excel(r'D:\myfile\sj.xlsx')

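Connect to the database and write the data to a table (passing a connection URL string to to_sql requires SQLAlchemy to be installed)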

data.to_sql(name='nt',con='mysql+pymysql://root:123456@localhost:3306/mysql?charset=utf8',if_exists='replace',index=False)

Connect to the database, create a cursor, and print the rows
Reference: https://blog.csdn.net/kongsuhongbaby/article/details/84948205

db = pymysql.connect(host='localhost',user='root',password='123456',database='mysql',port=3306)
cursor = db.cursor()
cursor.execute('''select * from nt''')
results = cursor.fetchall()
for row in results:
    print(row)

Close the cursor and disconnect

cursor.close()
db.close()
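The connect, cursor, execute, fetchall, close sequence above follows Python's DB-API, so it can be tried without a MySQL server. The sketch below swaps in the standard library's `sqlite3` with an in-memory database purely as a stand-in. The table name `nt` matches the notes, but the three columns and the sample rows are made up, and `contextlib.closing` is used here so the cursor and connection are closed even if a query fails.

```python
import sqlite3
from contextlib import closing

# sqlite3 stands in for pymysql here; both follow the same DB-API call sequence.
with closing(sqlite3.connect(":memory:")) as db:
    with closing(db.cursor()) as cursor:
        # Hypothetical, shortened version of the nt table.
        cursor.execute("create table nt (dm text, name text, zxj real)")
        cursor.executemany(
            "insert into nt values (?, ?, ?)",
            [("000001", "Sample A", 10.5), ("600000", "Sample B", 8.0)],
        )
        db.commit()

        cursor.execute("select * from nt")
        results = cursor.fetchall()

for row in results:
    print(row)
```

Each row comes back as a tuple, which is the same shape the pymysql loop prints.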

Complete code

import requests
import pandas as pd
import re
import pymysql
url = 'http://push2.eastmoney.com/api/qt/clist/get?cb=jQuery112301936958418195105_1634991380727&fid=f62&po=1&pz=50&pn=1&np=1&fltt=2&invt=2&ut=b2884a393a59ad64002292a3e90d46a5&fs=m%3A0%2Bt%3A6%2Bf%3A!2%2Cm%3A0%2Bt%3A13%2Bf%3A!2%2Cm%3A0%2Bt%3A80%2Bf%3A!2%2Cm%3A1%2Bt%3A2%2Bf%3A!2%2Cm%3A1%2Bt%3A23%2Bf%3A!2%2Cm%3A0%2Bt%3A7%2Bf%3A!2%2Cm%3A1%2Bt%3A3%2Bf%3A!2&fields=f12%2Cf14%2Cf2%2Cf3%2Cf62%2Cf184%2Cf66%2Cf69%2Cf72%2Cf75%2Cf78%2Cf81%2Cf84%2Cf87%2Cf204%2Cf205%2Cf124%2Cf1%2Cf13'
r = requests.get(url).text
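# Raw strings keep the backslashes of \d and \. intact; each match looks like '"key":value'.
f12=re.findall(r'"f12":"\d*\.{0,1}\d*"',r)
# Match everything up to the closing quote so names of any length are kept whole.
f14=re.findall(r'"f14":"[^"]*"',r)
f2=re.findall(r'"f2":\d*\.{0,1}\d*',r)
# A leading minus is allowed because changes and net flows can be negative.
f3=re.findall(r'"f3":-{0,1}\d*\.{0,1}\d*',r)
f62=re.findall(r'"f62":-{0,1}\d*\.{0,1}\d*',r)
f184=re.findall(r'"f184":-{0,1}\d*\.{0,1}\d*',r)
f66=re.findall(r'"f66":-{0,1}\d*\.{0,1}\d*',r)
f69=re.findall(r'"f69":-{0,1}\d*\.{0,1}\d*',r)
f72=re.findall(r'"f72":-{0,1}\d*\.{0,1}\d*',r)
f75=re.findall(r'"f75":-{0,1}\d*\.{0,1}\d*',r)
f78=re.findall(r'"f78":-{0,1}\d*\.{0,1}\d*',r)
f81=re.findall(r'"f81":-{0,1}\d*\.{0,1}\d*',r)
f84=re.findall(r'"f84":-{0,1}\d*\.{0,1}\d*',r)
f87=re.findall(r'"f87":-{0,1}\d*\.{0,1}\d*',r)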
info=[]
for i in range(len(f12)):
    # findall also returns empty matches, so the value sits at index 6 (code) or 5 (other fields)
    dm=re.findall(r"\d*",f12[i])[6]
    name=re.findall(r"\w*\s{0,1}\w*",f14[i])[5]
    zxj=float(re.findall(r"\d*\.{0,1}\d*", f2[i])[5])
    tdzdf=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f3[i])[5])
    zlje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f62[i])[5])
    zljzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f184[i])[5])
    sblrje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f66[i])[5])
    sblrjzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f69[i])[5])
    blrje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f72[i])[5])
    blrjzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f75[i])[5])
    mlrje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f78[i])[5])
    mlrjzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f81[i])[5])
    llrje=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f84[i])[5])
    llrjzb=float(re.findall(r"-{0,1}\d*\.{0,1}\d*",f87[i])[5])
    # One single-row DataFrame per stock; blrje/blrjzb were computed but missing from the dict
    info.append(pd.DataFrame({'dm':dm,'name':name,'zxj':zxj,
                    'tdzdf':tdzdf,'zlje':zlje,'zljzb':zljzb,
                    'sblrje':sblrje,'sblrjzb':sblrjzb,
                    'blrje':blrje,'blrjzb':blrjzb,
                    'mlrje':mlrje,'mlrjzb':mlrjzb,
                    'llrje':llrje,'llrjzb':llrjzb},index=[i]))
sj=pd.concat(info)
# Raw strings so that the backslashes in the Windows path are kept literally
sj.to_excel(r'D:\myfile\sj.xlsx',sheet_name="p1",index=False)
data=pd.read_excel(r'D:\myfile\sj.xlsx')
data.to_sql(name='nt',con='mysql+pymysql://root:123456@localhost:3306/mysql?charset=utf8',if_exists='replace',index=False)
db = pymysql.connect(host='localhost',user='root',password='123456',database='mysql',port=3306)
cursor = db.cursor()
cursor.execute('''select * from nt''')
results = cursor.fetchall()
for row in results:
    print(row)
cursor.close()
db.close()