Contents
Crawling approach
Python code
Database code
A problem found later:
Solution:
Making the word cloud
Crawling approach
Python code
import time

import MySQLdb
import requests  # HTTP requests
from fake_useragent import UserAgent
from lxml import etree

# Connect to MySQL (replace these settings with your own)
dish = MySQLdb.connect(
    host='localhost',
    user='root',
    passwd='123456',
    db='xiachufang'
)
cur = dish.cursor()

# Text file that later feeds the word cloud
f = open("dish.text", mode="w", encoding="utf-8")


def insert(dishname, materials, dishurl):
    # Parameterized insert of one recipe row
    sql = 'insert into dish(dishname, materials, dishurl) values(%s, %s, %s)'
    params = (dishname, materials, dishurl)
    cur.execute(sql, params)


ua = UserAgent()
headers = {
    'User-Agent': ua.random
}
url1 = 'https://www.xiachufang.com/explore?page={}'
# Absolute XPath of the num-th recipe entry on a page
item = '/html/body/div[4]/div/div/div[1]/div[1]/div/div[2]/div[1]/ul/li[{}]'
str1 = item + '/div/div/p[1]/a/text()'  # dish name
str2 = item + '/div/div/p[2]/a/text()'  # ingredients
str3 = item + '/div/a/@href'            # relative URL of the detailed steps

for index in range(1, 11):  # 10 pages; page numbers are assumed to start at 1
    resp = requests.get(url1.format(index), headers=headers)
    # print(resp.text)
    html1 = etree.HTML(resp.text)
    time.sleep(1.2)  # pause between pages so the site is not hammered
    for num in range(1, 26):
        dishnames = html1.xpath(str1.format(num))
        if not dishnames:  # no such entry on this page
            continue
        dishname1 = dishnames[0].strip()
        print('Dish name:')
        print(dishname1)
        materials = html1.xpath(str2.format(num))
        materials1 = ' '.join(materials)
        print('Ingredients:')
        print(materials1)
        print('URL of the detailed cooking steps:')
        step_url = html1.xpath(str3.format(num))
        dishurl1 = ('https://www.xiachufang.com' + step_url[0]) if step_url else ''
        print(dishurl1)
        print('---------------------')
        f.write(dishname1 + ' ' + materials1 + ' ')
        insert(dishname1, materials1, dishurl1)
    dish.commit()  # commit once per page

f.close()
cur.close()
dish.close()
Database code
CREATE DATABASE IF NOT EXISTS xiachufang;
USE xiachufang;
CREATE TABLE IF NOT EXISTS dish
(
    dishid INT AUTO_INCREMENT,
    dishname VARCHAR(300),
    materials VARCHAR(100),
    dishurl VARCHAR(100),
    PRIMARY KEY(dishid)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
ALTER TABLE dish CONVERT TO CHARACTER SET utf8mb4;
Note: replace the database connection settings with your own.
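One way to avoid editing the script on every machine is to read the connection settings from environment variables. The following is a minimal sketch under that assumption; the variable names (`DISH_DB_HOST` and so on) and the helper `load_db_settings` are hypothetical, and the defaults simply mirror the values used in the Python script.

```python
import os


def load_db_settings(env=None):
    """Build keyword arguments for MySQLdb.connect from environment variables.

    The DISH_DB_* names are made up for this example; the defaults mirror
    the values hard-coded in the crawler script.
    """
    env = os.environ if env is None else env
    return {
        'host': env.get('DISH_DB_HOST', 'localhost'),
        'user': env.get('DISH_DB_USER', 'root'),
        'passwd': env.get('DISH_DB_PASSWD', '123456'),
        'db': env.get('DISH_DB_NAME', 'xiachufang'),
    }


if __name__ == '__main__':
    settings = load_db_settings()
    # Print everything except the password
    print({k: v for k, v in settings.items() if k != 'passwd'})
```

With this in place, the connection could be opened as `MySQLdb.connect(**load_db_settings())`.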
A problem found later:
Running the script later produced this error:
Incorrect string value: '\xF0\x9F\x94\xA5\xE5\x8F...' for column 'dish
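Investigation showed that the text scraped from the site contains emoji. The data is saved to the database, but the table uses MySQL's utf8 character set, which stores at most three bytes per character, while an emoji takes four bytes in UTF-8. That mismatch causes the error.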
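To check this diagnosis, the bytes quoted in the error message can be decoded in Python. The short sketch below uses only the standard library; the helper name `utf8_len` is made up for illustration.

```python
def utf8_len(ch):
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode('utf-8'))


# The first four bytes quoted in the error message
emoji = b'\xF0\x9F\x94\xA5'.decode('utf-8')

if __name__ == '__main__':
    print(emoji, utf8_len(emoji))  # the emoji needs 4 bytes
    print('菜', utf8_len('菜'))     # a common Chinese character needs 3 bytes
    print('a', utf8_len('a'))      # an ASCII letter needs 1 byte
```

The four bytes decode to a single fire emoji (U+1F525), which is exactly the kind of character a three-byte character set cannot hold.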
Solution:
(See this expert's post) "Thoroughly solving java.sql.SQLException: Incorrect string value: '\xF0\x9F\x92\x94' for column 'name' at row 1", from the blog 小达哥的垃圾桶 on CSDN.
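If changing the database character set is not an option, an alternative workaround (not the approach of the linked post) is to drop the characters that need four UTF-8 bytes before inserting. The sketch below is only an illustration, and the function name `strip_4byte_chars` is hypothetical. When the table already uses utf8mb4, as after the ALTER TABLE statement in the database code, passing `charset='utf8mb4'` to `MySQLdb.connect` keeps the emoji instead of discarding them.

```python
def strip_4byte_chars(text):
    """Remove characters outside the Basic Multilingual Plane.

    Code points above U+FFFF take four bytes in UTF-8, which MySQL's
    three-byte utf8 character set cannot store.
    """
    return ''.join(ch for ch in text if ord(ch) <= 0xFFFF)


if __name__ == '__main__':
    # A dish name that starts with a fire emoji
    print(strip_4byte_chars('\U0001F525红烧肉'))
```

In the crawler this would be applied to the dish name and the ingredients just before calling `insert`.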
Making the word cloud
Code first:
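import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud

if __name__ == '__main__':
    # Open the text file
    with open('dish.text', 'r', encoding='utf-8') as f:
        # Unlike an English word cloud, a Chinese one needs spaces,
        # newlines, and similar characters removed first
        text = f.read().replace(' ', '').replace('\n', '').strip()
    # Segment the text into words with jieba
    words = jieba.cut(text)
    # Join with spaces so that WordCloud can tell the words apart
    text = ' '.join(words)
    # The template image that gives the cloud its shape
    mask = np.array(Image.open('1.png'))
    # mask takes the image mask
    # font_path is the font; without it the word cloud cannot render
    # Chinese characters and shows blanks instead
    # background_color is the background color; width, height, and other
    # options can also be set as needed
    wordcloud = WordCloud(
        mask=mask,
        font_path='HYNanGongTiJ-2.ttf',
        background_color='white'
    ).generate(text)
    # Save the result
    wordcloud.to_file('test.jpg')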
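WordCloud sizes each word by how often it occurs in the text it is given. As a rough illustration only (this is not WordCloud's actual implementation), the counting step can be sketched with the standard library; the sample string and the function name `word_frequencies` are made up for this example.

```python
from collections import Counter


def word_frequencies(text):
    """Count whitespace-separated words, most frequent first."""
    return Counter(text.split()).most_common()


if __name__ == '__main__':
    # Made-up sample: ingredient names separated by spaces
    sample = '鸡蛋 番茄 鸡蛋 面粉 鸡蛋 番茄'
    print(word_frequencies(sample))
```

Printing the most frequent ingredients this way is also a quick sanity check on `dish.text` before rendering the image.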
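Friendly reminder:
Remember to install the required libraries first, otherwise the code will fail.
Installing with pip directly from the terminal can be slow, so using a mirror inside China is recommended: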
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package



