815 our first wordcloud: building a wordcloud example with text processing

tutorial 6 part D&E
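Part D gives us some .json files named by date, stored in a folder called GuardianText. These files can reportedly be downloaded from a link the school provides, but an account may be needed. International students should all have access; whether everyone can download them, I don't know.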

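Part D asks us to read these files in:

- Get the location of the GuardianText folder on disk
- Read every file in that folder and store them in a single dataframe
- Print 'Files read in successfully' once reading succeeds
- Look at the dataframe's column names
- Convert the 'webPublicationDate' column to datetime format
- Rename the column
- Print the content of the first article
- Create a new dataframe that stores only the article content
- Print the first five articles
- Keep only the three columns 'text', 'sectionName', 'Date'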

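We are then asked to write a 'cleaning' function that can selectively perform one or more preprocessing tasks: digit removal, punctuation removal, lowercasing, stop word removal, and so on.
Apply this function to the 'text' column of the dataframe, then print the first 5 cleaned articles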

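Part E: create a wordcloud, i.e. show the vocabulary visually
We need to:

- (Download and) import the wordcloud package
- Put the whole corpus into one string, then preprocess it
- Draw the wordcloud with the matplotlib package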

Now let's go through the code
First, the packages:

# -*- coding: utf-8 -*-
"""
Created on Fri Mar  4 18:25:58 2022

@author: Pamplemousse
"""


import json
import requests
from os.path import join, exists
from datetime import date, timedelta
import time
from pathlib import Path
import os
import matplotlib.pyplot as plt
import pandas as pd

import nltk
import itertools
from nltk import word_tokenize
import string
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from wordcloud import WordCloud, STOPWORDS

Many of these packages are never actually used, but the teacher imported them, so I just copied them over

Next is the cleaning function

def clean_text(doc,  # the document text
               rm_punctuation = True,  # remove punctuation by default
               rm_digits = True,  # remove digits by default
               lemmatize = False,  # no lemmatization by default
               norm_case = True,  # lowercase by default
               stem = False,  # no stemming by default
               rm_stopwords = True):  # remove stop words by default
    # Lemmatizing differs from stemming in that it is dictionary-based, so it returns real words
    doc = str(doc)  # make sure we work on a string whichever options are chosen
    if rm_digits:  # remove digits (when the flag is set)
        table = str.maketrans({key: None for key in string.digits})
        doc = doc.translate(table)
    if norm_case:  # convert to lowercase
        doc = doc.lower()
    if rm_punctuation:  # remove punctuation
        table = str.maketrans({key: None for key in string.punctuation})
        doc = doc.translate(table)
    if rm_stopwords:  # remove stop words
        words = " ".join([i for i in doc.split() if i not in default_stopwords])
    else:  # without stop word removal, just split into words, so the text is always split
        words = " ".join(doc.split())
    if lemmatize:  # lemmatize each word
        words = " ".join(lemma.lemmatize(word) for word in words.split())
    if stem:  # stem each word
        words = " ".join(porter_stemmer.stem(word) for word in words.split())
    return words  # return the cleaned words as one string
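
The digit and punctuation steps in clean_text both rely on a translation table that maps unwanted characters to nothing. Here is a minimal standalone sketch of that idea, using only the standard library. The sample sentence is made up, and the three-argument form of str.maketrans is just a shorter equivalent of the dict comprehension used above.

```python
import string

# Build one translation table that deletes ASCII digits and punctuation.
# The third argument of str.maketrans lists the characters to drop.
drop_table = str.maketrans("", "", string.digits + string.punctuation)

sample = "In 2016, voters queued for 3 hours!"  # made-up sentence
cleaned = sample.translate(drop_table).lower()

# Splitting and re-joining collapses the double spaces left behind.
cleaned = " ".join(cleaned.split())
print(cleaned)
```

Building a single table for both character classes saves one pass over the text, but keeping them separate, as clean_text does, is what lets each step be switched on or off.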

Below is the code for the file-reading requirements:

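# In[1]:

# Get the location of the GuardianText folder on disk
dataDirectory = os.path.realpath('Desktop/King/815/Tutorial6/Tutorial5/GuardianText')
dateColName = 'Date'
textColumnName = 'text'
cleanTextColName = 'cleanedText'
frames = []  # one dataframe per .json file

# Read every file in that folder
for textFilePath in os.listdir(dataDirectory):
    if textFilePath.endswith(".json"):
        # Add the file currently being read to the list
        frames.append(pd.read_json(os.path.join(dataDirectory, textFilePath)))
# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
df = pd.concat(frames)
# Reading succeeded, print 'Files read in successfully'
print('Files read in successfully')
# Look at the df column names
print(df.columns)
# Output 1, see the text below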

# In[2]:

# Convert the 'webPublicationDate' column to datetime format
df['webPublicationDate'] = pd.to_datetime(df['webPublicationDate'])

# Rename the 'webPublicationDate' column to 'Date'
df = df.rename(columns = {'webPublicationDate': dateColName})
# dateColName = 'Date'

# Print the content of the first article
print(df['fields'].iloc[0]['bodyText'])
# Output 2, see the text below

# Store only the article content
df[textColumnName] = df['fields'].apply(lambda x: x['bodyText'])
df = df.drop('fields', axis = 1)

# Print the first five articles
print(df['text'].head())
# Output 3, see the text below

# Keep only the three columns 'text', 'sectionName', 'Date'
df = df[['text','sectionName','Date']]

df
# Output 4, see the text below
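
One thing to keep in mind about the reading loop in cell In[1]: os.listdir returns names in no guaranteed order, so which article counts as "the first" can vary between machines. The following is a rough sketch of a deterministic alternative using only the standard library. The helper name load_articles is hypothetical, and it assumes each .json file holds a JSON list of article objects, which I have not verified against the real files.

```python
import json
from pathlib import Path


def load_articles(folder):
    """Return the records from every .json file in `folder`, in filename order."""
    records = []
    # sorted() makes the order deterministic; os.listdir gives no such guarantee.
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumption: each file is a JSON list of article objects.
        records.extend(data)
    return records
```

If that assumption holds, pd.DataFrame(load_articles(dataDirectory)) should give a frame with the same columns, numbered 0 to 13 instead of restarting at 0 for each file.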

The result of Output 1 is:
Index(['webUrl', 'sectionName', 'webTitle', 'fields', 'sectionId', 'apiUrl',
       'isHosted', 'webPublicationDate', 'type', 'id'],
      dtype='object')
As you can see, each article originally comes with the 10 fields above

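The result of Output 2 is:

This is the article content stored in the first .json file in our GuardianText folder

The result of Output 3 is:

The result of Output 4 is:

Our GuardianText folder contains 7 .json files, and each .json file holds the information for two different articles, so there are 14 articles in total. We kept only three columns, so df is now a 14 × 3 table.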

Next we use clean_text to clean df and print the first 5 cleaned articles

# Define the stop words first (needs a one-time nltk.download('stopwords'))
default_stopwords = nltk.corpus.stopwords.words('english')

lemma = WordNetLemmatizer()  # the lemmatizer
porter_stemmer = PorterStemmer()  # the stemmer

# Add a new column to df named 'cleanedText' (cleanTextColName was assigned earlier)
df[cleanTextColName] = df[textColumnName].apply(lambda x: clean_text(x, stem = False, lemmatize = False))
# Removes punctuation and digits, lowercases, removes stop words; no lemmatizing or stemming

# Print the first five articles
print(df[cleanTextColName].iloc[:5])

This gives the following output
0 torrential rain flash floods south weather apo…
1 voters areas uk claimed turned away polling bo…
0 “that’s think declare glastonbury independent …
1 we’re closing live blog tonight here’s summary…
0 it’s scots want review constitutional arrangem…
Name: cleanedText, dtype: object

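You can see that the double quotes were not treated as punctuation, while the full stop after "tonight" was removed; the single quotes were not removed either. The reason is that the articles use typographic (curly) quotes such as “ ” and ’, while string.punctuation only contains ASCII punctuation characters.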
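
If you do want those quotes gone, one option is to extend the translation table with the typographic characters. This is only a sketch: the extra_punct list is my own illustrative guess, not an exhaustive set, and the sample sentence is made up.

```python
import string

# Typographic characters like those in the output above; this list is an
# illustrative assumption, not a complete inventory.
extra_punct = "\u2018\u2019\u201c\u201d\u2026"  # ‘ ’ “ ” …

# The curly apostrophe is not part of the ASCII-only string.punctuation.
print("\u2019" in string.punctuation)

# One table that deletes both ASCII and typographic punctuation.
table = str.maketrans("", "", string.punctuation + extra_punct)

sample = "\u201cthat\u2019s what we think,\u201d he said."  # made-up sentence
print(sample.translate(table))
```

An alternative would be to map the curly quotes to their straight equivalents before cleaning, so that the existing punctuation step handles them.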

That completes all the tasks in part D.

Below are the tasks for part E

# Get the number of articles
num_docs = len(df)
# Get the length (in words) of each article
count = df['cleanedText'].str.split().str.len()
# P.S. the original used text, but since the data is already cleaned, why not use the cleaned version

# Add a column to df named 'count', holding the length of each article
df['count'] = count
# Initialise comment_words
comment_words = ' '
stopwords = set(STOPWORDS)

for val in df.cleanedText:  # loop over every article
    # For consistency, the cleaned data is used here as well

    val = str(val)
    # Split the article into words
    tokens = val.split()

    # No lowercasing here, because it was already done during cleaning

    for words in tokens:
        # Append each word to comment_words
        comment_words = comment_words + words + ' '

# Configure the wordcloud
wordcloud = WordCloud(width = 800, height = 800, background_color = 'white',
                      stopwords = stopwords, min_font_size = 10).generate(comment_words)

# Draw the wordcloud
plt.figure(figsize = (8,8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()

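And then I got a wordcloud like this

P.S. The image is not exactly the same on every run, because WordCloud places the words randomly unless its random_state parameter is fixed, but the words that stand out most will still stand out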
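
To see why the same words dominate every time, it helps to look at raw word frequencies, which are what drive the font sizes. Below is a small sketch with collections.Counter. The cleaned_docs list is a made-up stand-in for df['cleanedText'], and the counts are only an approximation of what WordCloud uses, since by default it also counts two-word collocations.

```python
from collections import Counter

# Stand-in for df['cleanedText']; these strings are made up for illustration.
cleaned_docs = [
    "rain floods south weather rain",
    "voters polling vote voters",
    "rain vote result",
]

# Joining once avoids growing a string inside a nested loop.
corpus = " ".join(cleaned_docs)

# Count how often each word occurs across the whole corpus.
word_counts = Counter(corpus.split())
for word, n in word_counts.most_common(3):
    print(word, n)
```

The same " ".join call could replace the nested loop that builds comment_words, since the cleaned articles are already space-separated strings.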

That's it for today! See you next week
