栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 软件开发 > 后端开发 > Python

R绘制文章标题词云

Python 更新时间: 发布时间: IT归档 最新发布 模块sitemap 名妆网 法律咨询 聚返吧 英语巴士网 伯小乐 网商动力

R绘制文章标题词云

数据和问题

见:把tcga大计划的CNS级别文章标题画一个词云

词云绘制

关于词云绘制较详细的步骤可以参考,之前的博文 https://blog.csdn.net/shy_321/article/details/120567111

2018的TCGA的泛癌项目论文 获取数据

文章的标题都在该网页里,
(https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html )
可以选择使用复制粘贴的方式(数据不是很多) 或者, 直接解析该网页html的内容,提取标题。
这里采用网页解析的方式。

install.packages("rvest") ###rvest 是R中的一个爬虫包

library(rvest)

### 读取网页, 本来 read_html是可以直接输入url的
### 但是,可能网络原因,或者其他原因把,并不能成功读取
### 所以手动在浏览器上打开 https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html ,
### 然后保存网页为tcga_cancer.html, 在读取
html <- read_html("tcga_cancer.html")

#### 这两行,是为了得到标题内容
sections <- html %>% html_nodes("section")
titles <- sections[2:4]  %>%
    html_nodes("ul > li.journal + li > a")  %>% html_text()

titles    
---------------------------------
> titles
 [1] "Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer"                                            
 [2] "Machine Learning Identifies Stemness Features Associated with oncogenic Dedifferentiation"                                                         
 [3] "A Comprehensive Pan-Cancer Molecular Study of Gynecologic and Breast Cancers"                                                                      
 [4] "Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas"                                                                                
 [5] "Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas"                                                             
 [6] "The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma"                                                          
 [7] "Perspective on oncogenic Processes at the End of the Beginning of Cancer Genomics"                                                                 
 [8] "Pathogenic Germline Variants in 10,389 Adult Cancers"                               
 ....................

可能不熟悉的爬虫同学可能会有些迷惑。学爬虫之前,可能需要先学下html。感兴趣的同学可以去另外学习下(网上挺多资料教程的)。

library(tm)
library(wordcloud2)

txt_source <- VectorSource(titles)
docs <- tm::Corpus(txt_source)

docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stemdocument)
docs <- tm_map(docs, removeWords, stopwords("english"))


dtm <- TermdocumentMatrix(docs, list(tolower = F))
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)


wordcloud2(d, size = 0.4,shape = 'star')

2020的Nature及其子刊的22篇全基因组的泛癌分析
html <- read_html("https://www.nature.com/collections/afdejfafdb/")

titles <- html %>% 
	html_nodes("articletitle") %>% 
	html_text()


txt_source <- VectorSource(titles)

docs <- tm::Corpus(txt_source)


docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stemdocument)
docs <- tm_map(docs, removeWords, stopwords("english"))


dtm <- TermdocumentMatrix(docs, list(tolower = F))
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)


library(wordcloud2)

wordcloud2(d, size = 0.4,shape = 'star')

参考

https://jtr13.github.io/cc19/web-scraping-using-rvest.html

转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/295794.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号