Make a word cloud

Alfonso R. Reyes

2019-01-10

Load 3788 papers metadata

library(petro.One)
library(tm)
library(tibble)

use_example(1)

p1 <- onepetro_page_to_dataframe("neural_network-s0000r1000.html")
p2 <- onepetro_page_to_dataframe("neural_network-s1000r1000.html")
p3 <- onepetro_page_to_dataframe("neural_network-s2000r1000.html")
p4 <- onepetro_page_to_dataframe("neural_network-s3000r1000.html")

nn_papers <- rbind(p1, p2, p3, p4)
nn_papers
## # A tibble: 3,788 x 6
##    book_title          paper_id   dc_type   authors             year source
##    <fct>               <fct>      <fct>     <chr>              <int> <fct> 
##  1 Neural Networks An~ SEG-2002-~ conferen~ Russell, Brian, H~  2002 SEG   
##  2 Deconvolution Usin~ SEG-1996-~ conferen~ Essenreiter, Robe~  1996 SEG   
##  3 Neural Network Sta~ SEG-1992-~ conferen~ Schmidt, Jumndyr,~  1992 SEG   
##  4 Hydrocarbon Predic~ SEG-2000-~ conferen~ Xiangjun, Zhang, ~  2000 SEG   
##  5 Higher-Order Neura~ SPE-27905~ conferen~ Kumoluyi, A.O., I~  1994 SPE   
##  6 Implicit Approxima~ SPE-11430~ journal-~ Li, Dao-lun, Univ~  2009 SPE   
##  7 Multiple Attenuati~ SEG-2000-~ conferen~ Karrenbach, M., U~  2000 SEG   
##  8 Conductive fractur~ ARMA-95-0~ conferen~ Thomas, Andrew L.~  1995 ARMA  
##  9 Neural networks ap~ SEG-2017-~ conferen~ Canning, Anat, Pa~  2017 SEG   
## 10 Artificial Neural ~ SPE-17127~ conferen~ Lind, Yuliya B., ~  2014 SPE   
## # ... with 3,778 more rows
get_papers_count("neural_network-s0000r1000.html")
## [1] 3788
get_papers_count("neural_network-s1000r1000.html")
## [1] 3788
get_papers_count("neural_network-s3000r1000.html")
## [1] 3788
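Note that get_papers_count() returns the same total (3788) for every page file, which suggests it reports the overall number of search results rather than the rows on a single page; the total matches the 3,788 rows in nn_papers. Since the page files follow a regular naming pattern, the four reads above could also be written as a loop. A minimal sketch, assuming the same file names produced by use_example(1):

pages <- sprintf("neural_network-s%d000r1000.html", 0:3)   # s0000, s1000, s2000, s3000
nn_papers <- do.call(rbind, lapply(pages, onepetro_page_to_dataframe))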

Convert and clean the documents for text mining

vdocs <- VCorpus(VectorSource(nn_papers$book_title))
vdocs <- tm_map(vdocs, content_transformer(tolower))      # to lowercase
vdocs <- tm_map(vdocs, removeWords, stopwords("english")) # remove stopwords
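Depending on how noisy the titles are, tm offers further standard transformations. A sketch of optional extra cleaning steps (not applied in the run shown here, so the frequencies below do not reflect them):

vdocs <- tm_map(vdocs, removePunctuation)   # drop punctuation
vdocs <- tm_map(vdocs, removeNumbers)       # drop digits
vdocs <- tm_map(vdocs, stripWhitespace)     # collapse multiple spaces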

Summary table with word frequencies

tdm <- TermDocumentMatrix(vdocs)

tdm.matrix <- as.matrix(tdm)
tdm.rs <- sort(rowSums(tdm.matrix), decreasing=TRUE)
tdm.df <- data.frame(word = names(tdm.rs), freq = tdm.rs, stringsAsFactors = FALSE)
as_tibble(tdm.df)                          # prevent long printing of dataframe
## # A tibble: 5,835 x 2
##    word        freq
##    <chr>      <dbl>
##  1 using        878
##  2 neural       642
##  3 reservoir    564
##  4 data         473
##  5 artificial   368
##  6 seismic      363
##  7 network      356
##  8 analysis     325
##  9 prediction   321
## 10 networks     295
## # ... with 5,825 more rows

There are 5,835 words under analysis. We will focus our attention on those words that occur at least 50 times.
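To list those high-frequency terms directly from the term-document matrix, tm provides findFreqTerms(); a quick check against the tdm built above:

findFreqTerms(tdm, lowfreq = 50)            # terms occurring at least 50 times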

Word cloud with words that occur at least 50 times

library(wordcloud)

set.seed(1234)
wordcloud(words = tdm.df$word, freq = tdm.df$freq, min.freq = 50,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

Note that the word cloud contains words of common use, such as using, use, new, approach and case. These words are not technical enough to reveal where the papers under analysis are focusing. In the next example, we will build our own custom stopword list to prevent these words from showing.
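As a preview, such custom stopwords can be removed with the same removeWords transformer used earlier. A minimal sketch, where the word list is only illustrative:

my_stopwords <- c("using", "use", "new", "approach", "case")   # illustrative list
vdocs <- tm_map(vdocs, removeWords, my_stopwords)              # drop custom stopwords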