library(petro.One)
library(tm)
library(tibble)
use_example(1)
p1 <- onepetro_page_to_dataframe("neural_network-s0000r1000.html")
p2 <- onepetro_page_to_dataframe("neural_network-s1000r1000.html")
p3 <- onepetro_page_to_dataframe("neural_network-s2000r1000.html")
p4 <- onepetro_page_to_dataframe("neural_network-s3000r1000.html")
nn_papers <- rbind(p1, p2, p3, p4)
nn_papers
## # A tibble: 3,788 x 6
## book_title paper_id dc_type authors year source
## <fct> <fct> <fct> <chr> <int> <fct>
## 1 Neural Networks An~ SEG-2002-~ conferen~ Russell, Brian, H~ 2002 SEG
## 2 Deconvolution Usin~ SEG-1996-~ conferen~ Essenreiter, Robe~ 1996 SEG
## 3 Neural Network Sta~ SEG-1992-~ conferen~ Schmidt, Jumndyr,~ 1992 SEG
## 4 Hydrocarbon Predic~ SEG-2000-~ conferen~ Xiangjun, Zhang, ~ 2000 SEG
## 5 Higher-Order Neura~ SPE-27905~ conferen~ Kumoluyi, A.O., I~ 1994 SPE
## 6 Implicit Approxima~ SPE-11430~ journal-~ Li, Dao-lun, Univ~ 2009 SPE
## 7 Multiple Attenuati~ SEG-2000-~ conferen~ Karrenbach, M., U~ 2000 SEG
## 8 Conductive fractur~ ARMA-95-0~ conferen~ Thomas, Andrew L.~ 1995 ARMA
## 9 Neural networks ap~ SEG-2017-~ conferen~ Canning, Anat, Pa~ 2017 SEG
## 10 Artificial Neural ~ SPE-17127~ conferen~ Lind, Yuliya B., ~ 2014 SPE
## # ... with 3,778 more rows
# vdocs is a VCorpus built from the paper titles (the construction step is not
# shown above); a typical build would look like:
#   vdocs <- VCorpus(VectorSource(nn_papers$book_title))
#   vdocs <- tm_map(vdocs, content_transformer(tolower))
#   vdocs <- tm_map(vdocs, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(vdocs)
tdm.matrix <- as.matrix(tdm)
tdm.rs <- sort(rowSums(tdm.matrix), decreasing=TRUE)
tdm.df <- data.frame(word = names(tdm.rs), freq = tdm.rs, stringsAsFactors = FALSE)
as_tibble(tdm.df)   # prevent long printing of the data frame
## # A tibble: 5,835 x 2
## word freq
## <chr> <dbl>
## 1 using 878
## 2 neural 642
## 3 reservoir 564
## 4 data 473
## 5 artificial 368
## 6 seismic 363
## 7 network 356
## 8 analysis 325
## 9 prediction 321
## 10 networks 295
## # ... with 5,825 more rows
There are 5835 words under analysis. We will focus our attention on the words whose frequency is greater than 50 occurrences.
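The frequency threshold can be applied directly to the word-frequency table before plotting. A minimal sketch, assuming `tdm.df` is the data frame built above (`tdm.df.50` is a name introduced here for illustration):

```r
# keep only the words that occur more than 50 times
tdm.df.50 <- subset(tdm.df, freq > 50)
nrow(tdm.df.50)   # number of words passing the threshold
head(tdm.df.50)
```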
library(wordcloud)
set.seed(1234)
wordcloud(words = tdm.df$word, freq = tdm.df$freq, min.freq = 50,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Note that the word cloud contains words of common use such as using, use, new, approach and case. These words are not technical enough to tell us where the papers we are analyzing are focusing. In the next example, we will build our own set of custom stopwords to keep these words out of the analysis.
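As a preview, removing custom stopwords amounts to one more `tm_map()` pass over the corpus before rebuilding the term-document matrix. A minimal sketch, assuming `vdocs` is the corpus used earlier; the word list here is only illustrative:

```r
library(tm)

# hypothetical custom stopword list; extend as needed
my_stopwords <- c("using", "use", "new", "approach", "case")

# drop the custom stopwords from the corpus, then rebuild the matrix
vdocs <- tm_map(vdocs, removeWords, my_stopwords)
tdm   <- TermDocumentMatrix(vdocs)
```

After this step the frequency table and word cloud can be regenerated exactly as before, now without the common-use words.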