The tidytext package is one of the more popular natural language processing packages in R's ecosystem. It follows the conventions and syntax of the "tidyverse."

You may prefer to use tidytext for a couple of reasons. First, tidytext has its own philosophy and syntax for handling text, particularly at the early stages of analysis; you may be more familiar or comfortable with this approach. Second, tidytext arguably offers more flexibility in the options for creating DTMs or TCMs. This early stage is critical to successful topic modeling.
See Text Mining with R: A Tidy Approach for more details about tidytext.
What follows is a short script combining tidytext with textmineR. Initial data curation and DTM creation are done with tidytext. Topic modeling is done with textmineR, and the outputs are re-formatted in the style of tidytext's "tidiers" for other topic models.
################################################################################
# Example: Using tidytext with textmineR
################################################################################
library(tidytext)
library(textmineR)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:igraph':
#>
#> as_data_frame, groups, union
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
#>
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:igraph':
#>
#> crossing
#> The following objects are masked from 'package:Matrix':
#>
#> expand, pack, unpack
# load documents in a data frame
docs <- textmineR::nih_sample
# tokenize using tidytext's unnest_tokens
tidy_docs <- docs %>%
  select(APPLICATION_ID, ABSTRACT_TEXT) %>%
  unnest_tokens(output = word,
                input = ABSTRACT_TEXT,
                stopwords = c(stopwords::stopwords("en"),
                              stopwords::stopwords(source = "smart")),
                token = "ngrams",
                n_min = 1, n = 2) %>%
  count(APPLICATION_ID, word) %>%
  filter(n > 1) # filtering for words/bigrams per document, rather than per corpus
tidy_docs <- tidy_docs %>% # filter out words that are just numbers
  filter(! stringr::str_detect(word, "^[0-9]+$"))
# turn a tidy tbl into a sparse dgCMatrix for use in textmineR
d <- tidy_docs %>%
  cast_sparse(APPLICATION_ID, word, n)
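The cast above should yield one row per APPLICATION_ID and one column per retained token. A quick shape check (an optional addition, not part of the original script) can catch an empty or misshapen matrix before modeling:
# optional sanity check: confirm the DTM's class and dimensions before modeling
class(d) # should be a sparse "dgCMatrix"
dim(d)   # rows = documents, columns = unique tokens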
# create a topic model
m <- FitLdaModel(dtm = d,
                 k = 20,
                 iterations = 200,
                 burnin = 175)
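Before reshaping the outputs, it can help to glance at the fitted topics directly. textmineR's GetTopTerms() returns the highest-probability terms for each topic; the lines below are an optional addition to the script.
# optional: pull the top 5 terms for each topic straight from the phi matrix
top_terms <- GetTopTerms(phi = m$phi, M = 5)
top_terms[, 1:5] # peek at the first five topics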
# below is equivalent to tidy_beta <- tidy(x = m, matrix = "beta")
tidy_beta <- data.frame(topic = as.integer(stringr::str_replace_all(rownames(m$phi), "t_", "")),
                        m$phi,
                        stringsAsFactors = FALSE) %>%
  gather(term, beta, -topic) %>%
  tibble::as_tibble()
# below is equivalent to tidy_gamma <- tidy(x = m, matrix = "gamma")
tidy_gamma <- data.frame(document = rownames(m$theta),
                         m$theta,
                         stringsAsFactors = FALSE) %>%
  gather(topic, gamma, -document) %>%
  tibble::as_tibble()
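From here, tidy_beta and tidy_gamma can be used like the tidied output of any other topic model. As an illustration (not part of the original script, and assuming a recent dplyr with slice_max()), the top terms per topic can be pulled as follows:
# example use of the tidy output: top 10 terms for each topic
top_beta <- tidy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))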