The goal of wordpiece is to allow for easy text tokenization using a wordpiece vocabulary.
You can install the released version of wordpiece from CRAN with:
install.packages("wordpiece")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("macmillancontentscience/wordpiece")
This package can be used to tokenize text for modeling. A common use case is to tokenize every text entry in a column of a data.frame or tibble.
library(wordpiece)
library(dplyr, warn.conflicts = FALSE)
df_tokenized <- tibble(
  text = c(
    "I like tacos.",
    "I like apples with cheese.",
    "The unaffable coder wrote incorrect examples."
  )
) %>%
  mutate(
    tokens = wordpiece_tokenize(text)
  )
df_tokenized
#> # A tibble: 3 x 2
#>   text                                          tokens
#>   <chr>                                         <list>
#> 1 I like tacos.                                 <dbl [5]>
#> 2 I like apples with cheese.                    <dbl [6]>
#> 3 The unaffable coder wrote incorrect examples. <dbl [10]>
df_tokenized$tokens[[1]]
#>     i  like    ta ##cos     .
#>  1045  2066 11937 13186  1012
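The output above splits "tacos" into the pieces "ta" and "##cos". As a rough illustration of how such a split can arise, here is a minimal sketch of the greedy longest-match-first strategy commonly used with wordpiece vocabularies. It is not the package's implementation: `toy_wordpiece()` and `toy_vocab` are hypothetical names made up for this example, the vocabulary is a tiny stand-in for a real one, and the sketch handles only a single lowercase word using base R.

```r
# Hypothetical helper: greedy longest-match-first split of one lowercase word.
# `vocab` is a toy character vector, not the vocabulary shipped with wordpiece.
toy_wordpiece <- function(word, vocab, unk = "[UNK]") {
  pieces <- character(0)
  start <- 1L
  n <- nchar(word)
  while (start <= n) {
    end <- n
    found <- NA_character_
    # Try the longest remaining substring first, then shrink from the right.
    while (end >= start) {
      candidate <- substr(word, start, end)
      # Pieces that continue a word carry the "##" prefix.
      if (start > 1L) candidate <- paste0("##", candidate)
      if (candidate %in% vocab) {
        found <- candidate
        break
      }
      end <- end - 1L
    }
    # If nothing matches at this position, the whole word becomes unknown.
    if (is.na(found)) return(unk)
    pieces <- c(pieces, found)
    start <- end + 1L
  }
  pieces
}

# Toy vocabulary (an assumption for illustration only).
toy_vocab <- c("i", "like", "ta", "##cos", ".")
toy_wordpiece("tacos", toy_vocab)
```

With this toy vocabulary, the call returns the two pieces "ta" and "##cos", mirroring the split shown in the package output. A word with no matching pieces, such as "xyz", comes back as the unknown token. The real `wordpiece_tokenize()` additionally handles whole strings, punctuation, casing, and the mapping from pieces to integer ids.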
Please note that the wordpiece project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
This is not an officially supported Macmillan Learning product.
Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jonthegeek@gmail.com).