The goal of wordpiece.data is to provide stable, versioned data for use in the {wordpiece} tokenizer package.
You can install the released version of wordpiece.data from CRAN with:
install.packages("wordpiece.data")
And the development version from GitHub with:
# install.packages("remotes")
remotes::install_github("macmillancontentscience/wordpiece.data")
The datasets included in this package were retrieved from Hugging Face (specifically, the bert-base-cased and bert-base-uncased vocabularies). They were then processed using the {wordpiece} package. This is a bit circular, because this package is a dependency of the {wordpiece} package.
# Cased vocabulary: download the raw vocab file to a temporary location.
vocab_txt <- tempfile(fileext = ".txt")
download.file(
  url = "https://huggingface.co/bert-base-cased/resolve/main/vocab.txt",
  destfile = vocab_txt
)
parsed_vocab <- wordpiece::load_vocab(vocab_txt)
# Name the file after the casing and the vocabulary length.
rds_filename <- paste0(
  paste(
    "wordpiece",
    "cased",
    length(parsed_vocab),
    sep = "_"
  ),
  ".rds"
)
saveRDS(parsed_vocab, here::here("inst", "rds", rds_filename))
# Remove the temporary download.
unlink(vocab_txt)
# Uncased vocabulary: download the raw vocab file to a temporary location.
vocab_txt <- tempfile(fileext = ".txt")
download.file(
  url = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt",
  destfile = vocab_txt
)
parsed_vocab <- wordpiece::load_vocab(vocab_txt)
# Name the file after the casing and the vocabulary length.
rds_filename <- paste0(
  paste(
    "wordpiece",
    "uncased",
    length(parsed_vocab),
    sep = "_"
  ),
  ".rds"
)
saveRDS(parsed_vocab, here::here("inst", "rds", rds_filename))
# Remove the temporary download.
unlink(vocab_txt)
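The cased and uncased steps above build the output file name the same way, differing only in the casing label. As a minimal sketch of that naming rule, the helper below reproduces it in base R. The function name make_rds_filename and the toy vocabulary are hypothetical and are not part of wordpiece.data or {wordpiece}; the toy vector only stands in for the result of wordpiece::load_vocab().

```r
# Hypothetical helper (not part of wordpiece.data): builds the .rds file name
# from a casing label and the length of a vocabulary vector.
make_rds_filename <- function(casing, vocab) {
  # Restrict the label to the two casings used in this README.
  casing <- match.arg(casing, c("cased", "uncased"))
  paste0(
    paste("wordpiece", casing, length(vocab), sep = "_"),
    ".rds"
  )
}

# Toy vocabulary standing in for the output of wordpiece::load_vocab().
toy_vocab <- c("[PAD]", "[UNK]", "hello")
make_rds_filename("cased", toy_vocab)
#> [1] "wordpiece_cased_3.rds"
```

Because the name includes the vocabulary length, a vocabulary of a different size is saved under a different file name.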
You likely won’t ever need to use this package directly. It contains a function to load data used by {wordpiece}.
library(wordpiece.data)
head(wordpiece_vocab())
#> [1] "[PAD]" "[unused0]" "[unused1]" "[unused2]" "[unused3]" "[unused4]"
Please note that the wordpiece.data project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
This is not an officially supported Macmillan Learning product.
Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jonthegeek@gmail.com).