Tokenizers break text into pieces that are more usable by machine learning models. While writing wordpiece and morphemepiece, we found that many steps were shared between those packages. This package provides those shared steps.
You can install the released version of piecemaker from CRAN with:
install.packages("piecemaker")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("macmillancontentscience/piecemaker")
{piecemaker} helps to prepare text for tokenization. For example, it can help you clean out strange encoding, whitespace, and special characters.
library(piecemaker)
piece1 <- " This is a \n\nfa\xE7ile\n\n example.\n"
# Specify encoding so this example behaves the same on all systems.
Encoding(piece1) <- "latin1"
example_text <- paste(
  piece1,
  "It has the bell character, \a, and the replacement character,",
  intToUtf8(65533)
)
prepare_text(example_text)
#> [1] "This is a facile example . It has the bell character , , and the replacement character ,"
prepare_text(example_text, squish_whitespace = FALSE)
#> [1] " This is a facile example . It has the bell character , , and the replacement character , "
prepare_text(example_text, remove_control_characters = FALSE)
#> [1] "This is a facile example . It has the bell character , \a , and the replacement character ,"
prepare_text(example_text, remove_replacement_characters = FALSE)
#> [1] "This is a facile example . It has the bell character , , and the replacement character , �"
prepare_text(example_text, remove_diacritics = FALSE)
#> [1] "This is a façile example . It has the bell character , , and the replacement character ,"
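To give a rough sense of what two of the cleaning steps above involve, here is a minimal base-R sketch of whitespace squishing and control-character removal. It is an illustration only, not piecemaker's implementation: the function name `clean_sketch` is hypothetical, and unlike `prepare_text()` it does not handle diacritics, replacement characters, or spacing around punctuation.

```r
# Hypothetical helper for illustration; not part of piecemaker.
clean_sketch <- function(x) {
  # Turn every whitespace character (including \n and \t) into a plain space,
  # so that removing control characters below does not glue words together.
  x <- gsub("[[:space:]]", " ", x)
  # Drop any remaining control characters, such as the bell character \a.
  x <- gsub("[[:cntrl:]]", "", x)
  # Collapse runs of spaces to a single space, then trim both ends.
  x <- gsub(" +", " ", x)
  trimws(x)
}

clean_sketch(" a \n\n b\a \n")
#> [1] "a b"
```

The order matters in this sketch: whitespace is normalized before control characters are removed, because newlines and tabs are themselves control characters and would otherwise be deleted outright rather than replaced with a space.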
Please note that the piecemaker project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
This is not an officially supported Macmillan Learning product.
Questions or comments should be directed to Jonathan Bratt (jonathan.bratt@macmillan.com) and Jon Harmon (jonthegeek@gmail.com).