- Removed the `tokenize_tweets()` function, which is no longer supported.
- Added the `tokenize_ptb()` function for Penn Treebank tokenizations (@jrnold) (#12).
- Added `chunk_text()` to split long documents into pieces (#30).
- The new `tokenize_tweets()` function preserves usernames, hashtags, and URLs (@kbenoit) (#44).
- The `stopwords()` function has been removed in favor of using the stopwords package (#46).
- The package now follows the basic recommendations of the tif package (#49).
- `tokenize_skip_ngrams()` has been improved to generate unigrams and bigrams, according to the skip definition (#24).
- Clarified the kinds of input that tokenizers supports (@ironholds) (#26).
- `tokenize_skip_ngrams()` now supports stopwords (#31).
- Tokenizers now handle `NA` values consistently (#33).
- `tokenize_words()` gains arguments to preserve or strip punctuation and numbers (#48).
- Fixed `tokenize_skip_ngrams()` and `tokenize_ngrams()` to return properly marked UTF-8 strings on Windows (@patperry) (#58).
- `tokenize_tweets()` now removes stopwords prior to stripping punctuation, making its behavior more consistent with `tokenize_words()` (#76).
- Added the `tokenize_character_shingles()` tokenizer.
- Improvements to `tokenize_words()` and `tokenize_word_stems()`.

Usage sketches for several of the functions above follow.
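A minimal sketch of the Penn Treebank tokenizer mentioned above; only the function name comes from the changelog, and the output shown is approximate.

```r
library(tokenizers)

# Penn Treebank conventions split contractions and keep punctuation
# as separate tokens, unlike the default word tokenizer.
tokenize_ptb("They aren't coming.")
#> [[1]]
#> [1] "They"   "are"    "n't"    "coming" "."
```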
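`chunk_text()` splits a long document into smaller pieces. A sketch, assuming a `chunk_size` argument measured in words (the argument name is taken from the package documentation, not the changelog):

```r
library(tokenizers)

# One long "document" built by repetition, then split into ~100-word pieces.
doc <- paste(rep("the quick brown fox jumps over the lazy dog", 40),
             collapse = " ")
chunks <- chunk_text(doc, chunk_size = 100)
length(chunks)  # number of pieces: 4 here (360 words, up to 100 per chunk)
```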
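Before its removal, `tokenize_tweets()` kept Twitter-specific tokens intact. A sketch that assumes a version of tokenizers that still ships the function; the exact output is approximate:

```r
library(tokenizers)

# Usernames, hashtags, and URLs survive tokenization instead of being
# split on punctuation (only on pre-removal versions of the package).
tokenize_tweets("Try #rstats with @kbenoit: https://ropensci.org")
#> [[1]]
#> [1] "try"  "#rstats"  "with"  "@kbenoit"  "https://ropensci.org"
```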
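A sketch of the skip n-gram behavior described above: with `n_min = 1` the unigrams and bigrams implied by the skip definition are included, and a stopword list drops words first. The argument names (`n`, `n_min`, `k`, `stopwords`) are assumed from the package documentation.

```r
library(tokenizers)

# n = 2 with skip distance k = 1 yields adjacent and one-apart bigrams;
# n_min = 1 also emits the unigrams. "the" is removed as a stopword.
tokenize_skip_ngrams("the quick brown fox jumps",
                     n = 2, n_min = 1, k = 1, stopwords = "the")
```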
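The punctuation and number arguments gained by `tokenize_words()`, together with the stopwords package that replaces the removed `stopwords()` function. `strip_punct` and `strip_numeric` are the argument names assumed from the package documentation:

```r
library(tokenizers)
library(stopwords)

# Keep punctuation as tokens, or drop numeric tokens entirely.
tokenize_words("R 4.2 works, right?", strip_punct = FALSE)
tokenize_words("R 4.2 works, right?", strip_numeric = TRUE)

# Stopword lists now come from the stopwords package.
tokenize_words("to be or not to be", stopwords = stopwords::stopwords("en"))
```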
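Character shingles are overlapping character n-grams. A sketch with the shingle width set explicitly, since the default is not stated in the changelog:

```r
library(tokenizers)

# Overlapping 3-character shingles of a single word.
tokenize_character_shingles("tokenize", n = 3)
#> [[1]]
#> [1] "tok" "oke" "ken" "eni" "niz" "ize"
```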