The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
Unsupervised text tokenizer allowing to perform byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece> which provides a language independent tokenizer to split text in words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Provides as well straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
Version: | 0.2.3 |
Depends: | R (≥ 2.10) |
Imports: | Rcpp (≥ 0.11.5), stats |
LinkingTo: | Rcpp |
Suggests: | tokenizers.bpe, word2vec (≥ 0.2.0) |
Published: | 2022-11-13 |
DOI: | 10.32614/CRAN.package.sentencepiece |
Author: | Jan Wijffels [aut, cre, cph] (R wrapper), BNOSAC [cph] (R wrapper), Google Inc. [ctb, cph] (Files at src/sentencepiece/src (Apache License, Version 2.0), The Abseil Authors [ctb, cph] (Files at src/third_party/absl (Apache License, Version 2.0), Google Inc. [ctb, cph] (Files at src/third_party/protobuf-lite (BSD-3 License)), Kenton Varda (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h, google/protobuf/stubs/common.h, google/protobuf/stubs/hash.h, google/protobuf/stubs/once.h, google/protobuf/stubs/once.h.org (BSD-3 License)), Sanjay Ghemawat (Google Inc.) [ctb, cph] (Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License)), Jeff Dean (Google Inc.) [ctb, cph] (Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License)), Laszlo Csomor (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: io_win32.cc, google/protobuf/stubs/io_win32.h (BSD-3 License)), Wink Saville (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: message_lite.cc, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h (BSD-3 License)), Jim Meehan (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: structurally_valid.cc (BSD-3 License)), Chris Atenasio (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/wire_format_lite.h (BSD-3 License)), Jason Hsueh (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/io/coded_stream_inl.h (BSD-3 License)), Anton Carver (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/stubs/map_util.h (BSD-3 License)), Maxim Lifantsev (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/stubs/mathlimits.h (BSD-3 License)), Susumu Yata [ctb, cph] (Files at src/third_party/darts_clone (BSD-3 License), Daisuke Okanohara [ctb, cph] (File src/third_party/esaxx/esa.hxx (MIT License)), Yuta Mori [ctb, cph] (File src/third_party/esaxx/sais.hxx (MIT License)), Benjamin Heinzerling [ctb, cph] (Files data/models/nl.wiki.bpe.vs1000.d25.w2v.txt, data/models/nl.wiki.bpe.vs1000.d25.w2v.bin and data/models/nl.wiki.bpe.vs1000.model (MIT License)) |
Maintainer: | Jan Wijffels <jwijffels at bnosac.be> |
License: | MPL-2.0 |
URL: | https://github.com/bnosac/sentencepiece |
NeedsCompilation: | yes |
Materials: | README NEWS |
In views: | NaturalLanguageProcessing |
CRAN checks: | sentencepiece results |
Reference manual: | sentencepiece.pdf |
Package source: | sentencepiece_0.2.3.tar.gz |
Windows binaries: | r-devel: sentencepiece_0.2.3.zip, r-release: sentencepiece_0.2.3.zip, r-oldrel: sentencepiece_0.2.3.zip |
macOS binaries: | r-release (arm64): sentencepiece_0.2.3.tgz, r-oldrel (arm64): sentencepiece_0.2.3.tgz, r-release (x86_64): sentencepiece_0.2.3.tgz, r-oldrel (x86_64): sentencepiece_0.2.3.tgz |
Old sources: | sentencepiece archive |
Reverse suggests: | textrecipes |
Please use the canonical form https://CRAN.R-project.org/package=sentencepiece to link to this page.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.