split() sets type of subcorpus to NA, causing an error if another split is performed. Fixed.
split() threw a misleading error message if s_attribute did not exist. The error message is now telling #242.
split() was not implemented if s_attribute was a child. Done #243.
size() for corpus objects for scenario of nested s-attributes addressed #231.
enrich() for subcorpus_bundle objects (returning partition_bundle now) #224.
subset() implemented for subcorpus_bundle objects #234.
setAs()-method from slice to “AnnotatedPlainTextDocument” that would prevent using GERMAPARLMINI as sample data.
decode() can return ‘AnnotatedPlainTextDocument’ from NLP package.
as(x, "AnnotatedPlainTextDocument") not available any more.
decode() has new argument “stoplist” to drop terms from ‘AnnotatedPlainTextDocument’. Unused for other return values.
get_template(), examples added.
show()-method for corpus objects indicates whether a template is available.
as.markdown() that would prevent fulltext display for non-parliamentary-protocol documents.
tooltips() has new argument fmt to provide flexibility to assign tooltips based on corpus positions.
href() to add hypertext references to fulltext output.
read() has new argument annotation to get values for arguments highlight, tooltips and href from a subcorpus object.
format() method used internally to produce output does not drop s-attributes ending on “_id” any more #253.
progress is FALSE for the hits() method for character class objects, as a matter of consistency #252.
split()
for corpus objects.
decode() method for subcorpus objects is now able to process nested corpora. Performance gain for all scenarios.
as.TermDocumentMatrix() for bundle objects speeds up instantiation of simple_triplet_matrix.
s_attributes() for bundle objects is implemented much more efficiently.
get_token_stream() and ngrams() have new argument vocab to pass in alternative dictionary. Envisaged usage is to efficiently use pruned vocabulary for decoding the token stream.
ngrams() for list objects. Serves as worker for ngrams()-method for partition_bundle objects.
corpus().
get_token_stream() for numeric input has new argument registry to optionally specify registry directory.
count() for subcorpus objects did not pass value of argument verbose to cpos(), resulting in potentially unwanted verbosity. Fixed.
subcorpus using subset()-method kept strucs for nested attributes but assigned ancestor s-attribute to slot “s_attribute_strucs”, resulting in false counts, for example. Fixed.
split()-method for subcorpus objects was not implemented correctly for descendent attributes without values, so that getting subcorpora with sentences in a subcorpus would have a wrong result. Fixed.
values of method split() for corpus objects did not process value FALSE to split corpus by s-attribute without values #263. Fixed.
s_attributes()-method for context objects. Returns s-attribute values for the matches for query in context object.
hits()
has new argument decode. If FALSE, the strucs are not decoded.
s_attributes()-method for expression assigns types of vectors matched against as names if possible.
subset for corpus and subset objects will use integer struc values for subsetting, if integer values are passed in logical expression.
enrich()-method for partition_bundle objects #225.
as.TermDocumentMatrix() for partition_bundle and bundle, to improve performance.
partition_bundle()-method for partition objects (more efficient instantiation of S4 objects).
split()-method for subcorpus objects.
store() and mail() have finally been removed from the package.
sample() method for bundle objects (and objects inheriting from the bundle class) did not yet use the new convention to use single square brackets (not double brackets) for extracting a subset from the bundle. Fixed #236.
ngrams() method for partition_bundle objects, introducing more efficient data handling, vectorization and parallelization.
get_token_stream() for partition_bundle failed if all docs have equal length (mapply() issue).
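The "mapply() issue" noted above presumably relates to how base R's mapply() simplifies its return value when all results have the same length. The following is a minimal base-R sketch of that behaviour, not the polmineR code; the helper first_n() and the token vectors are made up for illustration.

```r
# Hypothetical helper: take the first n tokens of a document.
first_n <- function(tokens, n) tokens[seq_len(n)]

# Made-up token vectors standing in for two documents.
docs <- list(a = c("x", "y", "z"), b = c("u", "v", "w"))

# Results of unequal length: mapply() cannot simplify and returns a list.
uneven <- mapply(first_n, docs, c(2, 3))

# Results of equal length: the default SIMPLIFY = TRUE collapses them to a matrix.
even <- mapply(first_n, docs, c(2, 2))

# SIMPLIFY = FALSE keeps one list element per document in both cases.
safe <- mapply(first_n, docs, c(2, 2), SIMPLIFY = FALSE)
```

Code that expects one list element per document therefore needs SIMPLIFY = FALSE (or a list-returning alternative such as Map()) to behave the same way regardless of document lengths.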
Fixed.
as.DocumentTermMatrix() significantly improved for handling large corpora.
$-method for corpus is now used for accessing corpus properties, replacing previous usage to inspect s-attributes.
partition_bundle()-method for context class objects has improved verbosity now and telling progress messages.
capitalize() for uppercasing first letter of elements in a character vector.
trim()-method for classes DocumentTermMatrix and TermDocumentMatrix has been updated. Arguments termsToKeep and docsToDrop have been deprecated, argument termsToDrop is deprecated and replaced by terms_to_drop and docsToKeep is deprecated and replaced by docs_to_keep. New arguments min_count and min_doc_length are introduced to drop rare terms and short documents, respectively. The purpose of redesigning the trim()-method is to make it more useful for preparing matrices for topic modelling.
subset()
for corpus and subcorpus objects will now process indication of s-attribute without value, so that subsetting corpora for s-attributes without values is now possible.
split() for subcorpus objects will now also work if s_attribute for splitting is not a sibling of the s-attribute the subcorpus is based on.
as.speeches() for subcorpus objects refactored to work with nested scenario.
s_attributes() will return NA if s-attribute does not have values #234.
hits()-method for partition_bundle objects passes argument p_attribute to cpos() #239.
use() returns TRUE, if loading corpus in package was successful, or FALSE if not. Previously, the function aborted with an error, or returned NULL.
subset() would lose specific subcorpus class (such as “plpr_subcorpus”). Fixed.
html() for subcorpus reconstructs meta equivalent to read() for subcorpus objects.
corpus class throughout is an opportunity to keep the corpus ID together with the registry directory of a corpus. And as we are able now to handle corpora defined in different registry files, the temporary registry directory is not necessary any more. It still exists, yet only for temporary corpora and corpora that are described by registry files that cannot be modified, i.e. corpora shipped in packages. The test corpus of the polmineR package is an important respective scenario.
get_token_stream()
now has an argument min_length.
registry_*() functions are superseded by RcppCWB::corpus_* functions and throw a warning that they are deprecated.
use(pkg = "RcppCWB", corpus = "REUTERS") to make the REUTERS corpus available.
size() works for partition/subcorpus with s-attribute that is a child of the s-attribute the object is based on #216.
trim()-method for context objects has a new argument fn for supplying a (trimming) function to be applied to all match contexts.
s_attribute_date is stated explicitly in all examples.
size() has been refactored to work with nested corpora.
encoding() and replace method encoding<- are defined for call and quosure objects to get and adjust the encoding, replacing a previously unexported function .recode_call().
subset() methods for corpus and subcorpus objects now handle expressions for subsetting as quosures, laying the ground to program against subset(), see respective update of the examples, #212.
bundle objects with single square brackets is developed now. Indexing with double brackets, supplying multiple values for i is deprecated. The aim is a consistent behavior that a bundle indexed by [ will always return a bundle, and indexing with [[ always gets a single object from the list of objects.
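The indexing convention described above can be illustrated with a small stand-in class. This is only a sketch under assumptions: demo_bundle and its slot objects are hypothetical names, not the polmineR bundle class, and the methods merely mimic the stated rule that [ returns a bundle and [[ returns a single object.

```r
library(methods)

# Hypothetical stand-in for a bundle: an S4 class wrapping a list of objects.
setClass("demo_bundle", representation(objects = "list"))

# Single brackets: always return a bundle, even for a single index.
setMethod("[", "demo_bundle", function(x, i, j, ..., drop = TRUE) {
  new("demo_bundle", objects = x@objects[i])
})

# Double brackets: always return exactly one object from the list.
setMethod("[[", "demo_bundle", function(x, i, j, ...) {
  stopifnot(length(i) == 1L)
  x@objects[[i]]
})

b <- new("demo_bundle", objects = list(a = 1:3, b = 4:6, c = 7:9))
sub <- b[c("a", "c")]   # a demo_bundle holding two objects
one <- b[["b"]]         # the single object stored under "b"
```

Keeping the two operators apart this way means the return type depends only on the operator used, not on how many indices are supplied.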
#214
use() function now has an additional argument corpus to specify which corpus from a package shall be loaded (#138).
get_token_stream()-method for partition_bundle objects is more memory efficient (no exhaustion for big corpora) and faster.
split()-method for corpus objects.
split()-method for corpus objects offers progress bar.
as.speeches() for corpus objects has new argument subset, offering a significantly faster approach than the method for subcorpus objects in many cases.
size() method will return NA and issue a telling warning if the slots corpus and registry_dir of the corpus object are not filled #222.
get_token_stream() will return list of integer values if decode is TRUE (#213).
trim() on a context object using arguments positivelist or negativelist, the count slot as reported by length was not updated. Fixed. (#220)
enrich() method for context objects has a new argument stat for creating / updating the data.table in the slot stat.
subset() for subcorpus objects has been debugged to work with nested corpora.
polmineR.mdsub configures substitutions that are applied on markdown documents to prevent presence of characters that would be misinterpreted as formatting instructions. Fixes #166.
check_cqp_query()
now includes a hint that argument check can be used to omit checking the CQP syntax to prevent false positives. Addresses #171.
cooccurrences() (and context()) to process more than one p-attribute has been lost temporarily. Fixed. #208.
hits() method for partition objects #215.
trim() on a context object using arguments positivelist or negativelist, the count statistics reported in the stat slot were not updated. Fixed. (#220)
kwic object #218.
subset() would not work reliably with argument regex if more than one expression is passed #212. Fixed.
terms() did not work for subcorpus objects. Fixed. #209
as.speeches() on a subcorpus, the date may have been missing from the object names. Fixed. #219
minNchar in the noise() method would work exactly the way opposite to the way intended #211.
registry_dir of a cooccurrences_bundle derived from a partition_bundle was not filled, resulting in an error of the show()-method for the cooccurrences_bundle. Fixed #222.
cooccurrences() method now includes example code for creating a table using DT::datatable() with buttons for exporting tables (to Excel, for instance).
dispersion() method now accepts an argument fill, a logical value to explicitly control whether (#160) zero matches for a value of a structural attribute should be reported. The performance of adding columns (required only if two structural attributes are provided) is improved substantially by using the reference semantic of the data.table package. If many columns are added at once, a warning issued by the data.table package is supplemented by a further explanatory warning of the polmineR package. Filling up the data.table was limited previously to freq = FALSE, this limitation is lifted.
html()
method is implemented for remote_subcorpus objects.
hits() method is implemented for remote_corpus and remote_subcorpus class (#160).
ranges is introduced to manage ranges of corpus positions for query matches. This is a preparatory step to remove an inconsistency from the hits class that mixed two very different usages (getting ranges of corpus positions for matches and getting counts).
ranges serves as the constructor to prepare a ranges class object. In combination with as.data.table(), it replaces former functionality of hits() without argument s_attribute.
hits() method is altered, making it much more consistent than previously: The method will consistently return a hits object.
hits() has a new argument fill that will report zeros for combinations of s-attributes with no matches for a query.
subset for the subset method for remote_corpus objects can now be a call (#162), this is a basis for passing vectors to OpenCPU server.
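The idea behind the fill argument of hits() mentioned above (reporting zeros for combinations of s-attributes without matches) can be sketched in base R. This is an illustration only, not the package implementation; the table of counts and the column names party and year are made-up assumptions.

```r
# Made-up counts of query matches by two s-attributes (illustrative data only).
matches <- data.frame(
  party = c("A", "A", "B"),
  year  = c(2019, 2020, 2019),
  count = c(5L, 2L, 7L)
)

# Build every combination of the observed s-attribute values ...
grid <- expand.grid(
  party = unique(matches$party),
  year  = unique(matches$year),
  stringsAsFactors = FALSE
)

# ... and left-join the observed counts onto it.
filled <- merge(grid, matches, by = c("party", "year"), all.x = TRUE)

# Combinations without any match come back as NA; report them as zero.
filled$count[is.na(filled$count)] <- 0L
filled <- filled[order(filled$party, filled$year), ]
```

Without the fill step, the combination of party "B" and year 2020 would simply be absent from the table, which is easy to overlook when computing frequencies.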
p_attributes() implemented for remote_corpus and remote_partition.
regions() method (for corpus class objects to start with) returns a regions class object with a regions matrix (slot cpos) with regions for an s-attribute (#176).
get_token_stream()-method for regions and matrix objects will now accept a logical argument split. If TRUE, a list of character vectors is returned. The envisaged use case is a fast decoding of sentences (#176).
encoding() method has been defined if argument object is missing. Calling encoding() will return the session character set. If it cannot be determined using localeToCharset(), a UTF-8 session charset will be assumed.
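The fallback described above might look roughly like the following base-R sketch. session_charset() is a hypothetical helper written for illustration and is not the polmineR implementation; only utils::localeToCharset() is taken from the text.

```r
# Hypothetical helper (not the polmineR implementation): determine the
# session character set and fall back to UTF-8 if it cannot be detected.
session_charset <- function() {
  # localeToCharset() may return several candidates; use the first one.
  charset <- utils::localeToCharset()[1]
  if (is.na(charset)) {
    warning("Cannot determine session charset, assuming 'UTF-8'.")
    charset <- "UTF-8"
  }
  charset
}

cs <- suppressWarnings(session_charset())
```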
Internally, encoding() replaces a direct call of localeToCharset() to avoid errors that have occurred on GitHub Actions with Ubuntu 20.04 (#188).
localeToCharset() (NA return value), a startup message will issue a warning that ‘UTF-8’ is assumed (#188).
size() method is now able to handle nested s-attributes.
trim() method for context objects will now accept a matrix with ranges as positivelist argument.
highlight() method now accepts matrix objects as elements of the list of items to be highlighted. It is treated as a set of regions, such as resulting from cpos(). Thus it is possible to highlight matches for CQP queries.
context() method.
count()-method for partition_bundle objects failed with an opaque error message if there were no query matches at all. There is now a check for this scenario and the expected table is returned (zero values throughout).
corpus
class is now a superclass for the textstat class, starting to create a more coherent class structure in general. This is an important preparatory step to be able to keep all registry files in the temporary registry directory. To avoid a confusion in the class system resulting from the coerce method from partition to corpus objects, this coerce method (defined by setAs()) has been removed. The get_template()-method for partition objects using this coerce method has been removed - as it inherits the method anyway, it is not needed any more. See #201.
region) and to consider the changing value of an s-attribute as a boundary of a context (argument boundary). New menu “boundary” and radio buttons, conditional on presence of s-attributes “s” and/or “p”.
sAttribute or pAttribute (instead of s_attribute and p_attribute) are still used with dispersion() method, a warning is issued declaring that the argument is deprecated.
.onDetach() to .onUnload() (#164).
as.phrases() method (#172).
as.corpusEnc() auxiliary function will now check whether non-convertible characters lead to an NA result and issue a warning how this warning can be avoided (#151).
context() method for matrix objects if arguments left and right are named integer vectors. All context() benefit from the improved performance of this worker for creating contexts for query matches.
context object.
enrich() method for context objects will now perform an in-place operation when adding new s-attributes.
as.cqp() function includes arguments check and warn for running check_cqp_query() on queries.
context() method for matrix objects includes a new argument boundary and relies on a new function RcppCWB::region_matrix_context().
verbose of context()-methods is now FALSE.
as.corpusEnc() auxiliary function now includes a test whether input character vector includes unexpected encodings and issues a warning if this is the case.
cpos()
method will now check for accidental leading and/or trailing whitespace and remove it for token lookup. Note that hits(), count() and dispersion() will report queries without removing whitespace.
count()-method for partition_bundle objects will be much more efficient when many columns with zero matches need to be added. The implementation avoids a data.table warning when the bulk action of adding new columns exceeds the number of columns reserved by data.table objects.
trim() is removed (#197).
encoding() relies on l10n_info() before using localeToCharset() as a matter of performance and robustness (#196).
corpus has a new slot registry_dir. This is a preparatory step that will facilitate managing corpora described by registry files in different registry directories.
corpus() for corpus-class objects has an argument registry_dir that will be required to distinguish corpora described by registry files in different registry directories.
fs_path classes.
registry_get_home() and registry_get_encoding() have been replaced by RcppCWB functions cl_charset_name() and corpus_data_dir() with equivalent result, but faster due to immediate access to C representation of the corpus.
corpus() method will deduce the registry directory from the C representation of the corpus if possible.
as.markdown() has been removed, making fulltext display (using read() or html()) much faster.
corpus() without any arguments now returns an expanded data.frame reporting all slots of the corpus class objects, skipping only the data directory of the corpus.
cpos() method for matrix objects that turns a matrix with corpus positions into a vector of integer values now relies on a C-level implementation newly included in the RcppCWB package, that is significantly faster than the best possible implementation in R.
kwic() shows row numbers, which is convenient when referring to specific rows (#184).
as.cqp() now checks whether argument query meets the expectation that it is a query (#191).
make_region_matrix(), which has been used internally only, has been removed. RcppCWB::s_attr_regions() replaces the functionality.
as.speeches()
method had not yet been implemented for nested corpora. A limited rewrite makes this work now (#198).
get_token_stream() method for partition_bundle objects have been addressed: Multiple p-attributes can be used without providing phrases at the same time (#142) and using the subset argument does not depend on using phrases either (#141).
as.sparseMatrix() method is now also defined for DocumentTermMatrix objects (was available previously only for TermDocumentMatrix objects).
hits() method (#195).
get_type() for subcorpus_bundle returns NULL if no type is defined as a matter of consistency (#169).
corpus/subcorpus includes invalid s-attributes, the warning is telling and NULL is returned (#179).
cooccurrences() method - left/right rather than window (#134).
kwic and context now have argument region as an intuitive alternative to named character vectors left and right when expanding match to left and right limitation of an s-attribute.
deparse() within is resolved (#161).
hits() method for the slice virtual class has been removed and the implementation for hits for the subcorpus class is now the real worker, also invoked for hits() for partition. This removes a bug that occurred when applying hits on subcorpus objects, which resulted in a count for the whole corpus.
show()
-method for partition objects resolved when more than one s-attribute has been used to define partition (#170).
left and right of the context()-method for matrix objects, the worker behind the context(), kwic() and cooccurrences() methods did not work as intended for character values specifying an s-attribute. Fixed - it is not possible to use these arguments (#173).
as.TermDocumentMatrix() or as.DocumentTermMatrix() when a s-attribute would not cover the entire corpus has been removed (#177). In this vein, an inefficiency (decoding token stream twice) has been removed, so performance will also be better.
subset() for remote_corpus objects (#181) has been fixed.
context() method, and kwic() for partition or subcorpus objects did not process left and right contexts correctly, if it was a named character vector. Fixed.
hits() method failed for partition_bundle objects when there were no matches for the query. Fixed. (#199 and #163)
p_attributes() method for slice objects had an error when decoding the token stream. Fixed.
format() on a features_ngrams object resulting in an error when using knit_print() on this object has been fixed (#200).
edit() method can now be invoked on a features object (#165).
context()-method for partition_bundle objects always required an explicit statement of the argument positivelist, which is not necessary. Fixed. (#178)
kwic() method is gone as a result of refactoring how the s-attribute is matched (#149). The argument progress has been removed from the method.
as.DocumentTermMatrix() method mistakenly returned a TermDocumentMatrix object. Fixed (#146).
noise()
method misleadingly handled the number of characters provided by minNchar as a maximum threshold, not as a minimum requirement (#135). Fixed.
hits class now describes the data.table in the stat slot of the class in detail.
decode() method for data.table objects shall serve as a more user-friendly access to the efficiency of the RcppCWB::cl_cpos2str() function.
data.frame returned when calling corpus() will now include a column with the encoding of the corpus.
warn argument of the get_template()-method remained unused, resulting in a warning message even if warn was FALSE, resulting in a set of warning messages when calling corpus(). The argument is used as intended now and defaults to FALSE.
as.markdown()-method for subcorpus objects now uses an (internal) default template accessible via polmineR:::default_template, if no template is defined for a corpus.
registry_get_encoding() function returned a length-one character vector if the regular expression to extract the charset corpus property did not yield a match. To prevent errors, it now returns “latin1” as the CWB standard encoding (#159).
knit_print()-method for textstat objects does not accept the three dots argument any more. As an installation of pandoc is necessary to include resulting htmlwidget in an html document, the method will check now whether pandoc is available. If not, a formatted data.table is returned.
knit_print()-method for kwic objects does not have the pagelength argument any more as it has been unused. The pagelength is controlled by the option polmineR.pagelength. Internally, the method will call the method for the textstat superclass of the kwic class, which is newly robust against a missing installation of pandoc.
chisquare()
method needs to increase the number of digits temporarily, but failed to revert to the original value as expected. One implication was that rounding the values in data.table objects would fail, and rounding in general yielded very strange results (#155). Fixed.
as.data.table()-method defined in the data.table package is now reexported and defined and documented for the textstat, regions and bundle class so that it can be used cleanly.
.importPolMineCorpus()-function has been superseded by cwbtools::corpus_install() and has been removed from the package.
cat() has been replaced by message() within functions throughout to meet CRAN requirements.
type has been dropped from the html()-method for partition_bundle objects.
html()-method for character class objects now serves as a worker to generate html from markdown. The html()-method for partition_bundle objects did not return a html class object as stated in the documentation object. Fixed.
store()-method has been declared defunct as it is unnecessary functionality that bloats the package. Using format() in combination with openxlsx::write.xlsx() is the recommended alternative workflow.
mail()-method has been declared defunct and has been removed from the package. A more user-friendly workflow is to use export buttons of the DataTable widgets.
Corpus class has been removed from the package as it has been defunct for a while.
set_template() method on options that may be unnoticed for the user and that potentially violate CRAN policies, the method has been dropped.
s_attributes()-method returned a data.table mixing up rows / columns for subcorpora/partitions with a region matrix that would only include a single set of corpus
decode()
-method now entails the possibility to decode structural and positional attributes selectively, via new arguments p_attributes and s_attributes (#116). Internally, the reliance on coerce()-methods has been replaced by a simpler if-else-syntax. The as(from, "Annotation") option persists, however.
phrases was added to the count()-method for partition_bundle objects.
remote_corpus and the remote_subcorpus class are replaced by a single slot restricted (values TRUE/FALSE) to indicate if a user name and a password are necessary to access a corpus. A file following the conventions of CWB files is assumed to include the credentials for corpus access. This approach avoids the accessibility of the password.
https://hub.docker.com/r/polmine/debian_polminer_min).
corpus()-method that serves as a constructor either for the corpus or the remote_corpus class does not flag default values for the arguments user and password any more. If the argument server is stated explicitly (not NULL, default), these variables will get the value character(). This way, a set of if/else statements can be omitted and it is much easier to implement methods for the remote_corpus class for corpora that are password-protected, or not.
as.list.bundle()-method (previously, there has only been the S4 method). The nice consequence is that lapply() and sapply() can be used on bundle objects now (a subcorpus_bundle, for instance)
count()
-method for partition_bundle objects has been improved, it is twice as fast now (#137).
p_attributes method now accepts an argument decode.
p_attributes-method has been implemented for partition_bundle objects.
polmineR(), the mail-button has been dropped in the kwic, and code can be displayed (using code highlighting)
phrases argument is used are now also available when a phrases object is not passed in.
get_token_stream()-method for partition_bundle objects will now accept an argument phrases (#128).
merge()-method for partition_bundle-objects has been reworked: Substantial performance improvement by relying on RcppCWB::get_region_matrix. Internally, the method performs a check whether the partition/subcorpus objects to be merged are non-overlapping. The default value for the argument verbose is now FALSE, as waiting time is much shorter.
polmineR.warn.size can be used to control the issuing of warnings for large kwic objects.
Cooccurrences objects had not been possible, now at least using integer indices is possible (#114).
count()-method for slice class objects.
corpus() method for a character vector will now abort gracefully with a message if more than one corpus is offered as .Object.
Cooccurrences()-method will now accept zero values (0) for the arguments left and right. Relevant for detecting bigrams / phrases.
data.table of a Cooccurrences object, the NA values are pushed to the end of the table now.
concatenate() method is a worker to collapse tokens into phrases.
Cooccurrences class objects, see pmi()-method.
ngrams()
-method for class data.table - useful if you need to work with decoded corpora.
pmi()-method for the ngrams()-method, to provide a workflow for phrase detection.
enrich() for object of class Cooccurrences will add columns with counts for the co-occurring tokens to the data.table in the slot ‘stat’.
data.table in the stat slot of an ngrams object: Column names will now be “word_1”, “word_2” etc.
count() for subcorpus_bundle objects (just calling callNextMethod() internally) - useful to see the availability of the method in the documentation object.
as.speeches()-method for corpus objects now supports parallelization
DocumentTermMatrix against each other, as a safeguard that different approaches might lead to different results (#139).
phrases and as.phrases()-method for ngrams and matrix objects. The count()-method now accepts an argument phrases. See the documentation (?phrases).
s_attributes()-method is now consistent with the usage of the unique argument (#133).
hits()-method for partition_bundle objects now accepts an argument s_attribute to include metadata in results (#74).
check_cqp_query() function now has a further argument warn. If TRUE (default), a warning is issued, if the query is buggy. The as.phrases()-method will use the function to avoid that buggy CQP queries may be generated.
Corpus class has been re-introduced (temporarily), to avoid an issue with the GermaParl package if the class is not available (#127).
get_template()-method is now defined for the corpus class.
count()
-method with arguments breakdown is TRUE and cqp is TRUE has been awfully slow. Fast now.
boost allows user to opt for the improvement, which will involve decoding the lexicon directly.
merge()-method is implemented for subcorpus_bundle objects now, and has been implemented for subcorpus objects (#76).
kwic view from a cooccurrences object based on more than one p-attribute will work now (#119).
decode()-method has been defined for integer vectors. Internally it will decide whether decoding token ids is sped up by reading in the lexicon file directly. The behavior can be triggered explicitly by setting the argument boost as TRUE.
get_token_stream()-method will use the new decode()-method for integer values internally. The argument boost is used by the get_token_stream() to control the approach.
get_token_stream for partition_bundle.
partition_bundle()-methods defined for character, corpus and partition objects now call the split()-methods for corpus and subcorpus objects, resulting in a huge performance gain (#112).
Cooccurrences()-method (#117).
corpus class includes a (new) slot size, just as the regions and the subcorpus classes.
split()-method for corpus objects now accepts the argument xml, to indicate whether the annotation structure of the corpus is flat or nested.
partition now includes a prototype defining default values for the slots ‘stat’ (a data.table) and the slot ‘size’ (NA_integer_).
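How a prototype supplies slot defaults can be sketched with a small S4 class. The class demo_partition is a hypothetical stand-in, not the polmineR partition class, and a plain data.frame is used in place of a data.table so that the example runs with base R only.

```r
library(methods)

# Hypothetical stand-in class (not the polmineR partition class): the
# prototype supplies defaults so that new() without arguments is usable.
setClass(
  "demo_partition",
  representation(stat = "data.frame", size = "integer"),
  prototype(stat = data.frame(), size = NA_integer_)
)

p <- new("demo_partition")  # no slots supplied by the caller
size_missing <- is.na(p@size)  # the size slot holds the NA_integer_ default
stat_rows <- nrow(p@stat)      # the stat slot holds an empty table
```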
This avoids that an incomplete initialization of a partition object will result in an error.
kwic()-method is now available for partition_bundle/subcorpus_bundle-objects (#73).
kwic()-method work correctly for partition objects that result from a merge() operation, the cpos()-method for slice objects will extract strucs based on the s-attribute defined in the slot s_attr_strucs rather than the last s-attribute in the list of the slot s-attributes.
subcorpus is exported for usage in other packages.
progress of the count()-method for partition_bundle objects is now FALSE.
get_type()-method is now defined for the corpus class.
corpus object into a subcorpus object, to recover functionality used (internally) that relied on the former Corpus reference class.
Cooccurrences()-method is now defined for the corpus-class, too. The Cooccurrences()-method for the character class now relies on this method.
Corpus reference class has been dropped from the code altogether: As roxygen2::roxygenize() started to check the documentation of R6 classes and reference classes, the poor documentation of this class started to provoke many errors. Rather than starting to write documentation for a deprecated class, getting rid of an outdated and poorly documented class appeared to be the better solution.
kwic object from a cooccurrences object. Introduced to serve as a basis for quantitative/qualitative workflows, e.g. integrated in a flexdashboard.
s_attributes() method for corpus objects when values are requested for an s-attribute that does not exist (#122).
decode()-method for subcorpus objects, s-attributes were not decoded appropriately (#120). Fixed. When decoding a corpus/subcorpus, the struc column is kept (again).
.onLoad()
whether polmineR is loaded
from the repository directory will ensure that temporary registry files
will not be gone when calling devtools::document()
(#68).as.speeches()
-method for corpus
objects, setting progress
as FALSE
did not
suppress the display of a progress bar. Solved.subcorpus_bundle
that resulted from CQP queries
being turned into invalid column names.partition_bundle
was an empty string and calling
count()
on this object has been removed (#121).?polmineR
)corpus
class has been put in a shape to become the
default point of departure of most workflows. All core methods are now
available for the corpus
class, and have been implemented
newly if necessary, e.g. show()
and
size()
-method. The constructor method for a
corpus
object, the corpus()
method, will now
check whether the character vector with the corpus ID refers to an
available corpus, whether all letters are upper case and issue
informative warnings and error messages.s_attributes()
-method for corpus
objects has been reworked: It will decode binary files directly, without
reliance on the corpus library functions, which is significantly
faster.Corpus
reference class is now obsolete after the
introduction of the S4 corpus
class. To maintain the
functionality not covered otherwise, new generics get_info
and show_info
have been introduced and defined for the
corpus
class.subcorpus
class have been
expanded so that this class can supersede the partition
class: Methods newly available are cpos()
,
count()
, p_attributes()
,
s_attributes()
, get_token_stream()
, and
size()
. Technically, there is a virtual
slice
-class, from which subcorpus
inherits
(methods called via callNextMethod()
).subset()
-method for the corpus
and
subcorpus
classes to generate subcorpora
(i.e. subcorpus
objects) has been introduced. It
outperforms the partition()
method. The
subset()
-method for corpus
and
subcorpus
objects will be the default way to work with non-standard
evaluation in a manner that feels “R-ish” (#40).zoom()
-method that has been introduced
experimentally has been dropped again in favor of the
subset()
-method to get subcorpus
objects from
corpus
and subcorpus
objects. A set of
experimental methods for an initial check of the feasibility of a
non-standard evaluation approach to the generation of subcorpora has
been dropped (methods $
, ==
, !=
,
zoom
for corpus
-class).partition
class
(inheriting from the textstat
class) to the
subcorpus
class (inheriting from the textstat
class), there is a new coerce()
-method to turn a
partition
object into a subcorpus
object.remote_corpus
-class is the basis for accessing
remote corpora. A remote_subcorpus
can be derived from a
remote_corpus
. Methods available for remote corpora and
subcorpora remain limited at this stage.subcorpus_bundle
class now inherits from
partition_bundle
. This is not intended to be a long-term
solution, but facilitates the implementation of new workflows based on
the subcorpus
class rather than the partition
class.polmineR
did not
have safeguards if the suggested packages shiny and shinythemes
were not installed. Now there will be a conditional installation of the
packages required for running the shiny app.CorpusOrSubcorpus
has been
removed. The ngrams
-method now applies for
corpus
and subcorpus
objects.label()
-method, present for a while, is superseded
by an edit()
-method now. It will call a shiny gadget either
using DataTables or Handsontable. The former Labels
reference class has been turned into an S4 class, because the desired
reference logic can also be achieved with a data.table
in a
slot of the labels class.table
-slot of the kwic
class has been
renamed as stat
slot (a data.table
), so that
the kwic
class can now inherit from the
textstat
class. The enrich()
-method for
objects of class kwic
now includes a new argument
extra
that will add extra tokens to the left of the windows
for concordances so that qualitative inspections for query hits can work
with more context.as.TermDocumentMatrix()
and the
as.DocumentTermMatrix()
-methods are now also defined for
kwic
objects. They work exactly the same as for the
context
class. To avoid having to write new methods, a new
neighborhood
virtual class has been introduced. The
aforementioned methods are defined for the virtual class and are
available for context and kwic class objects.get_token_stream()
for a partition_bundle
object.Cooccurrences()
-method is now available for
subcorpus
-objects (#88).kwic
-object into
a context
-object. The neighborhood
virtual
class could be discarded again, and a bug could be removed that left an
enrich()
-operation for kwic
objects (argument
p_attribute
) ineffectual (#103).cpos
to FALSE
in the kwic()
-method has been solved
(#106), and the documentation of the argument has been rewritten so that
it includes a warning against using the argument incorrectly.use()
(#72).regex
to the
cpos()
-method (for corpus
objects), which will
interpret argument query
as a regular expression. This may
be faster than taking query
as an outright CQP query.dispersion
-method (#92).p_attribute
and positivelist
by default.format()
-method is used to create proper output in
the cooccurrences of the shiny app.registry()
-function.ll
-method had been
somewhat mixed up, which is repaired now. Tokens with NA values for the
ll-test will show up at the end of the table.registry_move()
-function, used only internally at
this stage, is exported now so that it can be used by other
packages.the get_token_stream()
-method for
regions
objects was a data.table
. The behavior
is now in line with the other get_token_stream()
methods.tempcorpus()
-method and the tempcorpus
class have been removed from the package, having become utterly
deprecated.summary()
-method for partition
-class
objects has been turned into a method for the count
-class,
to eliminate an inconsistency. The example of a workflow has been moved
to the documentation object for the count
-class.browse()
-method has not proven to be useful and has
been removed from the package. A new browse()
-function is
introduced to throw a warning, if browse should be called
nevertheless.split()
-method for
partition
-objects improved the readability of the code, but
the performance gain is minimal.kwic_bundle
-class has been introduced, a list of
kwic
objects can be turned into this new class using
as.bundle
.context()
-method will now take again as input
character vectors for the arguments left
and
right
to expand to the left and right boundaries of the
designated region (#87).kwic()
-method. This ensures that
subsequent highlighting operations can assign new colors (#38).dispersion()
that
results are reported for all values of structural attributes, including
those with zero matches. (#104)cpos
-method for
matrix
which unfolds a matrix with regions of corpus
positions, useful for operations that require many calls.count
-method for partition_bundle
has
been reworked and is much faster and more memory efficient.as.TermDocumentMatrix()
for
partition_bundle
optimized to work efficiently with large
corpora.as.corpusEnc()
-function uses the
localeToCharset()
-function from the utils package to
determine the charset of input strings. On RStudio Server, we have seen
cases when the return value is NA. Then it will be assumed that the
locale is UTF-8.context()
/kwic()
method that led to superfluous words in the right context.as.data.frame()
-method for kwic
-objects when
no metadata were added.count()
-method for
partition_bundle
-objects did not perform
iconv()
if necessary - this has been corrected.kwic
object did not
reduce the cpos
table accordingly. This has been
corrected.as.speeches()
-method failed to handle situations
correctly, when one speaker occurring in the corpus only contributed one
single region to the entire corpus (#86). This behavior has been
debugged.partition_bundle
started to throw a
warning that an argument arrives at the cpos()
-method that
is not used. The cause for the warning message is removed, an additional
unit test has been introduced to recognize issues with the
count
-method (#90).kwic()
-method threw an error when trimming the
matches by using a positivelist or a stoplist resulted in no remaining
matches. The method will now return a NULL object and keep issuing a
warning if no matches remain after filtering (#91).subcorpus
object, resulting in
false results when counting over subcorpora. Fixed.dispersion()
(#62).as.speeches()
-method, the argument
verbose
was not used (#64) - this had been addressed when
solving issue #86.subcorpus
into
a String
was removed: A semicolon was not recognized as a
punctuation mark. This makes decoding subcorpora as
Annotation
more robust. The respective unit test has been
updated.read()
on a kwic
object works
again (#84).as.VCorpus()
method that failed are now
ok (#77). The reason was that get_token_stream()
assumed
implicitly that a p-attribute “pos” is present, which is not the case
for the REUTERS test corpus.s_attributes
-method was removed that
would make retrieving the metadata for the first strucs (index 0) of a
s-attribute impossible.as.DocumentTermMatrix
that started
to occur with the introduction of the subcorpus_bundle
class (#100).kwic
-method for
character
that prevented using different values for right
and left context (#101).as.DocumentTermMatrix()
on a corpus stated by corpus ID /
length-one character vector (#105).markdown::markdownToHTML
by a direct call to
markdown::renderMarkdown
. On this occasion, some overhead
preparing fulltext output has been removed.kwic
objects has been removed (#102).as.TermDocumentMatrix()
-method for
neighborhood
-objects returned a DocumentTermMatrix
(unintentionally); this bug has been removed.pmi()
-method and
t_test()
-method.s_attributes()
-method for
corpus
-class.corpus
-class has been
rewritten entirely, and the documentation for the
remote_corpus
-class has been integrated, whereas methods
applicable to the remote_corpus
-class were integrated into
the documentation objects for the respective methods.get_token_stream()
-method has
been reworked and expanded thoroughly (#65). On this occasion, test
coverage for the method has been improved significantly. (Everything is
tested now apart from parallelization.)Cooccurrences()
-method and a
Cooccurrences
-class have been migrated from the
(experimental) polmineR.graph package to polmineR to generate and manage
all cooccurrences in a corpus/partition
. A
cooccurrences()
-method produces a subset of
Cooccurrences
-class object and is the basis for ensuring
that results are identical.data_dir()
will return this temporary data
directory. The use()
-function will now check for non-ASCII
characters in the path to binary corpus data and move the corpus data to
the temporary data directory (a subdirectory of the directory returned
by data_dir()
), if necessary. An argument tmp
added to use()
will force using a temporary directory. The
temporary files are removed when the package is detached.zoom()
-method. See documentation
for (new) corpus
-class (?"corpus-class"
) and
extended documentation for partition
-class
(?"partition-class"
). A new corpus()
-method
for character vector serves as a constructor. This is a beginning of
somewhat re-arranging the class structure: The
regions
-class now inherits from the new
corpus
-class, and a new subcorpus
-class
inherits from the regions
-class.check_cqp_query()
offers a preliminary
check whether a CQP query may be faulty. It is used by the
cpos()
-method, if the new argument check
is
TRUE. All higher-level functions calling cpos()
also
include this new argument. Faulty queries may still cause a crash of the
R session, but the most common source is hopefully prevented now.format()
-method is defined for textstat
,
cooccurrences
, and features
, moving the
formatting of tables out of the view()
, and
print()
-methods. This will be useful when including tables
in R Markdown documents.highlight()
-method for character
and
html
objects now has the arguments regex
and
perl
, so that regular expressions can be used for
highlighting (#99).as.data.frame()
-method for
kwic
-objects has seen a small performance improvement, and
is more robust now if the order of columns changes unexpectedly.registry()
and data_dir()
now accept an argument pkg
. The functions will return the
path to the registry directory / the data directory within a package, if
the argument is used.data.table
-package used to be imported entirely,
now the package is imported selectively. To avoid namespace conflicts,
the former S4 method as.data.table()
is now an S3 method.
Warnings appearing if the data.table
package is loaded
after polmineR are now omitted.coerce()
-methods to turn textstat
,
cooccurrences
, features
and kwic
objects into htmlwidgets now set a pageLength
.partition_bundle
objects:
[[<-
, $
, $<-
textstat
objects.p_attribute
has been added to the
kwic
-class; kwic()
-methods and methods to
process kwic
-objects are now able to use the attribute thus
indicated, and not just the p-attribute “word”.size()
-method for context
-objects
will return the size of the corpus of interest (coi) and the reference
corpus (ref).encoding()
-method for character vector.name()
-method for character vector.count()
-method for context
-objects
will return the data.table
in the stat
-slot
with the counts for the tokens in the window.decode()
-function replaces a
decode()
-method and can be applied to partitions. The
return value is a data.table
which can be coerced to a
tibble
, serving as an interface to tidytext (#37).ngrams()
-method will work for corpora, and a new
show()
-method for textstat
-object generates a
proper output (#27).tempdir()
is wrapped into normalizePath(…,
winslash = "/"), to avoid a mixture of file separators in a path, which
may cause problems on Windows systems.kwic()
-method for corpora returned one surplus
token to the left and to the right of the query. The excess tokens are
now removed.kwic()
-method for
character
-objects did not include the correct
position of matches in the cpos
slot. Corrected.partition_bundle
using the
as.speeches()
-method, an error could occur when an empty
partition had been generated accidentally. The error has been removed. (#50)as.VCorpus()
-method is not available if the
tm
-package has been loaded previously. A coerce method
(as(OBJECT, "VCorpus")) solves the issue. The
as.VCorpus()-method
is still around, but serves as a wrapper for the formal coerce-method
(#55).verbose
as used by the
use()
-method did not have any effect. Now, messages are not
reported as would be expected, if verbose
is
FALSE
. On this occasion, we took care that corpora that are
activated are now reported in capital letters, which is consistent with
the uppercase logic you need to follow when using corpora. (#47)context()
-method would occur at the very beginning
or very end of a corpus and the window would transgress the beginning /
end of the corpus without being checked (#44).as.speeches()
-function caused an error when the
type of the partition was not defined. Solved (#57).TermDocumentMatrix
from a partition_bundle
if the partitions in the
partition_bundle
were not named. The fix is to assign
integer numbers as names to the partitions (#58).ll()
,
and chisquare()
-methods to make the statistical procedure
used transparent.cooccurrences()
-method to
explain subsetting results vs applying positivelist/negativelist
(#28).round()
-method for
textstat
-objects that will show up in documentation of
textstat
class.mail()
-method (#31).decode()
-function, using the
REUTERS corpus replaces the usage of the GERMAPARLMINI corpus, to reduce
time consumed when checking the package.weigh()
-method has
been implemented for the classes count
and
count_bundle
. Via inheritance, it will also be available
for the partition
- and
partition_bundle
-classes. Then, a new
summary()
-method for partition
-class objects
is introduced. If the object has been weighed, the list that is returned
will include a report on weights. There is an example that explains the
workflow.partition_bundle
-method for
context
-objects has been reworked entirely (and is working
again); a new partition
-method for
context
-objects has been introduced. Both steps are
intended for workflows for dictionary-based sentiment analysis.highlight()
-method is now implemented for class
kwic
. You can highlight words in the neighborhood of a node
that are part of a dictionary.knit_print()
-method for textstat
-
and kwic
-objects offers a seamless inclusion of analyses in
Rmarkdown documents.coerce()
-method to turn a kwic
-object
into an htmlwidget has been singled out from the
show()
-method for kwic
-objects. Now it is
possible to generate an htmlwidget from a kwic object, and to include the
widget in an Rmarkdown document.coerce()
-method to turn
textstat
-objects into an htmlwidget (DataTable), very
useful for Rmarkdown documents such as slides.html()
-method will allow
to define a scroll box. Useful to embed a fulltext output in an Rmarkdown
document.partition_bundle
-class, rather than inheriting from
bundle
-class directly, will now inherit from the
count_bundle
-class.use()
-function is limited now to activating the
corpus in data packages. Having introduced the session registry,
switching registry directories is not needed any more.as.regions()
-function has been turned into an
as.regions()
-method to have a more generic tool.context
-method, so that full
use of data.table
speeds up things.highlight()
-method allows definitions of terms to
be highlighted to be passed in via three dots (…); no explicit list
necessary.as.character()
-method for kwic-class objects is
introduced.size_coi
-slot (coi for corpus of interest) of the
context
-object included the node; the node (i.e. matches
for queries) is excluded now from the count of size_coi.use()
, the registry directory is reset for
CQP, so that the corpora in the package that have been activated can be
used with CQP syntax.s_attributes()
-method for
partition
-objects: “fast track” was activated without
preconditions.kwic
-output after highlighting.meta
has been
renamed to s_attributes
for the kwic()
-method
for context
-objects, and for the
enrich()
-method for kwic
-objects.s_attribute
to check for integrity within a struc has been
renamed into boundary
.kwic
-objects has been reworked
thoroughly.