Type: Package
Title: Cluster Strings by Edit-Distance
Version: 1.0
Author: Dan S. Reznik
Maintainer: Dan S. Reznik <dreznik@gmail.com>
Description: Returns an edit-distance-based clustering of an input vector of strings. Each cluster contains a set of strings with small mutual edit-distance (e.g., Levenshtein, optimal string alignment, Damerau-Levenshtein), as computed by stringdist::stringdist(). The set of all mutual edit-distances is then used by graph algorithms (from package 'igraph') to single out subsets of high connectivity.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Imports: magrittr, dplyr, stringi, stringr, stringdist, igraph, assertthat, forcats, rlang, tidygraph, ggraph, ggplot2
Depends: R (≥ 3.1)
RoxygenNote: 6.1.1
NeedsCompilation: no
Packaged: 2019-03-26 18:10:58 UTC; dreznik
Repository: CRAN
Date/Publication: 2019-03-30 16:10:03 UTC
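
The idea in the Description field (pairwise edit distances, then a graph, then densely connected subsets) can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: utils::adist() (plain Levenshtein) stands in for stringdist::stringdist(), a small flood-fill loop stands in for the 'igraph' connected-components routine, and the names 'adj', 'cluster' and 'df_sketch' are hypothetical.

```r
# Sample strings from the Examples sections; "\u00e7" is the c-cedilla.
s_vec <- c("alcool", "alcohol", "alcoholic", "brandy", "brandie", "cacha\u00e7a")
max_dist <- 3

# Pairwise Levenshtein distances (stand-in for stringdist, method = "lv").
d <- utils::adist(s_vec)

# Two strings are linked when their distance is at most max_dist.
adj <- d <= max_dist
diag(adj) <- FALSE

# Label connected components by flooding outwards from each unlabelled string.
n <- length(s_vec)
cluster <- rep(NA_integer_, n)
next_id <- 0L
for (start in seq_len(n)) {
  if (!is.na(cluster[start])) next
  next_id <- next_id + 1L
  frontier <- start
  while (length(frontier) > 0) {
    cluster[frontier] <- next_id
    neighbours <- which(colSums(adj[frontier, , drop = FALSE]) > 0)
    frontier <- neighbours[is.na(cluster[neighbours])]
  }
}

df_sketch <- data.frame(string = s_vec, cluster = cluster,
                        stringsAsFactors = FALSE)
print(df_sketch)
```

With max_dist = 3 the three "alco..." spellings end up in one cluster, the two "brand..." spellings in another, and the last string on its own.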

Plot string clusters as graph.

Description

Plots the string clusters returned by 'cluster_strings()' as a graph, using 'ggraph'.

Usage

cluster_plot(cluster, min_cluster_size = 2, label_size = 2.5,
  repel = T)

Arguments

cluster

string clusters returned from 'cluster_strings()'

min_cluster_size

minimum size for clusters to be plotted.

label_size

font size of the cluster name labels.

repel

whether to "repel" the cluster names so that they do not overlap.

Value

a graph plot (using 'ggraph') of the string clusters.

Examples

s_vec <- c("alcool","alcohol","alcoholic","brandy","brandie","cachaça")
s_clust <- cluster_strings(s_vec,method="lv",max_dist=3,algo="cc")
cluster_plot(s_clust,min_cluster_size=1)

Cluster Strings by Edit-Distance

Description

Groups the strings of an input vector into clusters of small mutual edit-distance.

Usage

cluster_strings(s_vec, clean = T, method = "osa", max_dist = 3,
  algo = "cc")

Arguments

s_vec

a vector of character strings

clean

whether to squish whitespace and de-duplicate s_vec

method

one of "osa" (optimal string alignment), "lv" (Levenshtein), or "dl" (Damerau-Levenshtein), as in 'stringdist'

max_dist

maximum edit distance (typically Damerau-Levenshtein) between related strings.

algo

one of "cc" (connected components) or "eb" (edge betweenness)
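
To illustrate how the two 'algo' choices can differ, the sketch below applies both ideas to a toy graph with 'igraph'. It assumes that "cc" corresponds to igraph::components() and "eb" to igraph::cluster_edge_betweenness(); the documentation above does not say which 'igraph' functions are actually called, so treat that mapping, and the toy graph, as assumptions.

```r
# Requires the 'igraph' package.
# Two triangles (1-2-3 and 4-5-6) joined by a single bridge edge 3-4.
g <- igraph::make_graph(c(1, 2,  2, 3,  1, 3,
                          3, 4,
                          4, 5,  5, 6,  4, 6),
                        directed = FALSE)

# Connected components: any path, however thin, merges vertices.
cc <- igraph::components(g)$membership

# Edge betweenness: edges that carry many shortest paths (here the bridge)
# are removed first, which separates the two triangles.
eb <- as.integer(igraph::membership(igraph::cluster_edge_betweenness(g)))

print(cc)
print(eb)
```

In terms of strings, connected components will chain together any strings linked through intermediate spellings, while edge betweenness can split such a chain at its weakest link.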

Value

a data frame containing cluster membership for each input string (the example below reads it from the 'df_clusters' element of the returned object)

Examples

s_vec <- c("alcool","alcohol","alcoholic","brandy","brandie","cachaça")
s_clust <- cluster_strings(s_vec,method="lv",max_dist=3,algo="cc")
s_clust$df_clusters
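
The choice of 'method' matters mostly for transposed characters. The following sketch calls stringdist::stringdist() directly to compare the three methods. The misspelling "barndy" and the pair "ca"/"abc" are illustrative inputs, not part of the package's examples.

```r
# Requires the 'stringdist' package.
methods <- c("lv", "osa", "dl")

# One swap of adjacent characters: "ra" becomes "ar".
d_swap <- sapply(methods, function(m) {
  stringdist::stringdist("brandy", "barndy", method = m)
})

# A pair where "osa" and "dl" disagree: "osa" may not edit the same
# substring twice, whereas "dl" may.
d_pair <- sapply(methods, function(m) {
  stringdist::stringdist("ca", "abc", method = m)
})

print(d_swap)
print(d_pair)
```

Plain Levenshtein ("lv") counts the swap as two edits, while "osa" and "dl" count it as one, so with a small max_dist the choice of method can decide whether two spellings are linked.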

Distinct words in Cervantes' "Don Quijote".

Description

Data frame listing all distinct words (length > 3), their length, and their frequency of appearance in the text.

Usage

quijote_words

Format

A data frame with about 22,000 rows and 3 columns:

word

the unique word, in Spanish

len

the word's length

freq

number of appearances in text

Source

http://www.gutenberg.org/cache/epub/2000/pg2000.txt
