Compute string distance the tidy way. Built on top of the ‘stringdist’ package.
You can install the development version from GitHub with:
devtools::install_github("ColinFay/tidystringdist")
The stable version is available with:
install.packages("tidystringdist")
First, you need to create a tibble with the combinations of words you want to compare. You can do this with the tidy_comb and tidy_comb_all functions. The first takes a base word and combines it with each element of a list or a column of a data.frame; the second builds all the possible pairs from a list or a column.
If you already have a data.frame with two columns containing the strings to compare, you can skip this part.
library(tidystringdist)
tidy_comb_all(LETTERS[1:3])
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 A B
#> 2 A C
#> 3 B C
tidy_comb_all(iris, Species)
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 setosa versicolor
#> 2 setosa virginica
#> 3 versicolor virginica
tidy_comb("Paris", state.name[1:3])
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 Alabama Paris
#> 2 Alaska Paris
#> 3 Arizona Paris
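To make the idea of "all the possible pairs" concrete, here is a minimal base R sketch that reproduces the first result above with combn(). It only illustrates the shape of the output; it is not necessarily how tidy_comb_all is implemented, and the name pairs_df is made up for this example.

```r
# All unordered pairs of a character vector, using base R only.
# combn() returns one pair per column, so we transpose to get one pair per row.
pairs <- t(combn(LETTERS[1:3], 2))

# Two character columns named V1 and V2, like the tibbles shown above
pairs_df <- data.frame(
  V1 = pairs[, 1],
  V2 = pairs[, 2],
  stringsAsFactors = FALSE
)

pairs_df
```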
Once you’ve got this data.frame, you can use tidy_stringdist to compute string distances. This function takes a data.frame, the two columns containing the strings, and a stringdist method.
Note that if you’ve used tidy_comb or tidy_comb_all to create your data.frame, you won’t need to set the column names.
library(dplyr)
data(starwars)
tidy_comb_sw <- tidy_comb_all(starwars, name)
tidy_stringdist(tidy_comb_sw)
#> Warning in do_dist(a = b, b = a, method = method, weight = weight, maxDist
#> = maxDist, : Non-printable ascii or non-ascii characters in soundex.
#> Results may be unreliable. See ?printable_ascii.
#> # A tibble: 3,741 x 12
#> V1 V2 osa lv dl hamming lcs qgram
#> * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Luke Skywalker C-3PO 14 14 14 Inf 19 19
#> 2 Luke Skywalker R2-D2 14 14 14 Inf 19 19
#> 3 Luke Skywalker Darth Vader 11 11 11 Inf 17 17
#> 4 Luke Skywalker Leia Organa 11 11 11 Inf 17 15
#> 5 Luke Skywalker Owen Lars 12 12 12 Inf 15 11
#> 6 Luke Skywalker Beru Whitesun lars 16 16 16 Inf 22 18
#> 7 Luke Skywalker R5-D4 14 14 14 Inf 19 19
#> 8 Luke Skywalker Biggs Darklighter 13 13 13 Inf 21 19
#> 9 Luke Skywalker Obi-Wan Kenobi 14 14 14 14 24 22
#> 10 Luke Skywalker Anakin Skywalker 5 5 5 Inf 8 8
#> # ... with 3,731 more rows, and 4 more variables: cosine <dbl>,
#> # jaccard <dbl>, jw <dbl>, soundex <dbl>
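Each distance column corresponds to a method of the underlying ‘stringdist’ package. The sketch below calls stringdist::stringdist() directly on three pairs from the table, which may help when reading the Inf values in the hamming column: the Hamming distance is only defined for strings of equal length. The variable names are chosen for this example only.

```r
library(stringdist)

# Pairs taken from the starwars output above
osa_close <- stringdist("Luke Skywalker", "Anakin Skywalker", method = "osa")
ham_diff  <- stringdist("Luke Skywalker", "C-3PO", method = "hamming")
ham_same  <- stringdist("Luke Skywalker", "Obi-Wan Kenobi", method = "hamming")

osa_close   # 5, as in row 10 of the table
ham_diff    # Inf, because the two strings differ in length
ham_same    # 14, both names have 14 characters and no position matches
```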
The default call computes all the methods. You can select specific methods with the method argument:
tidy_stringdist(tidy_comb_sw, method = c("osa","jw"))
#> # A tibble: 3,741 x 4
#> V1 V2 osa jw
#> * <chr> <chr> <dbl> <dbl>
#> 1 Luke Skywalker C-3PO 14 1.0000000
#> 2 Luke Skywalker R2-D2 14 1.0000000
#> 3 Luke Skywalker Darth Vader 11 0.5752165
#> 4 Luke Skywalker Leia Organa 11 0.5335498
#> 5 Luke Skywalker Owen Lars 12 0.4624339
#> 6 Luke Skywalker Beru Whitesun lars 16 0.4656085
#> 7 Luke Skywalker R5-D4 14 1.0000000
#> 8 Luke Skywalker Biggs Darklighter 13 0.5728291
#> 9 Luke Skywalker Obi-Wan Kenobi 14 0.6349206
#> 10 Luke Skywalker Anakin Skywalker 5 0.2816558
#> # ... with 3,731 more rows
The goal is to provide a convenient interface to work with other tools from the tidyverse.
tidy_stringdist(tidy_comb_sw, method= "osa") %>%
filter(osa > 20) %>%
arrange(desc(osa))
#> # A tibble: 11 x 3
#> V1 V2 osa
#> <chr> <chr> <dbl>
#> 1 C-3PO Jabba Desilijic Tiure 21
#> 2 C-3PO Wicket Systri Warrick 21
#> 3 R2-D2 Wicket Systri Warrick 21
#> 4 R5-D4 Wicket Systri Warrick 21
#> 5 Jabba Desilijic Tiure IG-88 21
#> 6 Jabba Desilijic Tiure Cordé 21
#> 7 Jabba Desilijic Tiure R4-P17 21
#> 8 Jabba Desilijic Tiure BB8 21
#> 9 IG-88 Wicket Systri Warrick 21
#> 10 Wicket Systri Warrick R4-P17 21
#> 11 Wicket Systri Warrick BB8 21
starwars %>%
filter(species == "Droid") %>%
tidy_comb_all(name) %>%
tidy_stringdist() %>%
summarise_if(is.numeric, mean)
#> # A tibble: 1 x 10
#> osa lv dl hamming lcs qgram cosine jaccard jw
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4.4 4.4 4.4 Inf 7.4 7.4 0.8304896 0.8671032 0.6422222
#> # ... with 1 more variables: soundex <dbl>
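Another pipeline this interface lends itself to is a rough "closest match" lookup: combine a word with a set of candidates, compute one distance, and sort. The sketch below reuses only the calls shown in this README. The choice of "jw" and of state.name as candidates is arbitrary, and closest is a name made up for this example.

```r
library(tidystringdist)
library(dplyr)

# Pair "Paris" with every US state name, compute the Jaro-Winkler distance,
# and keep the three pairs with the smallest distance.
closest <- tidy_comb("Paris", state.name) %>%
  tidy_stringdist(method = "jw") %>%
  arrange(jw) %>%
  head(3)

closest
```

A smaller jw value means the two strings are more alike, so sorting in ascending order puts the best candidates first.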
Questions and feedback are welcome!