
Overview of the phonics Package

James P. Howard, II

2021-07-11

The phonics package for R is designed to provide a variety of phonetic indexing algorithms in common and not-so-common use today. The algorithms generally reduce a string to a symbolic representation approximating the sound made by pronouncing the string. They can be used to match names and words, and as a proxy for assorted string distance algorithms.

Basic Usage

All algorithms, except the Match Rating Approach, accept a character vector or vector of character vectors as the input. Each element is converted to its phonetic spelling using the relevant algorithm. As examples, consider the Soundex and Refined Soundex algorithms. The Soundex algorithm is implemented as the soundex function and the Refined Soundex method as the refinedSoundex function. The following examples show both.

library("phonics")

x1 <- "Catherine"
x2 <- "Kathryn"
x3 <- "Katrina"
x4 <- "William"

x <- c(x1, x2, x3, x4)

soundex(x1)

## [1] "C365"

soundex(x2)

## [1] "K365"

soundex(x)

## [1] "C365" "K365" "K365" "W450"

refinedSoundex(x1)

## [1] "C30609080"

refinedSoundex(x2)

## [1] "K3060908"

Both functions accept a maxCodeLen argument that limits the length of the returned code. Except where noted, all the algorithms support the maxCodeLen option to change the maximum or expected code length returned, as appropriate.
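As a minimal sketch of the maxCodeLen option, the calls below request shorter codes than the defaults. The values 3 and 5 are arbitrary choices for illustration, not recommendations from the package, and the output is not shown because it depends on how each algorithm truncates or pads its code.

```r
library("phonics")

x1 <- "Catherine"

# Ask for a Soundex code of at most three characters (illustrative value).
short_code <- soundex(x1, maxCodeLen = 3)

# Ask for a Refined Soundex code of at most five characters (illustrative value).
short_refined <- refinedSoundex(x1, maxCodeLen = 5)

short_code
short_refined
```

Shorter codes make matching more permissive, since more distinct names collapse to the same code.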

Beyond soundex, additional algorithms are available, as shown in the following table.

Algorithm	Function Name
Caverphone	caverphone
Cologne Phonetic	cologne
Lein Name Coding	lein
Metaphone	metaphone
New York State Identification and Intelligence System	nysiis
Oxford Name Compression Algorithm	onca
Phonex	phonex
Roger Root Name Coding Procedure	rogerroot
Statistics Canada Name Coding	statcan
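
The functions in the table are called the same way as soundex. The following sketch applies a few of them to the same four names and collects the results in a data frame for side-by-side comparison. The choice of metaphone and nysiis, and the data frame layout, are only for illustration; the output is omitted here.

```r
library("phonics")

x <- c("Catherine", "Kathryn", "Katrina", "William")

# One column per algorithm, one row per input name.
codes <- data.frame(
  name      = x,
  soundex   = soundex(x),
  metaphone = metaphone(x),
  nysiis    = nysiis(x),
  stringsAsFactors = FALSE
)

codes
```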

Match Rating Approach

Unlike the other algorithms described here, the Match Rating Approach (MRA) is a two-stage algorithm with separate encoding and comparison routines. For the single-stage algorithms, the codes themselves can be compared directly. For instance, the results of Soundex on two different strings can be tested for equality:

soundex(x1) == soundex(x2)

## [1] FALSE

soundex(x2) == soundex(x3)

## [1] TRUE
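
Because equal codes indicate a phonetic match, the same idea extends from pairwise tests to indexing a whole vector. A minimal sketch, using base R's split to group names by their Soundex code (the variable names are illustrative):

```r
library("phonics")

x <- c("Catherine", "Kathryn", "Katrina", "William")

# Group the names by Soundex code; each list element holds
# the names that share one code.
groups <- split(x, soundex(x))

groups
```

Given the codes shown earlier, Kathryn and Katrina fall into the same group (K365), while Catherine and William each form a group of their own.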

However, the MRA encoding algorithm may return different encodings for similar strings that should match. The second stage is therefore used to compare two MRA-encoded strings. The encoding algorithm is provided by mra_encode and the comparison algorithm is provided by mra_compare.

(mra1 = mra_encode("Katherine"))

## [1] "KTHRN"

(mra2 = mra_encode("Catherine"))

## [1] "CTHRN"

(mra3 = mra_encode("Katarina"))

## [1] "KTRN"

mra_compare(mra1, mra2)

## [1] TRUE

mra_compare(mra1, mra3)

## [1] TRUE

mra_compare(mra2, mra3)

## [1] TRUE

The threshold necessary to establish similarity gets smaller as the encoded strings get longer. This leads to some surprising results. For instance, Catherine and William match as names.

mra_compare(mra_encode("Catherine"), mra_encode("William"))

## [1] TRUE
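
Since an MRA comparison always needs both stages, it can be convenient to wrap them in one function. The helper below, mra_match, is a hypothetical name and not part of the phonics package; it simply chains the two calls shown above.

```r
library("phonics")

# Hypothetical convenience wrapper (not part of phonics):
# encode both strings, then compare the encodings.
mra_match <- function(a, b) {
  mra_compare(mra_encode(a), mra_encode(b))
}

mra_match("Katherine", "Catherine")
mra_match("Catherine", "William")
```

Both calls repeat comparisons already shown in this section, so they should return TRUE.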

Summary

This paper has outlined the phonics package for R. The package includes several algorithms, suitable for English, German, and French, for phonetically reducing names and strings. These can be used for comparison and indexing, as well as for later record linkage.
