
Helper functions

2024-09-20

There are several “helper” functions which can simplify the definition of complex patterns. First we define some functions that will help us display the patterns:

one.pattern <- function(pat){
  if(is.character(pat)){
    pat
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
show.patterns <- function(...){
  L <- list(...)
  str(lapply(L, one.pattern))
}

nc::field for reducing repetition

The nc::field function can be used to avoid repetition when defining patterns of the form variable: value. The example below shows three (mostly) equivalent ways to write a regex that captures the text after the colon and space; the captured text is stored in the group (or output column) named variable:

show.patterns(
  "variable: (?<variable>.*)",      #repetitive regex string
  list("variable: ", variable=".*"),#repetitive nc R code
  nc::field("variable", ": ", ".*"))#helper function avoids repetition
#> List of 3
#>  $ : chr "variable: (?<variable>.*)"
#>  $ : chr "(?:variable: (.*))"
#>  $ : chr "(?:variable: (?:(.*)))"

Note that the first version above has a named capture group, whereas the second and third patterns, generated by nc, have an un-named capture group and some non-capturing groups. All three match the same subject text.
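To check that the three pattern strings capture the same text, they can be passed to base R's regexpr with perl=TRUE. This is a minimal sketch that does not use nc; the subject string and the helper name capture.group1 are invented for illustration.

```r
# Hypothetical subject, not taken from the original examples.
subject <- "variable: some value"
patterns <- c(
  "variable: (?<variable>.*)",
  "(?:variable: (.*))",
  "(?:variable: (?:(.*)))")
# Illustrative helper: return the text matched by the first capture group.
capture.group1 <- function(pat, subject){
  m <- regexpr(pat, subject, perl=TRUE)
  start <- attr(m, "capture.start")[1, 1]
  len <- attr(m, "capture.length")[1, 1]
  substring(subject, start, start + len - 1)
}
captured <- sapply(patterns, capture.group1, subject, USE.NAMES=FALSE)
print(captured)
```

Each pattern contains exactly one capture group, so looking at the first group is enough for this comparison.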

Another example:

show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
#> List of 3
#>  $ : chr "Alignment (?<Alignment>[0-9]+)"
#>  $ : chr "(?:Alignment ([0-9]+))"
#>  $ : chr "(?:Alignment (?:([0-9]+)))"

Another example:

show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
#> List of 3
#>  $ : chr "Chromosome:\t+(?<Chromosome>.*)"
#>  $ : chr "(?:Chromosome:\t+(.*))"
#>  $ : chr "(?:Chromosome:\t+(?:(.*)))"
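
Patterns built with nc::field can be passed directly to a matching function such as nc::capture_first_vec. The sketch below assumes the nc package is installed; the subject strings and the names chrom.subjects and chrom.dt are invented for illustration.

```r
# Invented subjects: field name, a colon, one or more tabs, then the value.
chrom.subjects <- c("Chromosome:\tchr1", "Chromosome:\t\tchrX")
chrom.dt <- nc::capture_first_vec(
  chrom.subjects,
  nc::field("Chromosome", ":\t+", ".*"))
# The result should have a single column named Chromosome.
print(chrom.dt)
```

The field name appears only once in the code, yet it serves both as literal text to match and as the name of the output column.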

nc::quantifier for fewer parentheses

Another helper function is nc::quantifier, which makes patterns easier to read by reducing the number of parentheses required to define sub-patterns with quantifiers. For example, all three patterns below create an optional non-capturing group which contains a named capture group:

show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",                #regex string
  list(list("-", chromEnd="[0-9]+"), "?"),    #nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?"))#quantifier helper function
#> List of 3
#>  $ : chr "(?:-(?<chromEnd>[0-9]+))?"
#>  $ : chr "(?:(?:-([0-9]+))?)"
#>  $ : chr "(?:(?:-([0-9]+))?)"

Another example with a named capture group inside an optional non-capturing group:

show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
#> List of 3
#>  $ : chr "(?: (?<name>[^,}]+))?"
#>  $ : chr "(?:(?: ([^,}]+))?)"
#>  $ : chr "(?:(?: ([^,}]+))?)"
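
As a sketch of how such an optional group behaves during matching, the code below applies nc::quantifier to two invented genomic position strings, only one of which has an end position. It assumes the nc package is installed; the chrom and chromStart group names are illustrative choices.

```r
# Invented subjects: the second one has no "-end" part.
pos.subjects <- c("chr10:213054000-213055000", "chrM:111000")
pos.dt <- nc::capture_first_vec(
  pos.subjects,
  chrom="chr.*?",
  ":",
  chromStart="[0-9]+",
  nc::quantifier("-", chromEnd="[0-9]+", "?"))
print(pos.dt)
```

When the optional group does not match, the chromEnd column should contain an empty string for that subject.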

nc::alternatives for simplified alternation

We also provide a helper function for defining regex patterns with alternation. The following three patterns are equivalent:

show.patterns(
  "(?:(?<first>bar+)|(?<second>fo+))",
  list(first="bar+", "|", second="fo+"),
  nc::alternatives(first="bar+", second="fo+"))
#> List of 3
#>  $ : chr "(?:(?<first>bar+)|(?<second>fo+))"
#>  $ : chr "(?:(bar+)|(fo+))"
#>  $ : chr "(?:(bar+)|(fo+))"
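
A small usage sketch, assuming the nc package is installed: each subject below (invented for illustration) matches exactly one of the two alternatives, so only one of the two output columns should be non-empty per row.

```r
alt.subjects <- c("barrr", "foo")
alt.dt <- nc::capture_first_vec(
  alt.subjects,
  nc::alternatives(first="bar+", second="fo+"))
# One column per named alternative; the one that did not match is "".
print(alt.dt)
```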

nc::alternatives_with_shared_groups for alternatives with identical named sub-pattern groups

Sometimes each alternative is just a re-arrangement of the same sub-patterns. For example, consider the following subjects, each of which is a date in one of two formats.

subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")

In each of the two formats, the month consists of three lower-case letters, the day consists of two digits, and the year consists of four digits. Is there a single pattern that can match each of these subjects? Yes, such a pattern can be defined using the code below:

pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(american=list(month, " ", day, ", ", year)),
  list(european=list(day, " ", month, " ", year)))

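In the code above, we used nc::alternatives_with_shared_groups, which requires two kinds of arguments: named arguments (month, day, year), which define the sub-pattern groups shared by the alternatives, and un-named arguments (the last two), each of which defines one alternative pattern in terms of those group names.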

The pattern can be used for matching, and the result is a data table with one column for each unique name:

(match.dt <- nc::capture_first_vec(subject.vec, pattern))
#>        american  month   day  year    european
#>          <char> <char> <int> <int>      <char>
#> 1: mar 17, 1983    mar    17  1983            
#> 2:                 sep    26  2017 26 sep 2017
#> 3:                 mar    17  1984 17 mar 1984
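
Because the american and european columns are empty for the alternative that did not match, they can be used to recover which format each subject was written in. The sketch below repeats the definitions above so that it runs on its own, and assumes the nc package is installed; the names style.dt and matched.style are illustrative.

```r
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(american=list(month, " ", day, ", ", year)),
  list(european=list(day, " ", month, " ", year)))
style.dt <- nc::capture_first_vec(subject.vec, pattern)
# A non-empty american column means the first alternative matched.
matched.style <- ifelse(style.dt$american != "", "american", "european")
print(matched.style)
```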

After having parsed the dates into these three columns, we can add a date column:

Sys.setlocale(locale="C")#to recognize months in English.
#> [1] "C"
match.dt[, date := data.table::as.IDate(
  paste(month, day, year), format="%b %d %Y")]
print(match.dt, class=TRUE)
#>        american  month   day  year    european       date
#>          <char> <char> <int> <int>      <char>     <IDat>
#> 1: mar 17, 1983    mar    17  1983             1983-03-17
#> 2:                 sep    26  2017 26 sep 2017 2017-09-26
#> 3:                 mar    17  1984 17 mar 1984 1984-03-17
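
If changing the locale is undesirable, one alternative (a sketch, not the approach used above) is to convert the month abbreviation to a number with base R's month.abb constant and then build an ISO date string. The values below are copied from the table above so that the block runs on its own; the variable names are illustrative.

```r
month <- c("mar", "sep", "mar")
day <- c(17L, 26L, 17L)
year <- c(1983L, 2017L, 1984L)
# month.abb is c("Jan", "Feb", ...); lower-case it to match the subjects.
month.int <- match(month, tolower(month.abb))
iso.date <- as.Date(sprintf("%04d-%02d-%02d", year, month.int, day))
print(iso.date)
```

This avoids locale-dependent parsing of %b, at the cost of supporting only English three-letter abbreviations.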

Another example is parsing given and family names, in two different formats:

nc::capture_first_vec(
  c("Toby Dylan Hocking","Hocking, Toby Dylan"),
  nc::alternatives_with_shared_groups(
    family="[A-Z][a-z]+",
    given="[^,]+",
    list(given_first=list(given, " ", family)),
    list(family_first=list(family, ", ", given))
  )
)
#>           given_first      given  family        family_first
#>                <char>     <char>  <char>              <char>
#> 1: Toby Dylan Hocking Toby Dylan Hocking                    
#> 2:                    Toby Dylan Hocking Hocking, Toby Dylan
