Consider the following vector, which contains genome position strings:
pos.vec <- c(
"chr10:213,054,000-213,055,000",
"chrM:111,000",
"chr1:110-111 chr2:220-222") # two possible matches.
To capture the first genome position in each string, we use the following syntax. The first argument is the subject character vector, and the other arguments are pasted together to make a capturing regular expression. Each named argument generates a capture group; the R argument name is used for the column name of the result.
(chr.dt <- nc::capture_first_vec(
pos.vec,
chrom="chr.*?",
":",
chromStart="[0-9,]+"))
#> chrom chromStart
#> 1: chr10 213,054,000
#> 2: chrM 111,000
#> 3: chr1 110
str(chr.dt)
#> Classes 'data.table' and 'data.frame': 3 obs. of 2 variables:
#> $ chrom : chr "chr10" "chrM" "chr1"
#> $ chromStart: chr "213,054,000" "111,000" "110"
#> - attr(*, ".internal.selfref")=<externalptr>
We can add type conversion functions on the same line as each named argument:
keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
(int.dt <- nc::capture_first_vec(
pos.vec,
chrom="chr.*?",
":",
chromStart="[0-9,]+", keep.digits))
#> chrom chromStart
#> 1: chr10 213054000
#> 2: chrM 111000
#> 3: chr1 110
str(int.dt)
#> Classes 'data.table' and 'data.frame': 3 obs. of 2 variables:
#> $ chrom : chr "chr10" "chrM" "chr1"
#> $ chromStart: int 213054000 111000 110
#> - attr(*, ".internal.selfref")=<externalptr>
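To see what the conversion function does on its own, here is a minimal base R sketch that applies keep.digits directly to the captured text. The helper keep.numeric is hypothetical (it is not part of nc) and is shown only as one possible alternative for values too large for an R integer.

```r
# Same conversion function as in the text: drop non-digits, then convert.
keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
start.int <- keep.digits(c("213,054,000", "111,000", "110"))
print(start.int)

# R integers cannot exceed .Machine$integer.max (2147483647), so a
# double-based variant may be safer for larger values.
# keep.numeric is a hypothetical helper, not part of nc.
keep.numeric <- function(x)as.numeric(gsub("[^0-9]", "", x))
big.num <- keep.numeric("3,000,000,000")
print(big.num)
```

Any function that maps a character vector to a vector of the same length could presumably be used in the same position after a named pattern.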
Below we use list variables to create patterns which are re-usable, and we use an un-named list to generate a non-capturing optional group:
pos.pattern <- list("[0-9,]+", keep.digits)
range.pattern <- list(
chrom="chr.*?",
":",
chromStart=pos.pattern,
list(
"-",
chromEnd=pos.pattern
), "?")
nc::capture_first_vec(pos.vec, range.pattern)
#> chrom chromStart chromEnd
#> 1: chr10 213054000 213055000
#> 2: chrM 111000 NA
#> 3: chr1 110 111
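In summary, nc::capture_first_vec takes a variable number of arguments. The first argument is the subject character vector, and the other arguments specify the pattern as character strings, functions, and/or lists. Each named pattern generates a capture group, all patterns are pasted together in the order they appear, and each function converts the text captured by the preceding named pattern.
To see the generated regular expression pattern string, call nc::var_args_list with the variable number of arguments that specify the pattern: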
nc::var_args_list(range.pattern)
#> $fun.list
#> $fun.list$chrom
#> function (x)
#> x
#> <bytecode: 0x0000000013333c60>
#> <environment: namespace:base>
#>
#> $fun.list$chromStart
#> function(x)as.integer(gsub("[^0-9]", "", x))
#> <bytecode: 0x0000000013b0c1f8>
#>
#> $fun.list$chromEnd
#> function(x)as.integer(gsub("[^0-9]", "", x))
#> <bytecode: 0x0000000013b0c1f8>
#>
#>
#> $pattern
#> [1] "(?:(chr.*?):([0-9,]+)(?:-([0-9,]+))?)"
The generated regex is the pattern element of the resulting list above. The other element, fun.list, indicates the names and type conversion functions to use with the capture groups.
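To make the roles of the pattern string and the conversion functions concrete, the following base R sketch applies the generated pattern by hand with regexec and regmatches. It is only an illustration of the idea, under the assumption that the first match is wanted; it is not how nc is implemented internally.

```r
# Subjects and pattern string copied from the text above.
pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222")
pattern <- "(?:(chr.*?):([0-9,]+)(?:-([0-9,]+))?)"
# regexec() with perl=TRUE locates the first match and its capture groups;
# regmatches() extracts the text: full match first, then one entry per group.
match.list <- regmatches(pos.vec, regexec(pattern, pos.vec, perl=TRUE))
# Stack into a character matrix with one row per subject.
match.mat <- do.call(rbind, match.list)
colnames(match.mat) <- c("match", "chrom", "chromStart", "chromEnd")
# Same conversion function as keep.digits in the text.
keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
# An optional group that did not match yields "", which converts to NA.
base.df <- data.frame(
  chrom=match.mat[, "chrom"],
  chromStart=keep.digits(match.mat[, "chromStart"]),
  chromEnd=keep.digits(match.mat[, "chromEnd"]),
  stringsAsFactors=FALSE)
print(base.df)
```

The manual version needs the group order and column names kept in sync by hand, which is the bookkeeping that the named arguments take care of.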
The default is to stop with an error if any subject does not match:
bad.vec <- c(bad="does not match", pos.vec)
nc::capture_first_vec(bad.vec, range.pattern)
#> Error in stop_for_na(make.na): subject 1 did not match regex below; to output missing rows use nomatch.error=FALSE
#> (?:(chr.*?):([0-9,]+)(?:-([0-9,]+))?)
Sometimes you want to instead report a row of NA when a subject does not match. In that case, use nomatch.error=FALSE:
nc::capture_first_vec(bad.vec, range.pattern, nomatch.error=FALSE)
#> chrom chromStart chromEnd
#> 1: <NA> NA NA
#> 2: chr10 213054000 213055000
#> 3: chrM 111000 NA
#> 4: chr1 110 111
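By default, nc uses the PCRE regex engine. Other choices include ICU and RE2. Each engine has different features, which are discussed in my R Journal paper.
The engine is configurable via the engine argument or the nc.engine option. In the example below, the emoji property in the pattern is supported by ICU, whereas PCRE and RE2 stop with an error: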
u.subject <- "a\U0001F60E#"
u.pattern <- list(emoji="\\p{EMOJI_Presentation}")
old.opt <- options(nc.engine="ICU")
nc::capture_first_vec(u.subject, u.pattern)
#> emoji
#> 1: <U+0001F60E>
nc::capture_first_vec(u.subject, u.pattern, engine="PCRE")
#> Warning in regexpr(L$pattern, subject.vec, perl = TRUE): PCRE pattern compilation error
#> 'unknown property name after \P or \p'
#> at '}))'
#> Error in value[[3L]](cond): (?:(\p{EMOJI_Presentation}))
#> when matching pattern above with PCRE engine, an error occured: invalid regular expression '(?:(\p{EMOJI_Presentation}))'
nc::capture_first_vec(u.subject, u.pattern, engine="RE2")
#> Error in value[[3L]](cond): (?:(\p{EMOJI_Presentation}))
#> when matching pattern above with RE2 engine, an error occured: bad character class range: \p{EMOJI_Presentation}
options(old.opt)
We also provide nc::capture_first_df, which extracts text from several columns of a data.frame, using a different regular expression for each column. The first argument is the input data.frame, and each named argument after it results in a call to nc::capture_first_vec on one column of the input data.frame. The argument name specifies the column to use as the subject in nc::capture_first_vec. The argument value specifies the pattern to use with nc::capture_first_vec, in list/character/function format as explained in the previous section.
This function can greatly simplify the code required to create numeric data columns from character data columns. For example, consider the following data, which was output by the sacct program.
(sacct.df <- data.frame(
Elapsed = c(
"07:04:42", "07:04:42", "07:04:49",
"00:00:00", "00:00:00"),
JobID=c(
"13937810_25",
"13937810_25.batch",
"13937810_25.extern",
"14022192_[1-3]",
"14022204_[4]"),
stringsAsFactors=FALSE))
#> Elapsed JobID
#> 1 07:04:42 13937810_25
#> 2 07:04:42 13937810_25.batch
#> 3 07:04:49 13937810_25.extern
#> 4 00:00:00 14022192_[1-3]
#> 5 00:00:00 14022204_[4]
Say we want to filter by the total Elapsed time (which is reported as hours:minutes:seconds), and base job id (which is the number before the underscore in the JobID column). We could start by converting those character columns to integers via:
int.pattern <- list("[0-9]+", as.integer)
range.pattern <- list(
"\\[",
task1=int.pattern,
list(
"-",#begin optional end of range.
taskN=int.pattern
), "?", #end is optional.
"\\]")
nc::capture_first_df(sacct.df, JobID=range.pattern, nomatch.error=FALSE)
#> Elapsed JobID task1 taskN
#> 1: 07:04:42 13937810_25 NA NA
#> 2: 07:04:42 13937810_25.batch NA NA
#> 3: 07:04:49 13937810_25.extern NA NA
#> 4: 00:00:00 14022192_[1-3] 1 3
#> 5: 00:00:00 14022204_[4] 4 NA
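The result shown above is a data table with an additional column for each capture group. Here nomatch.error=FALSE is needed because the first three JobID values contain no bracketed range, so they yield rows of NA. Next, we define another pattern that matches either one task ID or the previously defined range pattern: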
task.pattern <- list(
"_",
list(
task=int.pattern,
"|",#either one task(above) or range(below)
range.pattern))
nc::capture_first_df(sacct.df, JobID=task.pattern)
#> Elapsed JobID task task1 taskN
#> 1: 07:04:42 13937810_25 25 NA NA
#> 2: 07:04:42 13937810_25.batch 25 NA NA
#> 3: 07:04:49 13937810_25.extern 25 NA NA
#> 4: 00:00:00 14022192_[1-3] NA 1 3
#> 5: 00:00:00 14022204_[4] NA 4 NA
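The alternation above leaves some groups unmatched in every row, which is why each row has NA in either task or task1. The following base R sketch shows the same effect with a hand-written regex. The pattern string is an assumption written for illustration and is not copied from nc output.

```r
job.vec <- c(
  "13937810_25", "13937810_25.batch",
  "14022192_[1-3]", "14022204_[4]")
# Hand-written pattern (an assumption, not nc output): underscore, then
# either one task ID or a bracketed range whose end is optional.
alt.pattern <- "_(?:([0-9]+)|\\[([0-9]+)(?:-([0-9]+))?\\])"
alt.list <- regmatches(job.vec, regexec(alt.pattern, job.vec, perl=TRUE))
# Drop the full match (first column), keeping one column per group.
alt.mat <- do.call(rbind, alt.list)[, -1, drop=FALSE]
colnames(alt.mat) <- c("task", "task1", "taskN")
# Groups that took no part in the match come back as "",
# which the integer conversion turns into NA.
storage.mode(alt.mat) <- "integer"
print(alt.mat)
```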
Below we match the complete JobID column:
job.pattern <- list(
job=int.pattern,
task.pattern,
list(
"[.]",
type=".*"
), "?")
nc::capture_first_df(sacct.df, JobID=job.pattern)
#> Elapsed JobID job task task1 taskN type
#> 1: 07:04:42 13937810_25 13937810 25 NA NA
#> 2: 07:04:42 13937810_25.batch 13937810 25 NA NA batch
#> 3: 07:04:49 13937810_25.extern 13937810 25 NA NA extern
#> 4: 00:00:00 14022192_[1-3] 14022192 NA 1 3
#> 5: 00:00:00 14022204_[4] 14022204 NA 4 NA
Below we match the Elapsed column with a different regex:
elapsed.pattern <- list(
hours=int.pattern,
":",
minutes=int.pattern,
":",
seconds=int.pattern)
nc::capture_first_df(sacct.df, JobID=job.pattern, Elapsed=elapsed.pattern)
#> Elapsed JobID job task task1 taskN type hours
#> 1: 07:04:42 13937810_25 13937810 25 NA NA 7
#> 2: 07:04:42 13937810_25.batch 13937810 25 NA NA batch 7
#> 3: 07:04:49 13937810_25.extern 13937810 25 NA NA extern 7
#> 4: 00:00:00 14022192_[1-3] 14022192 NA 1 3 0
#> 5: 00:00:00 14022204_[4] 14022204 NA 4 NA 0
#> minutes seconds
#> 1: 4 42
#> 2: 4 42
#> 3: 4 49
#> 4: 0 0
#> 5: 0 0
Overall the result is another data table with an additional column for each capture group.
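With integer columns available, the filtering goal stated earlier (total Elapsed time and base job id) becomes ordinary arithmetic and subsetting. The sketch below uses a plain data.frame named task.df as a stand-in for the result above, with values copied from the printed output; the names task.df, total.seconds, and long.df, as well as the one-hour threshold, are illustrative assumptions.

```r
# Stand-in for the capture_first_df result above, keeping only the
# columns needed for filtering (values copied from the printed output).
task.df <- data.frame(
  job=c(13937810L, 13937810L, 13937810L, 14022192L, 14022204L),
  hours=c(7L, 7L, 7L, 0L, 0L),
  minutes=c(4L, 4L, 4L, 0L, 0L),
  seconds=c(42L, 42L, 49L, 0L, 0L))
# Total elapsed time in seconds.
task.df$total.seconds <- with(
  task.df, hours*3600L + minutes*60L + seconds)
# Keep rows for one base job id that ran longer than one hour.
long.df <- subset(task.df, total.seconds > 3600L & job == 13937810L)
print(long.df)
```

The same expressions should work on the data table returned by nc::capture_first_df, since its new columns are already integers.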