regextable extracts regex-based pattern matches from a
data frame or character vector using a pattern lookup table. For each
input row, all matching patterns are returned, along with the matched
substring, an internal row identifier, and additional columns specified
in data_return_cols and regex_return_cols.
Optional metadata from the pattern table can also be included. Multiple
rows may be returned for a single text if it matches multiple
patterns.
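To make the idea concrete, here is a minimal base-R sketch of lookup-table matching. It is an illustration only, not the package's implementation: the toy texts, the two-row lookup table, and the helper match_lookup() are all hypothetical.

```r
# Toy inputs (hypothetical, not the packaged datasets)
texts <- c(
  "HON. SAM GRAVES;Mr. GRAVES",
  "Mr. GRAVES thanked Mr. COSTA",
  "no member named here"
)
lookup <- data.frame(
  pattern = c("sam graves|graves", "jim costa|costa"),
  icpsr = c(20124, 20501),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per (text, pattern) pair that matches
match_lookup <- function(texts, lookup) {
  hits <- list()
  for (i in seq_along(texts)) {
    for (j in seq_len(nrow(lookup))) {
      # regexpr() finds the first match; regmatches() extracts the substring
      m <- regmatches(
        texts[i],
        regexpr(lookup$pattern[j], texts[i], ignore.case = TRUE)
      )
      if (length(m) == 1) {
        hits[[length(hits) + 1]] <- data.frame(
          row_id = i,
          text = texts[i],
          icpsr = lookup$icpsr[j],
          pattern = lookup$pattern[j],
          match = m,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, hits)
}

toy_result <- match_lookup(texts, lookup)
```

The second text matches both patterns and so contributes two rows, while the third text matches nothing and contributes none.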
Install and load the package:
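A minimal sketch, assuming regextable is installed from CRAN under that name (adjust the source if you install it from elsewhere); knitr is loaded only because kable() is used to print tables:

```r
# Assumption: the package is available from CRAN as "regextable"
install.packages("regextable")

library(regextable)
library(knitr)  # provides kable() for printing tables
```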
For demonstration, we use two included datasets:
- members: A lookup table of regex patterns for member names.
- cr2007_03_01: A sample text dataset to search.

| congress | chamber | bioname | pattern | icpsr | state_abbrev | district_code | first_name | last_name |
|---|---|---|---|---|---|---|---|---|
| 110 | President | BUSH, George Walker | george bush\|george walker bush\|bush\|george w bush\|bush\|(^\|senator \|representative )bush\|bush, george\|bush george\|bush, g\|president bush\|g w bush | 99910 | USA | 0 | George | BUSH |
| 110 | House | BONNER, Jr., Josiah Robins (Jo) | josiah bonner\|josiah josiah robins bonner\|bonner\|josiah j bonner\|jo bonner\|jo josiah robins bonner\|jo j bonner\|(^\|senator \|representative )bonner\|bonner, jo\|bonner, josiah\|bonner josiah\|bonner, j\|representative bonner\|j j bonner | 20300 | AL | 1 | Josiah | BONNER |
| 110 | House | ROGERS, Mike Dennis | mike rogers\|mike dennis rogers\|rogers.{1,4}al\|mike d rogers\|michael rogers\|michael dennis rogers\|michael d rogers\|(^\|senator \|representative )rogers{1,4}al\|rogers, michael\|rogers, mike\|rogers mike\|representative rogers{1,4}al\|m d rogers | 20301 | AL | 3 | Mike | ROGERS |
| 110 | House | DAVIS, Artur | artur davis\|davis\|(^\|senator \|representative )davis{1,4}al\|davis, artur\|davis artur\|davis, a\|representative davis{1,4}al | 20302 | AL | 7 | Artur | DAVIS |
| 110 | House | CRAMER, Robert E. (Bud), Jr. | robert cramer\|robert e cramer\|cramer\|bud cramer\|bud e cramer\|cramer\|(^\|senator \|representative )cramer\|cramer, bud\|cramer, robert\|cramer robert\|cramer, r\|cramer, b\|representative cramer\|r e cramer | 29100 | AL | 5 | Robert | CRAMER |

| date | text | header |
|---|---|---|
| 2007-03-01 | HON. SAM GRAVES;Mr. GRAVES | RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT; Congressional Record Vol. 153, No. 35 |
| 2007-03-01 | HON. MARK UDALL;Mr. UDALL | INTRODUCING A CONCURRENT RESOLUTION HONORING THE 50TH ANNIVERSARY OF THE INTERNATIONAL GEOPHYSICAL YEAR (IGY); Congressional Record Vol. 153, No. 35 |
| 2007-03-01 | HON. JAMES R. LANGEVIN;Mr. LANGEVIN | BIOSURVEILLANCE ENHANCEMENT ACT OF 2007; Congressional Record Vol. 153, No. 35 |
| 2007-03-01 | HON. JIM COSTA;Mr. COSTA | A TRIBUTE TO THE LIFE OF MRS. VERNA DUTY; Congressional Record Vol. 153, No. 35 |
| 2007-03-01 | HON. SAM GRAVES;Mr. GRAVES | RECOGNIZING JARRETT MUCK FOR ACHIEVING THE RANK OF EAGLE SCOUT |
The simplest use of extract():
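```r
# Match each row of cr2007_03_01 against every pattern in members
result <- extract(
  data = cr2007_03_01,
  regex_table = members,
  data_return_cols = c("text"),   # columns carried over from the data
  regex_return_cols = c("icpsr")  # columns carried over from the pattern table
)
knitr::kable(head(result))
```

| row_id | text | icpsr | pattern | match |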
|---|---|---|---|---|
| 1 | HON. SAM GRAVES;Mr. GRAVES | 20124 | samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves | SAM GRAVES |
| 2 | HON. MARK UDALL;Mr. UDALL | 29906 | mark udall\|udall\|mark e udall\|udall\|(^\|senator \|representative )udall{1,4}co\|udall, mark\|udall mark\|udall, m\|representative udall{1,4}co\|m e udall | MARK UDALL |
| 3 | HON. JAMES R. LANGEVIN;Mr. LANGEVIN | 20136 | james langevin\|langevin\|james r langevin\|jim langevin\|jim r langevin\|(^\|senator \|representative )langevin\|langevin, jim\|langevin, james\|langevin james\|langevin, j\|representative langevin\|j r langevin | james r langevin |
| 4 | HON. JIM COSTA;Mr. COSTA | 20501 | jim costa\|costa\|james costa\|(^\|senator \|representative )costa\|costa, james\|costa, jim\|costa jim\|costa, j\|representative costa | JIM COSTA |
| 5 | HON. SAM GRAVES;Mr. GRAVES | 20124 | samuel graves\|graves\|sam graves\|(^\|senator \|representative )graves\|graves, sam\|graves, samuel\|graves samuel\|graves, s\|representative graves | SAM GRAVES |

Explanation:

- data: the text dataset to search.
- col_name: which column contains the text (omitted in the call above, so its default is used).
- regex_table: the lookup table of patterns.
- data_return_cols: additional columns from data to include in the result.
- regex_return_cols: additional columns from the pattern table to attach.

Each row in the output corresponds to a detected match and includes both the original text and the matching pattern.

extract() can also filter data by date, remove acronyms
(all-uppercase patterns with 2+ characters), and select specific output
columns. This is useful for more controlled extraction.
Explanation:

- date_col, date_start, date_end: filter rows by date.
- remove_acronyms: skip patterns like “NASA” or “USA”.

You can combine these filters with any subset of columns for flexible outputs.

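To show what the two filters mean, here is a base-R sketch of filtering rows by a date range and dropping acronym patterns. It illustrates the idea only and is not regextable's implementation: the toy data, the helper is_acronym(), and the exact rule (a pattern consisting solely of two or more uppercase letters) are assumptions.

```r
# Toy data (hypothetical); dates are stored as Date objects
docs <- data.frame(
  date = as.Date(c("2007-02-28", "2007-03-01", "2007-03-02")),
  text = c("NASA budget hearing", "Mr. GRAVES spoke", "USA and Mr. COSTA"),
  stringsAsFactors = FALSE
)
patterns <- c("NASA", "USA", "graves", "costa", "A")

# Date filter: keep rows with date_start <= date <= date_end
date_start <- as.Date("2007-03-01")
date_end <- as.Date("2007-03-31")
docs_in_range <- docs[docs$date >= date_start & docs$date <= date_end, ]

# Acronym filter (assumed rule): only uppercase letters, 2 or more characters
is_acronym <- function(p) grepl("^[[:upper:]]{2,}$", p)
kept_patterns <- patterns[!is_acronym(patterns)]
```

Here the first document falls outside the range and is dropped, and "NASA" and "USA" are removed from the pattern list while the single letter "A" is kept because it is shorter than two characters.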
extract() supports parallel processing via the
cl parameter:
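What cl accepts is not spelled out here. In many R packages an argument of that name takes a cluster created with the base parallel package, so the sketch below shows only that generic pattern on toy data. It does not call extract(), and the texts, the pattern, and the worker count are illustrative assumptions.

```r
library(parallel)

# Toy inputs (hypothetical)
texts <- c(
  "HON. SAM GRAVES;Mr. GRAVES",
  "HON. JIM COSTA;Mr. COSTA",
  "HON. MARK UDALL;Mr. UDALL"
)
pattern <- "graves|costa"

# Start two local worker processes (the count is an arbitrary choice)
cl <- makeCluster(2)

# Each worker tests its share of the texts against the pattern
has_match <- unlist(parLapply(
  cl, texts,
  function(x, p) grepl(p, x, ignore.case = TRUE),
  p = pattern
))

# Release the workers when done
stopCluster(cl)
```

Check ?extract to confirm whether cl expects a cluster object like this one or a number of workers.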
regextable is a tool for extracting data from text. By default, extract() handles text cleaning and efficient matching.