This package provides a few useful functions for keyword searching of PDF files, built on the pdftools package developed by rOpenSci.
There are currently two user-facing functions in this package. The first, keyword_search,
takes a single pdf and searches it for keywords. The second, keyword_directory,
performs the same search over a directory of pdfs.
keyword_search
Example
The package comes with two pdf files from arXiv to use as test cases. Below is an example of using the keyword_search
function.
library(pdfsearch)
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
result <- keyword_search(file,
                         keyword = c('measurement', 'error'),
                         path = TRUE)
head(result)
#> # A tibble: 6 x 5
#> keyword page_num line_num line_text token_text
#> <chr> <int> <int> <list> <list>
#> 1 measurement 1 5 <chr [1]> <list [1]>
#> 2 measurement 1 9 <chr [1]> <list [1]>
#> 3 measurement 1 19 <chr [1]> <list [1]>
#> 4 measurement 1 21 <chr [1]> <list [1]>
#> 5 measurement 2 28 <chr [1]> <list [1]>
#> 6 measurement 2 31 <chr [1]> <list [1]>
head(result$line_text, n = 2)
#> [[1]]
#> [1] "Often in surveys, key items are subject to measurement errors. Given just the"
#>
#> [[2]]
#> [1] "with high quality measurements of the error-prone survey items. We"
The location of the keyword match, including page number and line number, and the actual line of text are returned by default.
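Because the result is a regular tibble, the matches can be summarized with base R. Below is a minimal sketch of tallying matches per keyword and page; the result object is mocked here with a plain data.frame mirroring the columns shown above, so the sketch runs without the pdf.

```r
# Mock of the keyword/page columns from the result above; with a real
# result, pass the tibble's columns to table() directly
result <- data.frame(
  keyword  = c("measurement", "measurement", "error", "error"),
  page_num = c(1, 1, 1, 2)
)

# Cross-tabulate the number of matches by keyword and page
counts <- table(result$keyword, result$page_num)
counts
```

With a real result, `table(result$keyword, result$page_num)` gives a quick overview of where each keyword concentrates in the document.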
It may be useful to extract not only the line of text containing the keyword, but also the surrounding lines, to provide additional context when reviewing the results. This can be done with the surround_lines
argument as follows:
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
result <- keyword_search(file,
                         keyword = c('measurement', 'error'),
                         path = TRUE, surround_lines = 1)
head(result)
#> # A tibble: 6 x 5
#> keyword page_num line_num line_text token_text
#> <chr> <int> <int> <list> <list>
#> 1 measurement 1 5 <chr [3]> <list [3]>
#> 2 measurement 1 9 <chr [3]> <list [3]>
#> 3 measurement 1 19 <chr [3]> <list [3]>
#> 4 measurement 1 21 <chr [3]> <list [3]>
#> 5 measurement 2 28 <chr [3]> <list [3]>
#> 6 measurement 2 31 <chr [3]> <list [3]>
head(result$line_text, n = 2)
#> [[1]]
#> [1] "Abstract"
#> [2] "Often in surveys, key items are subject to measurement errors. Given just the"
#> [3] "data, it can be difficult to determine the distribution of this error process, and"
#>
#> [[2]]
#> [1] "some settings, however, analysts have access to a data source on different individuals"
#> [2] "with high quality measurements of the error-prone survey items. We"
#> [3] "present a data fusion framework for leveraging this information to improve inferences"
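When surround_lines is used, each element of line_text is a character vector. For easier reading or downstream processing, each context block can be collapsed into a single string. A minimal sketch, with line_text mocked to match the structure shown above:

```r
# Mocked line_text list (one match with surround_lines = 1); with a
# real result, use result$line_text directly
line_text <- list(
  c("Abstract",
    "Often in surveys, key items are subject to measurement errors. Given just the",
    "data, it can be difficult to determine the distribution of this error process, and")
)

# Collapse each context block into one string
context <- vapply(line_text, paste, character(1), collapse = " ")
```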
Typeset PDF files commonly contain words that wrap from one line to the next and are hyphenated. An example of this is shown in the following image.
(Image: example of a hyphenated word wrapping across two lines)
Hyphenated words that wrap across lines are treated as two separate words, so a keyword that would match the unhyphenated word can be missed. Fortunately, there is a remove_hyphen
argument within the keyword_search
function that removes the hyphen at the end of a line and rejoins the word with its remainder on the next line of the document. Below is an example of this working, showing the results before and after using the remove_hyphen
argument. By default this argument is set to TRUE.
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
result_hyphen <- keyword_search(file,
                                keyword = c('measurement'),
                                path = TRUE, remove_hyphen = FALSE)
result_remove_hyphen <- keyword_search(file,
                                       keyword = c('measurement'),
                                       path = TRUE, remove_hyphen = TRUE)
nrow(result_hyphen)
#> [1] 37
nrow(result_remove_hyphen)
#> [1] 41
You’ll notice that removing the hyphens added four additional keyword matches to the results. These were cases where the word “measurement” wrapped across two lines and was hyphenated (see the image above for an example of this).
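The extra matches can be located by comparing the line numbers of the two result sets, for example with setdiff(). Sketched below with mock line-number vectors standing in for the line_num columns of the two results (and ignoring page numbers for simplicity):

```r
# Mock line numbers standing in for result_hyphen$line_num and
# result_remove_hyphen$line_num; with real results, compare the
# actual line_num columns
lines_hyphen        <- c(5, 9, 19, 21)
lines_remove_hyphen <- c(5, 9, 19, 21, 35)

# Lines matched only after hyphens are removed
setdiff(lines_remove_hyphen, lines_hyphen)
#> [1] 35
```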
One specific note about removing hyphens in multiple column PDF files: this feature is still experimental and often does not work well with such layouts. Use the remove_hyphen
argument with caution on multiple column PDF files.
Using the tokenizers R package, it is also possible to split the document into individual words. This may be most useful when the goal is a text analysis rather than a keyword search. Below is an example showing the first page of the text converted to words. By default, hyphenated words at the ends of lines are rejoined (see the previous section for a description of this behavior).
token_result <- convert_tokens(file, path = TRUE)[[1]]
head(token_result)
#> [[1]]
#> [1] "data" "fusion" "for" "correcting"
#> [5] "measurement" "errors" "tracy" "schifeling"
#> [9] "jerome" "p" "reiter" "maria"
#> [13] "deyoreo" "arxiv" "1610.00147v1" "stat.me"
#> [17] "1" "oct" "2016" "abstract"
#> [21] "often" "in" "surveys" "key"
#> [25] "items" "are" "subject" "to"
#> [29] "measurement" "errors" "given" "just"
#> [33] "the" "data" "it" "can"
#> [37] "be" "difficult" "to" "determine"
#> [41] "the" "distribution" "of" "this"
#> [45] "error" "process" "and" "hence"
#> [49] "to" "obtain" "accurate" "inferences"
#> [53] "that" "involve" "the" "error"
#> [57] "prone" "variables" "in" "some"
#> [61] "settings" "however" "analysts" "have"
#> [65] "access" "to" "a" "data"
#> [69] "source" "on" "different" "in"
#> [73] "dividuals" "with" "high" "quality"
#> [77] "measurements" "of" "the" "error"
#> [81] "prone" "survey" "items" "we"
#> [85] "present" "a" "data" "fusion"
#> [89] "framework" "for" "leveraging" "this"
#> [93] "information" "to" "improve" "infer"
#> [97] "ences" "in" "the" "error"
#> [101] "prone" "survey" "the" "basic"
#> [105] "idea" "is" "to" "posit"
#> [109] "models" "about" "the" "rates"
#> [113] "at" "which" "individuals" "make"
#> [117] "errors" "coupled" "with" "models"
#> [121] "for" "the" "values" "reported"
#> [125] "when" "errors" "are" "made"
#> [129] "this" "can" "avoid" "the"
#> [133] "unrealistic" "assumption" "of" "conditional"
#> [137] "independence" "typically" "used" "in"
#> [141] "data" "fusion" "we" "apply"
#> [145] "the" "approach" "on" "the"
#> [149] "re" "ported" "values" "of"
#> [153] "educational" "attainments" "in" "the"
#> [157] "american" "community" "survey" "using"
#> [161] "the" "national" "survey" "of"
#> [165] "college" "graduates" "as" "the"
#> [169] "high" "quality" "data" "source"
#> [173] "in" "doing" "so" "we"
#> [177] "account" "for" "the" "informative"
#> [181] "sampling" "design" "used" "to"
#> [185] "select" "the" "national" "survey"
#> [189] "of" "college" "graduates" "we"
#> [193] "also" "present" "a" "process"
#> [197] "for" "assessing" "the" "sensitivity"
#> [201] "of" "various" "analyses" "to"
#> [205] "different" "choices" "for" "the"
#> [209] "measurement" "error" "models" "supplemental"
#> [213] "material" "is" "available" "online"
#> [217] "key" "words" "fusion" "imputation"
#> [221] "measurement" "error" "missing" "survey"
#> [225] "this" "research" "was" "supported"
#> [229] "by" "the" "national" "science"
#> [233] "foundation" "under" "award" "ses"
#> [237] "11" "31897" "the" "authors"
#> [241] "wish" "to" "thank" "seth"
#> [245] "sanders" "for" "his" "input"
#> [249] "on" "informative" "prior" "specifications"
#> [253] "and" "mauricio" "sadinle" "for"
#> [257] "discussion" "that" "improved" "the"
#> [261] "strategy" "for" "accounting" "for"
#> [265] "the" "informative" "sample" "design"
#> [269] "1"
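Once the document is tokenized, a word-frequency count is a natural next step for a text analysis. A minimal sketch, with the tokens mocked by a short vector; a real analysis would pool pages with unlist(token_result) and typically drop stop words such as "the" and "of" first.

```r
# Mock of a page's tokens; with a real result, use
# unlist(token_result) to pool tokens across pages
tokens <- c("data", "fusion", "measurement", "errors",
            "measurement", "error", "data", "fusion")

# Count and sort word frequencies
word_counts <- sort(table(tokens), decreasing = TRUE)
```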
The convert_tokens
function can also be applied to the text returned by a keyword search. This can be useful in tandem with the surround_lines argument when the matches will feed into a text analysis. These tokens are included by default when calling the keyword_search
function.
file <- system.file('pdf', '1501.00450.pdf', package = 'pdfsearch')
result <- keyword_search(file,
                         keyword = c('repeated measures', 'mixed effects'),
                         path = TRUE, surround_lines = 1)
result
#> # A tibble: 10 x 5
#> keyword page_num line_num line_text token_text
#> <chr> <int> <int> <list> <list>
#> 1 repeated measures 1 24 <chr [3]> <list [3]>
#> 2 repeated measures 2 57 <chr [3]> <list [3]>
#> 3 repeated measures 2 108 <chr [3]> <list [3]>
#> 4 repeated measures 2 110 <chr [3]> <list [3]>
#> 5 repeated measures 2 125 <chr [3]> <list [3]>
#> 6 repeated measures 6 444 <chr [3]> <list [3]>
#> 7 repeated measures 6 445 <chr [3]> <list [3]>
#> 8 repeated measures 6 474 <chr [3]> <list [3]>
#> 9 repeated measures 6 485 <chr [3]> <list [3]>
#> 10 repeated measures 9 708 <chr [3]> <list [3]>
keyword_directory
Example
The keyword_directory
function is useful when you have a directory of many pdf files that you want to search for a series of keywords in a single function call. This can be particularly useful in the context of a research synthesis, or to screen studies for characteristics to include in a meta-analysis.
There are two files from arXiv that come with the package in a single directory; these will be used as the example use case here.
directory <- system.file('pdf', package = 'pdfsearch')
result <- keyword_directory(directory,
                            keyword = c('repeated measures', 'mixed effects',
                                        'error'),
                            surround_lines = 1, full_names = TRUE)
head(result)
#> ID pdf_name keyword page_num line_num
#> 1 1 1501.00450.pdf repeated measures 1 24
#> 2 1 1501.00450.pdf repeated measures 2 57
#> 3 1 1501.00450.pdf repeated measures 2 108
#> 4 1 1501.00450.pdf repeated measures 2 110
#> 5 1 1501.00450.pdf repeated measures 2 125
#> 6 1 1501.00450.pdf repeated measures 6 444
#> line_text
#> 1 introduce more sophisticated experimental designs, specifi- only would we miss potentially beneficial effects, we may also, cally the repeated measures design, including the crossover get false confidence about lack of negative effects. Statistical, design and related variants, to increase KPI sensitivity with power increases with larger effect size, and smaller variances.
#> 2 a limitation to any online experimentation platform, where within-subject variation. We also discuss practical considfast, iterations and testing many ideas can reap the most erations to repeated measures design, with variants to the, rewards. crossover design to study the carry over effect, including the
#> 3 In this paper we extend the idea further by employing the weeks. To facilitate our illustration, in all the derivation, repeated measures design in different stages of treatment in this section we assume all users appear in all periods,, assignment. The traditional A/B test can be analyzed us- i.e. no missing measurement. We also restrict ourselves
#> 4 assignment. The traditional A/B test can be analyzed us- i.e. no missing measurement. We also restrict ourselves, ing the repeated measures analysis, reporting a “per week” to metrics that are defined as simple average and assume, treatment effect, as show in row 3 “parallel” design in ta- treatment and control have the same sample size. We furble
#> 5 each user serves as his/her own control in the measurement. fixed effects in the model in this section. This way, various, In fact, the crossover design is a type of repeated measures designs considered can be examined in the same framework, design commonly used in biomedical research to control for and easily compared.
#> 6 to realize infrequent users are more likely to have missing 5.1 Review of Existing Methods, values and the absence in a specific time window can still It is common to analyze data from repeated measures design, provide information on the user behavior and in reality there with the repeated measures ANOVA model and the F-test,
#> token_text
#> 1 introduce, more, sophisticated, experimental, designs, specifi, only, would, we, miss, potentially, beneficial, effects, we, may, also, cally, the, repeated, measures, design, including, the, crossover, get, false, confidence, about, lack, of, negative, effects, statistical, design, and, related, variants, to, increase, kpi, sensitivity, with, power, increases, with, larger, effect, size, and, smaller, variances
#> 2 a, limitation, to, any, online, experimentation, platform, where, within, subject, variation, we, also, discuss, practical, considfast, iterations, and, testing, many, ideas, can, reap, the, most, erations, to, repeated, measures, design, with, variants, to, the, rewards, crossover, design, to, study, the, carry, over, effect, including, the
#> 3 in, this, paper, we, extend, the, idea, further, by, employing, the, weeks, to, facilitate, our, illustration, in, all, the, derivation, repeated, measures, design, in, different, stages, of, treatment, in, this, section, we, assume, all, users, appear, in, all, periods, assignment, the, traditional, a, b, test, can, be, analyzed, us, i.e, no, missing, measurement, we, also, restrict, ourselves
#> 4 assignment, the, traditional, a, b, test, can, be, analyzed, us, i.e, no, missing, measurement, we, also, restrict, ourselves, ing, the, repeated, measures, analysis, reporting, a, per, week, to, metrics, that, are, defined, as, simple, average, and, assume, treatment, effect, as, show, in, row, 3, parallel, design, in, ta, treatment, and, control, have, the, same, sample, size, we, furble
#> 5 each, user, serves, as, his, her, own, control, in, the, measurement, fixed, effects, in, the, model, in, this, section, this, way, various, in, fact, the, crossover, design, is, a, type, of, repeated, measures, designs, considered, can, be, examined, in, the, same, framework, design, commonly, used, in, biomedical, research, to, control, for, and, easily, compared
#> 6 to, realize, infrequent, users, are, more, likely, to, have, missing, 5.1, review, of, existing, methods, values, and, the, absence, in, a, specific, time, window, can, still, it, is, common, to, analyze, data, from, repeated, measures, design, provide, information, on, the, user, behavior, and, in, reality, there, with, the, repeated, measures, anova, model, and, the, f, test
The full_names
argument is needed here to specify that the full file path should be used to access the pdf files. If the search is run directly from the repository (i.e. when using an R project in RStudio), then full_names
could be set to FALSE.
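Since keyword_directory returns one data frame covering every file, a cross-tabulation of matches by file and keyword is often the first summary wanted, for example when screening studies. A sketch with a mock data.frame mirroring the pdf_name and keyword columns shown above:

```r
# Mock of the pdf_name/keyword columns from the result above; with a
# real result, pass the columns to table() directly
result <- data.frame(
  pdf_name = c("1501.00450.pdf", "1501.00450.pdf", "1610.00147.pdf"),
  keyword  = c("repeated measures", "error", "error")
)

# Matches per file and keyword
tab <- table(result$pdf_name, result$keyword)
tab
```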
Currently there are a handful of limitations, mostly around how pdfs are read into R by the pdftools R package. When a pdf has a multiple column layout, an extracted line spans the entire page across both columns. This can lead to fragmented text that may not give the full context, even when using the surround_lines
argument.
Another limitation concerns keyword searches for multi-word phrases. If the phrase falls on a single line, the match is returned; however, if the phrase spans multiple lines, the current implementation will not return a match.
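One rough workaround, sketched below with invented line numbers, is to search for each word of a phrase separately and flag places where the second word lands on the line directly after the first (with real results, the vectors would come from the line_num columns of two single-word searches):

```r
# Mock line numbers for two single-word searches, e.g. 'repeated'
# and 'measures' searched separately
lines_first  <- c(24, 57, 107)
lines_second <- c(25, 90, 110)

# A phrase split across a line break shows up as the second word
# appearing one line after the first
intersect(lines_first + 1, lines_second)
#> [1] 25
```

This only flags candidate locations; the surrounding text would still need to be inspected to confirm a true phrase match.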
The package also has a simple Shiny app that can be launched with the following command:
run_shiny()