Introduction to tidypmc

Chris Stubben

August 26, 2024

The tidypmc package parses XML documents in the Open Access subset of PubMed Central. Download the full text using pmc_xml, then parse it with the functions below.

R function	Description
`pmc_text`	Split section paragraphs into sentences with full path to subsection titles
`pmc_caption`	Split figure, table and supplementary material captions into sentences
`pmc_table`	Convert table nodes into a list of tibbles
`pmc_reference`	Format references cited into a tibble
`pmc_metadata`	List journal and article metadata in front node

pmc_text splits paragraphs into sentences and includes the full path to each subsection title. It also removes any tables, figures or formulas nested within paragraph tags, replaces superscripted references with bracketed numbers, and adds carets and underscores to other superscripts and subscripts.
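As a rough illustration of the superscript and subscript handling described above, the following sketch marks sup and sub nodes in a made-up paragraph with xml2. It assumes the caret or underscore is simply prefixed to the node text; the paragraph and variable names are hypothetical, and pmc_text may differ in its details.

```r
library(xml2)

# Hypothetical paragraph with one superscript and one subscript
p <- read_xml("<p>Cells reached 10<sup>8</sup> CFU/ml in H<sub>2</sub>O.</p>")

# Prefix superscripts with a caret and subscripts with an underscore
sup_nodes <- xml_find_all(p, "//sup")
xml_text(sup_nodes) <- paste0("^", xml_text(sup_nodes))
sub_nodes <- xml_find_all(p, "//sub")
xml_text(sub_nodes) <- paste0("_", xml_text(sub_nodes))

marked <- xml_text(p)
marked
#  [1] "Cells reached 10^8 CFU/ml in H_2O."
```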

pmc_caption splits figure, table and supplementary material captions into sentences.

pmc_table formats tables by collapsing multiline headers, expanding rowspan and colspan attributes and adding subheadings into a new column.

Use collapse_rows to join column names and cell values into a semicolon-delimited string, which can then be searched using the functions in the next section.
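To make the collapsed format concrete, here is a base R sketch that joins name=value pairs for each row of a toy table. The table values and the helper collapse_row_sketch are hypothetical, and skipping cells equal to the na string is an assumption about the behavior, not the package's own code.

```r
# Toy table (made-up values) with a subheading column and one empty cell
tab <- data.frame(
  subheading = c("Group A", "Group A"),
  Gene = c("YPO2439-2442", "YPO0279-0283"),
  Note = c("toy value", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join "name=value" pairs for one row, skipping empty cells
collapse_row_sketch <- function(row, na = "-") {
  values <- unlist(row)
  keep <- !is.na(values) & values != "" & values != na
  paste(paste0(names(values)[keep], "=", values[keep]), collapse = "; ")
}

# Apply the helper to every row of the toy table
collapsed <- vapply(
  seq_len(nrow(tab)),
  function(i) collapse_row_sketch(tab[i, ]),
  character(1)
)
collapsed
#  [1] "subheading=Group A; Gene=YPO2439-2442; Note=toy value"
#  [2] "subheading=Group A; Gene=YPO0279-0283"
```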

pmc_reference extracts the id, pmid, authors, year, title, journal, volume, pages, and DOIs from reference tags.

Searching text

Several functions search within the pmc_text or collapsed pmc_table output. separate_text uses the stringr package to extract every match of a regular expression, such as the runs of five or more nucleotide codes below.

separate_text(txt, "[ATCGN]{5,}")
#  # A tibble: 9 × 5
#    match                 section                                             paragraph sentence text 
#    <chr>                 <chr>                                                   <int>    <int> <chr>
#  1 ACGCAATCGTTTTCNT      Results and Discussion; Computational discovery of…         2        3 A 16…
#  2 AAACGTTTNCGT          Results and Discussion; Computational discovery of…         2        4 It i…
#  3 TGATAATGATTATCATTATCA Results and Discussion; Computational discovery of…         2        5 A 21…
#  4 GATAATGATAATCATTATC   Results and Discussion; Computational discovery of…         2        6 It i…
#  5 TGANNNNNNTCAA         Results and Discussion; Computational discovery of…         2        7 A 15…
#  6 TTGATN                Results and Discussion; Computational discovery of…         2        8 It i…
#  7 NATCAA                Results and Discussion; Computational discovery of…         2        8 It i…
#  8 GTTAATTAA             Results and Discussion; Computational discovery of…         3        4 The …
#  9 GTTAATTAATGT          Results and Discussion; Computational discovery of…         3        5 An A…
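The extraction step can be approximated directly with stringr. This sketch runs the same pattern over three made-up sentences and keeps the index of the sentence each match came from; the sentences and the matches data frame are hypothetical, and separate_text itself returns more columns.

```r
library(stringr)

# Made-up sentences standing in for the text column of pmc_text output
sentences <- c(
  "The box TGANNNNNNTCAA was found upstream.",
  "No motif is reported here.",
  "Half sites TTGATN and NATCAA appear in one sentence."
)

# One character vector of matches per sentence
hits <- str_extract_all(sentences, "[ATCGN]{5,}")

# One row per match, keeping the index of the sentence it came from
matches <- data.frame(
  match = unlist(hits),
  sentence = rep(seq_along(sentences), lengths(hits)),
  stringsAsFactors = FALSE
)
matches
```

The second sentence contributes no rows, while the third contributes two, which is why the output of separate_text can repeat a sentence number.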

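A few wrappers search pre-defined patterns and add an extra step to expand matched ranges. separate_refs matches references within brackets using \\[[0-9, -]+\\] and expands ranges like [7-9]. After expansion, each cited reference gets its own row, so filtering on id returns every sentence that cites that reference, even when it appears only inside a range.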

x <- separate_refs(txt)
x
#  # A tibble: 93 × 6
#        id match section    paragraph sentence text                                                   
#     <dbl> <chr> <chr>          <int>    <int> <chr>                                                  
#   1     1 [1]   Background         1        1 Yersinia pestis is the etiological agent of plague, al…
#   2     2 [2]   Background         1        3 To produce a transmissible infection, Y. pestis coloni…
#   3     3 [3]   Background         1        9 However, a few bacilli are taken up by tissue macropha…
#   4     4 [4,5] Background         1       10 Residence in this niche also facilitates the bacteria'…
#   5     5 [4,5] Background         1       10 Residence in this niche also facilitates the bacteria'…
#   6     6 [6]   Background         2        1 A DNA microarray is able to determine simultaneous cha…
#   7     7 [7-9] Background         2        2 We and others have measured the gene expression profil…
#   8     8 [7-9] Background         2        2 We and others have measured the gene expression profil…
#   9     9 [7-9] Background         2        2 We and others have measured the gene expression profil…
#  10    10 [10]  Background         2        2 We and others have measured the gene expression profil…
#  # ℹ 83 more rows
filter(x, id == 8)
#  # A tibble: 5 × 6
#       id match           section                                             paragraph sentence text 
#    <dbl> <chr>           <chr>                                                   <int>    <int> <chr>
#  1     8 [7-9]           Background                                                  2        2 We a…
#  2     8 [8-13,15]       Background                                                  2        4 In o…
#  3     8 [7-13,15,19-21] Results and Discussion                                      2        1 Rece…
#  4     8 [7-9]           Results and Discussion; Virulence genes in respons…         3        1 As d…
#  5     8 [8-10]          Methods; Collection of microarray expression data           1        6 The …
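The range expansion itself is simple to sketch in base R. The helper expand_refs below is hypothetical and only illustrates the idea of turning one bracketed match into individual reference numbers; it is not the code used by separate_refs.

```r
# Hypothetical helper: expand one bracketed citation into reference numbers
expand_refs <- function(match) {
  # Drop brackets and spaces, then split on commas: "7-13", "15", "19-21"
  parts <- strsplit(gsub("\\[|\\]| ", "", match), ",")[[1]]
  ids <- lapply(parts, function(part) {
    ends <- as.integer(strsplit(part, "-")[[1]])
    # A range has two endpoints; a single number has one
    if (length(ends) == 2) seq(ends[1], ends[2]) else ends
  })
  unlist(ids)
}

expand_refs("[7-13,15,19-21]")
#  [1]  7  8  9 10 11 12 13 15 19 20 21
8 %in% expand_refs("[7-13,15,19-21]")
#  [1] TRUE
```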

separate_tags expands locus tag ranges such as YPO2439-2442 into one row per tag.

collapse_rows(tab1, na="-") %>%
  separate_tags("YPO")
#  # A tibble: 270 × 5
#     id      match        table     row text                                                          
#     <chr>   <chr>        <chr>   <int> <chr>                                                         
#   1 YPO2439 YPO2439-2442 Table 1     1 subheading=Iron uptake or heme synthesis; Potential operon (r…
#   2 YPO2440 YPO2439-2442 Table 1     1 subheading=Iron uptake or heme synthesis; Potential operon (r…
#   3 YPO2441 YPO2439-2442 Table 1     1 subheading=Iron uptake or heme synthesis; Potential operon (r…
#   4 YPO2442 YPO2439-2442 Table 1     1 subheading=Iron uptake or heme synthesis; Potential operon (r…
#   5 YPO0279 YPO0279-0283 Table 1     2 subheading=Iron uptake or heme synthesis; Potential operon (r…
#   6 YPO0280 YPO0279-0283 Table 1     2 subheading=Iron uptake or heme synthesis; Potential operon (r…
#   7 YPO0281 YPO0279-0283 Table 1     2 subheading=Iron uptake or heme synthesis; Potential operon (r…
#   8 YPO0282 YPO0279-0283 Table 1     2 subheading=Iron uptake or heme synthesis; Potential operon (r…
#   9 YPO0283 YPO0279-0283 Table 1     2 subheading=Iron uptake or heme synthesis; Potential operon (r…
#  10 YPO1529 YPO1529-1532 Table 1     3 subheading=Iron uptake or heme synthesis; Potential operon (r…
#  # ℹ 260 more rows
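A base R sketch of the same idea for locus tags follows. The helper expand_tags is hypothetical; it assumes a range is written as a prefix, a zero-padded start number, a hyphen and an end number, as in the matches above.

```r
# Hypothetical helper: expand "YPO0279-0283" into individual locus tags
expand_tags <- function(match, prefix = "YPO") {
  # Remove the prefix, then split the remaining "0279-0283" on the hyphen
  ends <- strsplit(sub(prefix, "", match, fixed = TRUE), "-")[[1]]
  width <- nchar(ends[1])  # keep the zero padding of the first tag
  nums <- seq(as.integer(ends[1]), as.integer(ends[length(ends)]))
  paste0(prefix, formatC(nums, width = width, flag = "0"))
}

expand_tags("YPO0279-0283")
#  [1] "YPO0279" "YPO0280" "YPO0281" "YPO0282" "YPO0283"
```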

Using `xml2`

The pmc_* functions use the xml2 package for parsing and may fail in some situations, so it helps to know how to work with an xml_document directly. Use cat and as.character to print the full contents of nodes returned by xml_find_all.

library(xml2)
refs <- xml_find_all(doc, "//ref")
refs[1]
#  {xml_nodeset (1)}
#  [1] <ref id="B1">\n  <citation citation-type="journal">\n    <person-group person-group-type="aut ...
cat(as.character(refs[1]))
#  <ref id="B1">
#    <citation citation-type="journal">
#      <person-group person-group-type="author">
#        <name>
#          <surname>Perry</surname>
#          <given-names>RD</given-names>
#        </name>
#        <name>
#          <surname>Fetherston</surname>
#          <given-names>JD</given-names>
#        </name>
#      </person-group>
#      <article-title>Yersinia pestis--etiologic agent of plague</article-title>
#      <source>Clin Microbiol Rev</source>
#      <year>1997</year>
#      <volume>10</volume>
#      <fpage>35</fpage>
#      <lpage>66</lpage>
#      <pub-id pub-id-type="pmid">8993858</pub-id>
#    </citation>
#  </ref>
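If a pmc_* function fails on a document, individual fields can be pulled from a node like the one above with XPath queries. The sketch below parses an inline copy of that ref node so it runs without doc; the helper first_text and the column layout are hypothetical and are not the implementation of pmc_reference.

```r
library(xml2)

# Inline copy of the <ref> node printed above, so the sketch runs without `doc`
ref <- read_xml('<ref id="B1">
  <citation citation-type="journal">
    <person-group person-group-type="author">
      <name><surname>Perry</surname><given-names>RD</given-names></name>
      <name><surname>Fetherston</surname><given-names>JD</given-names></name>
    </person-group>
    <article-title>Yersinia pestis--etiologic agent of plague</article-title>
    <source>Clin Microbiol Rev</source>
    <year>1997</year>
    <volume>10</volume>
    <fpage>35</fpage>
    <lpage>66</lpage>
    <pub-id pub-id-type="pmid">8993858</pub-id>
  </citation>
</ref>')

# Hypothetical helper: text of the first node matching an XPath (NA if absent)
first_text <- function(node, xpath) xml_text(xml_find_first(node, xpath))

surnames <- xml_text(xml_find_all(ref, ".//surname"))
initials <- xml_text(xml_find_all(ref, ".//given-names"))

ref_row <- data.frame(
  id = xml_attr(ref, "id"),
  pmid = first_text(ref, ".//pub-id[@pub-id-type='pmid']"),
  authors = paste(surnames, initials, collapse = ", "),
  year = as.integer(first_text(ref, ".//year")),
  title = first_text(ref, ".//article-title"),
  journal = first_text(ref, ".//source"),
  volume = first_text(ref, ".//volume"),
  pages = paste(first_text(ref, ".//fpage"), first_text(ref, ".//lpage"), sep = "-"),
  stringsAsFactors = FALSE
)
ref_row$authors
#  [1] "Perry RD, Fetherston JD"
```

Because xml_find_first returns a missing node when nothing matches, first_text yields NA for absent fields such as a DOI instead of raising an error.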

Many journals use superscripts for cited references, so in the extracted text the numbers run directly into the preceding word, as in results9 below.

# doc1 <- pmc_xml("PMC6385181")
doc1 <- read_xml(system.file("extdata/PMC6385181.xml", package = "tidypmc"))
gsub(".*\\. ", "", xml_text( xml_find_all(doc1, "//sec/p"))[2])
#  [1] "RNA-seq identifies the most relevant genes and RT-qPCR validates its results9, especially in the field of environmental and host adaptation10,11 and antimicrobial response12."

Find the tags using xml_find_all and then update the nodes by adding brackets or other text.

bib <- xml_find_all(doc1, "//xref[@ref-type='bibr']")
bib[1]
#  {xml_nodeset (1)}
#  [1] <xref ref-type="bibr" rid="CR1">1</xref>
xml_text(bib) <- paste0(" [", xml_text(bib), "]")
bib[1]
#  {xml_nodeset (1)}
#  [1] <xref ref-type="bibr" rid="CR1"> [1]</xref>

The text is now separated from the reference. Note that pmc_text adds the brackets by default.

gsub(".*\\. ", "", xml_text( xml_find_all(doc1, "//sec/p"))[2])
#  [1] "RNA-seq identifies the most relevant genes and RT-qPCR validates its results [9], especially in the field of environmental and host adaptation [10], [11] and antimicrobial response [12]."

Genes, species and many other terms are often included within italic tags. You can mark these nodes using the same approach as above, or simply list all the names in italics and search the text or tables for matches, for example the three-letter gene names below.

library(tibble)
library(dplyr)  # provides %>%, count and filter used below
x <- xml_name(xml_find_all(doc, "//*"))
tibble(tag=x) %>%
  count(tag, sort=TRUE)
#  # A tibble: 84 × 2
#     tag               n
#     <chr>         <int>
#   1 td              398
#   2 given-names     388
#   3 name            388
#   4 surname         388
#   5 italic          235
#   6 pub-id          129
#   7 tr              117
#   8 xref            108
#   9 year             80
#  10 article-title    77
#  # ℹ 74 more rows
it <- xml_text(xml_find_all(doc, "//sec//p//italic"), trim=TRUE)
it2 <- tibble(italic=it) %>%
  count(italic, sort=TRUE)
it2
#  # A tibble: 53 × 2
#     italic              n
#     <chr>           <int>
#   1 Y. pestis          46
#   2 in vitro            5
#   3 E. coli             4
#   4 psaEFABC            3
#   5 r                   3
#   6 Yersinia            2
#   7 Yersinia pestis     2
#   8 cis                 2
#   9 fur                 2
#  10 n                   2
#  # ℹ 43 more rows
filter(it2, nchar(italic) == 3)
#  # A tibble: 8 × 2
#    italic     n
#    <chr>  <int>
#  1 cis        2
#  2 fur        2
#  3 cys        1
#  4 hmu        1
#  5 ybt        1
#  6 yfe        1
#  7 yfu        1
#  8 ymt        1
separate_text(txt, c("fur", "cys", "hmu", "ybt", "yfe", "yfu", "ymt"))
#  # A tibble: 9 × 5
#    match section                                                             paragraph sentence text 
#    <chr> <chr>                                                                   <int>    <int> <chr>
#  1 ymt   Results and Discussion; Virulence genes in response to multiple en…         3        4 The …
#  2 fur   Results and Discussion; Clustering analysis and functional classif…         3        2 It i…
#  3 yfe   Results and Discussion; Clustering analysis and functional classif…         3        4 Gene…
#  4 hmu   Results and Discussion; Clustering analysis and functional classif…         3        4 Gene…
#  5 yfu   Results and Discussion; Clustering analysis and functional classif…         3        4 Gene…
#  6 ybt   Results and Discussion; Clustering analysis and functional classif…         3        4 Gene…
#  7 cys   Results and Discussion; Clustering analysis and functional classif…         4        2 Gene…
#  8 cys   Results and Discussion; Clustering analysis and functional classif…         4        3 Clus…
#  9 fur   Methods; Gel mobility shift analysis of Fur binding                         1        1 The …
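The other option mentioned above, marking italic nodes in place, follows the same pattern as the bracketed references. This sketch uses a made-up paragraph and asterisks as the marker; both are assumptions chosen for illustration.

```r
library(xml2)

# Hypothetical paragraph; a real document would come from pmc_xml
p <- read_xml("<p>The <italic>fur</italic> gene in <italic>Y. pestis</italic>.</p>")

# Wrap the text of every italic node in asterisks
it_nodes <- xml_find_all(p, "//italic")
xml_text(it_nodes) <- paste0("*", xml_text(it_nodes), "*")

marked_italic <- xml_text(p)
marked_italic
#  [1] "The *fur* gene in *Y. pestis*."
```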
