Introduction to epubr

The epubr package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame.

E-book formatting varies so widely that no function can curate parsed content from an arbitrary collection of e-books, in completely general form, into a single, consistently formatted output containing the same variables for every book.

EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with epubr.

Text is read ‘as is’. Additional text cleaning should be performed by the user at their discretion, such as with functions from packages like tm or qdap.
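
As a minimal sketch of the kind of cleaning left to the user, the following uses base R only, rather than tm or qdap. The sample string and the clean_text helper are hypothetical and are not part of epubr; they simply collapse runs of whitespace, including newlines, and trim the ends.

```r
# Hypothetical sample of parsed text with line breaks and uneven spacing.
raw_text <- "CHAPTER X\nLetter,  Dr. Seward   to Hon. Arthur Holmwood \n"

# Hypothetical helper (not an epubr function): collapse any run of
# whitespace into a single space, then trim leading and trailing space.
clean_text <- function(x) {
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

clean_text(raw_text)
```

A helper like this could be applied to the text column of the nested data frame, though how much cleaning is appropriate depends on the analysis.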

Read EPUB files

Bram Stoker’s Dracula, sourced from Project Gutenberg, is a good example of an EPUB file with unfortunate formatting. The first thing that stands out is that the section naming convention, item followed by ordered digits, does not differentiate sections like the book preamble from the chapters. The numbering also starts in an odd place, at item6. But it is actually worse than this. Sections are not broken at chapter boundaries; they can begin and end in the middle of chapters!

These annoyances aside, the metadata and contents can still be read into a convenient table. Text mining analyses can still be performed on the overall book, if not so easily on individual chapters.

Here a single file is read with epub. The returned primary data frame and the book text data frame nested within its data column are shown.

library(epubr)
file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
#> # A tibble: 1 x 9
#>   rights  identifier   creator  title language subject date  source  data 
#>   <chr>   <chr>        <chr>    <chr> <chr>    <chr>   <chr> <chr>   <lis>
#> 1 Public~ http://www.~ Bram St~ Drac~ en       Horror~ 1995~ http:/~ <tib~

x$data[[1]]
#> # A tibble: 15 x 4
#>    section       text                                          nword nchar
#>    <chr>         <chr>                                         <int> <int>
#>  1 item6         "The Project Gutenberg EBook of Dracula, by ~ 11252 60972
#>  2 item7         "But I am not in heart to describe beauty, f~ 13740 71798
#>  3 item8         "“ ‘Lucy, you are an honest-hearted girl, I ~ 12356 65522
#>  4 item9         "CHAPTER VIIIMINA MURRAY’S JOURNAL\nSame day~ 12042 62724
#>  5 item10        "CHAPTER X\nLetter, Dr. Seward to Hon. Arthu~ 12599 66678
#>  6 item11        "Once again we went through that ghastly ope~ 11919 62949
#>  7 item12        "CHAPTER XIVMINA HARKER’S JOURNAL\n23 Septem~ 12003 62234
#>  8 item13        "CHAPTER XVIDR. SEWARD’S DIARY—continued\nIT~ 13812 72903
#>  9 item14        "“Thus when we find the habitation of this m~ 13201 69779
#> 10 item15        "“I see,” I said. “You want big things that ~ 12706 66921
#> 11 item16        "CHAPTER XXIIIDR. SEWARD’S DIARY\n3 October.~ 11818 61550
#> 12 item17        "CHAPTER XXVDR. SEWARD’S DIARY\n11 October, ~ 12989 68564
#> 13 item18        " \nLater.—Dr. Van Helsing has returned. He ~  8356 43464
#> 14 item19        "End of the Project Gutenberg EBook of Dracu~  2669 18541
#> 15 coverpage-wr~ ""                                                0     0

The file argument may be a vector of EPUB files. There is one row for each book.

EPUB metadata

The above examples jump right in, but it can be helpful to inspect file metadata before reading a large number of books into memory. Formatting may differ across books. It can be helpful to know what fields to expect, the degree of consistency, and what content you may want to drop during the file reading process. epub_meta strictly parses file metadata and does not read the e-book text.

epub_meta(file)
#> # A tibble: 1 x 8
#>   rights    identifier    creator  title language subject date  source    
#>   <chr>     <chr>         <chr>    <chr> <chr>    <chr>   <chr> <chr>     
#> 1 Public d~ http://www.g~ Bram St~ Drac~ en       Horror~ 1995~ http://ww~

This provides the big picture, though it will not reveal the internal breakdown of book section naming conventions that were seen in the first epub example.

file can also be a vector for epub_meta. Whenever file is a vector, the fields (columns) returned are the union of all fields detected across all EPUB files. Any books (rows) that do not have a field found in another book return NA for that row and column.
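
The union-and-fill behavior described above can be illustrated with a small base R sketch. It does not call epubr and is not its implementation; the two metadata lists, their field values, and the fill_row helper are hypothetical.

```r
# Hypothetical metadata for two books with partly different fields.
meta_a <- list(title = "Book A", creator = "Author A", language = "en")
meta_b <- list(title = "Book B", publisher = "Publisher B")

# The output columns are the union of all fields seen across the books.
all_fields <- union(names(meta_a), names(meta_b))

# Hypothetical helper: build a one-row data frame for a book, using NA
# for any field that the book does not have.
fill_row <- function(meta, fields) {
  vals <- vapply(
    fields,
    function(f) if (is.null(meta[[f]])) NA_character_ else meta[[f]],
    character(1)
  )
  as.data.frame(as.list(vals), stringsAsFactors = FALSE)
}

combined <- rbind(fill_row(meta_a, all_fields), fill_row(meta_b, all_fields))
combined
```

Each book keeps its own values, and any field it lacks shows up as NA in its row.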

Additional arguments

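There are three optional arguments that can be provided to epub: fields, to select the metadata fields to return; drop_sections, to filter out unwanted book sections; and chapter_pattern, to attempt to detect which sections of the nested data frame are book chapters.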

Unless you have a collection of well-formatted and similarly formatted EPUB files, these arguments may not be helpful and can be ignored, especially chapter detection.

Select fields

Selecting fields is straightforward. All fields found are returned unless a vector of fields is provided.

epub(file, fields = c("title", "creator", "file"))
#> # A tibble: 1 x 4
#>   title   creator     file         data             
#>   <chr>   <chr>       <chr>        <list>           
#> 1 Dracula Bram Stoker dracula.epub <tibble [15 x 4]>

Note that file was not a field identified in the metadata. This is a special case. Including file will include the basename of the input file. This is helpful when you want to retain file names and source is included in the metadata but may represent something else. Some fields like data and title are always returned and do not need to be specified in fields.

Also, if your e-book does not have a metadata field named title, you can pass an additional argument to ... to map a different, known metadata field to title. E.g., title = "BookTitle". The resulting table always has a title field, but in this case title would be populated with information from the BookTitle metadata field. If the default title field or any other field name passed to the additional title argument does not exist in the file metadata, the output title column falls back on filling in with the same unique file names obtained when requesting the file field.

Drop sections

Filtering out unwanted sections, or rows of the nested data frame, uses a regular expression pattern. Matched rows are dropped. This is where knowing the naming conventions used in the e-books in file, or at least knowing they are satisfactorily consistent and predictable for a collection, helps with removing extraneous clutter.

One section that can be discarded is the cover. For many books it can be helpful to use a pattern like "^(C|c)ov" to drop any sections whose IDs begin with Cov or cov, whether the ID uses that abbreviation or the full word. For this book, cov suffices. The nested data frame has one fewer row than before.

epub(file, drop_sections = "cov")$data[[1]]
#> # A tibble: 14 x 4
#>    section text                                                nword nchar
#>    <chr>   <chr>                                               <int> <int>
#>  1 item6   "The Project Gutenberg EBook of Dracula, by Bram S~ 11252 60972
#>  2 item7   "But I am not in heart to describe beauty, for whe~ 13740 71798
#>  3 item8   "“ ‘Lucy, you are an honest-hearted girl, I know. ~ 12356 65522
#>  4 item9   "CHAPTER VIIIMINA MURRAY’S JOURNAL\nSame day, 11 o~ 12042 62724
#>  5 item10  "CHAPTER X\nLetter, Dr. Seward to Hon. Arthur Holm~ 12599 66678
#>  6 item11  "Once again we went through that ghastly operation~ 11919 62949
#>  7 item12  "CHAPTER XIVMINA HARKER’S JOURNAL\n23 September.—J~ 12003 62234
#>  8 item13  "CHAPTER XVIDR. SEWARD’S DIARY—continued\nIT was j~ 13812 72903
#>  9 item14  "“Thus when we find the habitation of this man-tha~ 13201 69779
#> 10 item15  "“I see,” I said. “You want big things that you ca~ 12706 66921
#> 11 item16  "CHAPTER XXIIIDR. SEWARD’S DIARY\n3 October.—The t~ 11818 61550
#> 12 item17  "CHAPTER XXVDR. SEWARD’S DIARY\n11 October, Evenin~ 12989 68564
#> 13 item18  " \nLater.—Dr. Van Helsing has returned. He has go~  8356 43464
#> 14 item19  "End of the Project Gutenberg EBook of Dracula, by~  2669 18541
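
Before reading many books, it can help to preview which section IDs a pattern would match. The sketch below uses base R's grepl on a hypothetical vector of section IDs. It assumes drop_sections is matched the way an ordinary, case-sensitive R regular expression would be, which should be checked against the package documentation.

```r
# Hypothetical section IDs from a mixed collection of books.
sections <- c("Cover", "coverpage-wrapper", "titlepage", "item6", "recover-notes")

# Anchored pattern: matches only IDs that begin with Cov or cov.
anchored <- grepl("^(C|c)ov", sections)
sections[!anchored]   # sections that would be kept

# Unanchored pattern: matches cov anywhere in the ID, so it also
# catches "recover-notes" and misses the capitalized "Cover".
unanchored <- grepl("cov", sections)
sections[!unanchored]
```

The anchored pattern is the safer choice when section IDs are not known in advance, since an unanchored cov can match in the middle of an unrelated ID.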

Guess chapters

This e-book unfortunately does not have great formatting. For the sake of example, pretend that chapters are known to be sections beginning with item and followed by two digits, using the pattern ^item\\d\\d. This does two things. It adds a new metadata column to the primary data frame called nchap giving the estimated number of chapters in the book. In the nested data frame containing the parsed e-book text, the section column is conditionally mutated to reflect a new, consistent chapter naming convention for the identified chapters and a logical is_chapter column is added.

x <- epub(file, drop_sections = "cov", chapter_pattern = "^item\\d\\d")
x
#> # A tibble: 1 x 10
#>   rights identifier creator title language subject date  source nchap data
#>   <chr>  <chr>      <chr>   <chr> <chr>    <chr>   <chr> <chr>  <int> <li>
#> 1 Publi~ http://ww~ Bram S~ Drac~ en       Horror~ 1995~ http:~    10 <ti~

x$data[[1]]
#> # A tibble: 14 x 5
#>    section text                                     is_chapter nword nchar
#>    <chr>   <chr>                                    <lgl>      <int> <int>
#>  1 item6   "The Project Gutenberg EBook of Dracula~ FALSE      11252 60972
#>  2 item7   "But I am not in heart to describe beau~ FALSE      13740 71798
#>  3 item8   "“ ‘Lucy, you are an honest-hearted gir~ FALSE      12356 65522
#>  4 item9   "CHAPTER VIIIMINA MURRAY’S JOURNAL\nSam~ FALSE      12042 62724
#>  5 ch01    "CHAPTER X\nLetter, Dr. Seward to Hon. ~ TRUE       12599 66678
#>  6 ch02    "Once again we went through that ghastl~ TRUE       11919 62949
#>  7 ch03    "CHAPTER XIVMINA HARKER’S JOURNAL\n23 S~ TRUE       12003 62234
#>  8 ch04    "CHAPTER XVIDR. SEWARD’S DIARY—continue~ TRUE       13812 72903
#>  9 ch05    "“Thus when we find the habitation of t~ TRUE       13201 69779
#> 10 ch06    "“I see,” I said. “You want big things ~ TRUE       12706 66921
#> 11 ch07    "CHAPTER XXIIIDR. SEWARD’S DIARY\n3 Oct~ TRUE       11818 61550
#> 12 ch08    "CHAPTER XXVDR. SEWARD’S DIARY\n11 Octo~ TRUE       12989 68564
#> 13 ch09    " \nLater.—Dr. Van Helsing has returned~ TRUE        8356 43464
#> 14 ch10    "End of the Project Gutenberg EBook of ~ TRUE        2669 18541
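
The effect of the pattern on the section IDs can be reproduced with a short base R sketch. This is an illustration of the result shown above, not epubr's internal code; the variable names are hypothetical.

```r
# Section IDs as they appear in the output above, after dropping the cover.
sections <- paste0("item", 6:19)

# A section counts as a chapter if its ID is item followed by two digits.
is_chapter <- grepl("^item\\d\\d", sections)
nchap <- sum(is_chapter)

# Rename only the matched sections with a consistent, zero-padded label.
renamed <- sections
renamed[is_chapter] <- sprintf("ch%02d", seq_len(nchap))

data.frame(section = renamed, is_chapter = is_chapter)
```

Because item6 through item9 have only one digit, they do not match and keep their original IDs, which is why only ten sections are labeled as chapters.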

Also note that not all books have chapters. Make sure an optional argument makes sense to use with a given e-book.

Some e-books have formatting that puts chapter sections completely out of order, even when they are easily separable from other book sections. This can be another roadblock: you may correctly distinguish chapters from other book sections like cover, title, copyright and acknowledgements pages, but you will number the chapters incorrectly. Other e-books do not even split the text into sections based on natural breaks in the original text, like chapters, but are instead split at arbitrary break points in the complete text.

Ultimately, everything depends on the quality of the EPUB file. Some publishers are better than others. Formatting standards may also change over time.

Unzip EPUB file

Separate from using epub_meta and epub, you can call epub_unzip directly if all you want to do is extract the files from the .epub file archive. By default the archive files are extracted to tempdir() so you may want to change this with the exdir argument.

bookdir <- file.path(tempdir(), "dracula")
epub_unzip(file, exdir = bookdir)
list.files(bookdir, recursive = TRUE)
#>  [1] "META-INF/container.xml"                                                  
#>  [2] "OEBPS/0.css"                                                             
#>  [3] "OEBPS/1.css"                                                             
#>  [4] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-0.htm.html"   
#>  [5] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-1.htm.html"   
#>  [6] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-10.htm.html"  
#>  [7] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-11.htm.html"  
#>  [8] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-12.htm.html"  
#>  [9] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-13.htm.html"  
#> [10] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-2.htm.html"   
#> [11] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-3.htm.html"   
#> [12] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-4.htm.html"   
#> [13] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-5.htm.html"   
#> [14] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-6.htm.html"   
#> [15] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-7.htm.html"   
#> [16] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-8.htm.html"   
#> [17] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@345-h-9.htm.html"   
#> [18] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@images@colophon.png"
#> [19] "OEBPS/@public@vhost@g@gutenberg@html@files@345@345-h@images@cover.jpg"   
#> [20] "OEBPS/content.opf"                                                       
#> [21] "OEBPS/pgepub.css"                                                        
#> [22] "OEBPS/toc.ncx"                                                           
#> [23] "OEBPS/wrap0000.html"                                                     
#> [24] "mimetype"