
Optical Character Recognition with imagerExtra

Shota Ochi

2019-01-25

To do OCR with imagerExtra, you need the R package tesseract, which provides bindings to Tesseract, a powerful optical character recognition (OCR) engine.

If you haven’t installed tesseract yet, see its installation guide.

The ocr function of tesseract works best on images with high contrast, little noise, and horizontal text.

It performs poorly on degraded images, as the example below shows.
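Before going further, it can help to confirm that the tesseract package is usable from R. The snippet below is a minimal sketch, not part of the original vignette: it only reports what it finds and does not install anything. On some platforms the system Tesseract libraries may need to be present before the R package installs, which the installation guide covers.

```r
# Check whether the tesseract R package can be loaded (nothing is installed here).
has_tesseract <- requireNamespace("tesseract", quietly = TRUE)

if (has_tesseract) {
  # tesseract_info() reports the engine version and the installed language data.
  info <- tesseract::tesseract_info()
  message("Tesseract version: ", info$version)
  message("Available languages: ", paste(info$available, collapse = ", "))
} else {
  message('tesseract is not installed; try install.packages("tesseract")')
}
```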

library(imagerExtra)
plot(papers, main = "Original")

OCR(papers) %>% print
[1] ""
OCR_data(papers) %>% print
[1] word       confidence bbox      
<0 rows> (or 0-length row.names)

The OCR and OCR_data functions are wrappers for the ocr and ocr_data functions of tesseract.

We can see that OCR and OCR_data failed to recognize the text “Hello”.

We need to clean the image before passing it to OCR.

# Denoise the image, then binarize it with adaptive thresholding
hello <- DenoiseDCT(papers, 0.01) %>% ThresholdAdaptive(., 0.1, range = c(0,1))
plot(hello, main = "Hello")

OCR(hello) %>% print
[1] "Hello\n"
OCR_data(hello) %>% print
   word confidence       bbox
1 Hello   93.99038 8,9,118,54

We can see that the text “Hello” was recognized, with a confidence of about 94.
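The bbox column in the OCR_data output is a single comma-separated string, which is awkward to work with directly. The following base R sketch splits it into numeric columns. It assumes the four values are ordered x1, y1, x2, y2 in pixels; the column names x1, y1, x2, y2, width, and height are labels chosen here for illustration. The data frame simply re-creates the row printed above, so the snippet runs without tesseract.

```r
# Re-create the row shown in the OCR_data output above.
res <- data.frame(word = "Hello", confidence = 93.99038,
                  bbox = "8,9,118,54", stringsAsFactors = FALSE)

# Split each bbox string on commas and convert the pieces to numbers.
coords <- do.call(rbind, lapply(strsplit(res$bbox, ","), as.numeric))
colnames(coords) <- c("x1", "y1", "x2", "y2")  # assumed ordering

# Attach the coordinates and derive the box size in pixels.
res <- cbind(res[c("word", "confidence")], coords)
res$width  <- res$x2 - res$x1
res$height <- res$y2 - res$y1
print(res)
```

The same steps should apply to a result with many rows, since strsplit and rbind work row by row.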

Using tesseract in combination with imagerExtra enables us to extract text from degraded images.
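The cleaning and recognition steps can be bundled into one function when several images need the same treatment. This is only a sketch: clean_and_ocr is a hypothetical helper, not a function of imagerExtra, and its argument names sdn and k are labels chosen here. The values are passed positionally, exactly as in the example above, and the defaults are the ones that worked for the papers image; other images will likely need different settings.

```r
# Hypothetical helper (not part of imagerExtra): denoise, binarize, then run OCR.
clean_and_ocr <- function(im, sdn = 0.01, k = 0.1) {
  denoised  <- imagerExtra::DenoiseDCT(im, sdn)
  binarized <- imagerExtra::ThresholdAdaptive(denoised, k, range = c(0, 1))
  imagerExtra::OCR(binarized)
}

# Run the helper only when both packages are installed.
if (requireNamespace("imagerExtra", quietly = TRUE) &&
    requireNamespace("tesseract", quietly = TRUE)) {
  print(clean_and_ocr(imagerExtra::papers))
}
```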
