Ocular Historical Document Recognition System



This is a system for automatically extracting fonts from historical documents, built on the OCR system released by the Berkeley NLP group in the United States. Out of the box it supports font extraction for English, but the underlying algorithm looks adaptable to font extraction for Chinese characters as well.


Again, this is not really my area of interest, so I'll just leave the pointer here for anyone who wants to take a look.


I am a huge fan of Ben Marwick. He has so many useful pieces of code for the programming archaeologist or historian!

Edit July 17 1.20 pm: Mea culpa: I originally titled this post, ‘Doing OCR within R’. But, what I’m describing below – that’s not OCR. That’s extracting text from pdfs. It’s very fast and efficient, but it’s not OCR. So, brain fart. But I leave the remainder of the post as it was. For command line OCR (really, actual OCR) on a Mac, see the link to Ben Schmidt’s piece at the bottom. Sorry.

Edit July 17 10 pm: I am now an even bigger fan of Ben’s. He’s updated his script to either a) perform OCR by calling Tesseract from within R or b) grab the text layer from a pdf image. So this post no longer misleads. Thank you Ben!

Optical Character Recognition, or OCR, is something that most historians will need to use at some point when working with digital documents. That is, you will often encounter pdf files of texts that you wish to work with in more detail (digitized newspapers, for instance). Often, there is a layer within the pdf image containing the text already: if you can highlight text by clicking and dragging over the image, you can copy and paste the text from the image. But this is often not the case, or worse, you have tens or hundreds or even thousands of documents to examine. There is commercial software that can do this for you, but it can be quite expensive.

One way of doing OCR on your own machine with free tools is to use Ben Marwick's pdf-2-text-or-csv.r script for the R programming language. Marwick's script uses R as a wrapper for the Xpdf programme from Foolabs. Xpdf is a pdf viewer, much like Adobe Acrobat. Using Xpdf on its own can be quite tricky, so Marwick's script will feed your pdf files to Xpdf, and have Xpdf perform the text extraction. For OCR, the script acts as a wrapper for Tesseract, which is not an easy piece of software to work with. There's a final part to Marwick's script that will pre-process the resulting text files for various kinds of text analysis, but you can ignore that part for now.
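
Stripped to its essence, the wrapper idea is just R handing a command string to the operating system. A minimal sketch (the file name is a placeholder, and it assumes pdftotext is on your PATH):

# R as a thin wrapper around a command-line tool: build the
# command as a string and hand it to the OS. Here pdftotext
# writes my_document.txt alongside the pdf.
system('pdftotext "my_document.pdf"')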

  1. Make sure you have R downloaded and installed on your machine (available from http://www.r-project.org/)
  2. Make sure you have Xpdf downloaded and installed (available from ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-win-3.04.zip ). Make a note of where you unzipped it. In particular, you are looking for the location of the file ‘pdftotext.exe’. Also make sure you know where ‘pdftoppm’ is located (it’s in the same download).
  3. Download and install Tesseract https://code.google.com/p/tesseract-ocr/ 
  4. Download and install Imagemagick http://www.imagemagick.org/
  5. Have a folder with the pdfs you wish to extract text from.
  6. Open R, and paste Marwick’s script into the script editor window.
  7. Make sure you adjust the path for “dest” and the path to “pdftotext.exe” to their correct locations
  8. Run the script! But read the script carefully and make sure you run the bits you need. Ben has commented out the code very well, so it should be fairly straightforward.

Obviously, the above is framed for Windows users. For Mac users, the steps are all the same, except that you use the versions of Xpdf, Tesseract, and Imagemagick built for OS X, and your paths to the other software are going to be different. And of course you’re using R for Mac, which means the ‘shell’ commands have to be swapped to ‘system’! (As of July 2014, the Xpdf file for Mac that you want is at ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-mac-3.04.tar.gz ) I’m not 100% certain of any other Mac/PC differences in the R script – these should only exist at those points where R is calling on other resources (rather than on R packages). Caveat lector, eh?
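
To make that shell/system difference concrete, here is a sketch of how one of the Windows shell() calls in the OCR section below would look on a Mac (the file name is hypothetical, and the tools are assumed to be on your PATH):

# Windows version, as in the script below:
#   shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
# Mac equivalent: swap shell() for system()
i <- "somefile.pdf"  # placeholder file name
system(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook"))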

The full R script may be found at https://gist.github.com/benmarwick/11333467. So here is the section that does the text extraction from pdf images (i.e., you can copy and highlight text in the pdf):

###Note: there's some preprocessing that I (sg) haven't shown here: go see the original gist
 
################# Wait! ####################################
# Before proceeding, make sure you have a copy of pdftotext
# on your computer! Details: https://en.wikipedia.org/wiki/Pdftotext
 
# Tell R what folder contains your 1000s of PDFs
dest <- "G:/somehere/with/many/PDFs"
 
# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)
 
# now there are a few options...
 
############### PDF to TXT #################################
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE) )
 
# where are the txt files you just made?
dest # in this folder
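
Once the text files exist, pulling them back into R for the kind of analysis Marwick's preprocessing section handles might look like this (my sketch, not part of the original script):

# read each extracted .txt file back into R as a character
# vector of lines, named by its source file
txtfiles <- list.files(path = dest, pattern = "txt$", full.names = TRUE)
texts <- lapply(txtfiles, readLines)
names(texts) <- basename(txtfiles)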

And here’s the bit that does the OCR

################# Wait! ####################################
# Before proceeding, make sure you have a copy of Tesseract
# on your computer! Details & download: https://code.google.com/p/tesseract-ocr/
# and a copy of ImageMagick: http://www.imagemagick.org/
# and a copy of pdftoppm on your computer!
# And then after installing those three, restart to
# ensure R can find them on your path.
# And note that this process can be quite slow...
 
# PDF filenames can't have spaces in them for these operations
# so let's get rid of the spaces in the filenames
 
sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})
 
# get the PDF file names without spaces
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)
 
# Now we can do the OCR to the renamed PDF files. Don't worry
# if you get messages like 'Config Error: No display
# font for...' it's nothing to worry about
 
lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), using
  shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("convert *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("tesseract ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })
 
# where are the txt files you just made?
dest # in this folder
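
One caveat I'd add (not in the original script): pdftoppm drops its ocrbook*.ppm page images into the working directory, and the loop never deletes them, so stale pages from one PDF can get swept into the next PDF's tif by the convert *.ppm step. A possible fix is to clear them at the end of each iteration:

# inside the lapply() above, after file.remove() on the tif,
# also delete the intermediate ppm page images so they don't
# leak into the next document's convert step
file.remove(list.files(pattern = "ppm$"))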

Besides showing how to do your own OCR, Marwick’s script shows some of the power of R for doing more than statistics. Mac users might be interested in Ben Schmidt’s tutorial ‘Command-line OCR on a Mac’ from his digital history graduate seminar at Northeastern University, online at http://benschmidt.org/dighist13/?page_id=129.


Source: Extracting Text from PDFs; Doing OCR; all within R


바로: These days you can usually extract text directly from a PDF, but for various reasons extraction is sometimes difficult. There is of course plenty of commercial software that supports this, but for those of us without money, how do we extract text without breaking the law? The author describes how to do it with the free tool R.



This is a project by the Center for Knowledge Structuring at the University of Tokyo (東京大学知の構造化センター) that applied digital humanities methods to Shisō (思想), Iwanami Shoten's journal of thought and philosophy and one of Japan's most representative journals. As the first stage of the digital methodology, the paper issues of Shisō were digitized using OCR. The digitized Shisō text was then subjected to corpus analysis using MIMA Search (MIMAサーチ), an ontology system the center built itself, and the results were visualized.


Structuring the Japanese journal Shisō (「思想」の構造化): http://www.cks.u-tokyo.ac.jp/p1.html
