OCRmyPDF - Add a text layer to PDF documents
Software overview
OCRmyPDF adds an OCR text layer to PDF files so that they can be easily searched.
Usage:
ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
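The annotated invocation above is a single command whose options are spread over several lines for readability. As a minimal sketch of how it might be scripted, the Python snippet below assembles the same argument list and prints it as one shell-ready line. The helper names `build_ocrmypdf_command` and `run_ocrmypdf` are hypothetical and not part of OCRmyPDF; only the flags shown in the usage example are used, and actually running the command assumes the `ocrmypdf` executable is installed and on PATH.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Sequence[str] = ("eng",),
    title: Optional[str] = None,
    jobs: Optional[int] = None,
    rotate_pages: bool = False,
    deskew: bool = False,
    output_type: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for one ocrmypdf call (hypothetical helper)."""
    # Multiple languages are joined with "+", as in "-l eng+fra".
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if rotate_pages:
        cmd.append("--rotate-pages")
    if deskew:
        cmd.append("--deskew")
    if title is not None:
        cmd += ["--title", title]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    # Input and output paths come last, matching the usage example.
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocrmypdf(cmd: List[str]) -> int:
    """Run the command and return its exit code; assumes ocrmypdf is on PATH."""
    return subprocess.run(cmd).returncode


command = build_ocrmypdf_command(
    "input_scanned.pdf",
    "output_searchable.pdf",
    languages=("eng", "fra"),
    title="My PDF",
    jobs=4,
    rotate_pages=True,
    deskew=True,
    output_type="pdfa",
)

# Print the command as a single, correctly quoted shell line.
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems with values such as the title. To perform the OCR itself, one could call `run_ocrmypdf(command)` and check the returned exit code.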
Main features:
- Generates a searchable PDF/A file from a regular PDF
- Places OCR text accurately below the image to ease copy / paste
- Keeps the exact resolution of the original embedded images
- When possible, inserts OCR information as a “lossless” operation without rendering vector information
- Keeps file size about the same
- If requested, deskews and/or cleans the image before performing OCR
- Validates input and output files
- Provides debug mode to enable easy verification of the OCR results
- Processes pages in parallel when more than one CPU core is available
- Uses Tesseract OCR engine
- Supports more than 100 languages recognized by Tesseract
- Battle-tested on thousands of PDFs, with a test suite and continuous integration