OCRmyPDF adds an OCR text layer to PDF files so that they can be searched easily.
Usage:
    ocrmypdf                      # it's a scriptable command line program
       -l eng+fra                 # it supports multiple languages
       --rotate-pages             # it can fix pages that are misrotated
       --deskew                   # it can deskew crooked PDFs!
       --title "My PDF"           # it can change output metadata
       --jobs 4                   # it uses multiple cores by default
       --output-type pdfa         # it produces PDF/A by default
       input_scanned.pdf          # takes PDF input (or images)
       output_searchable.pdf      # produces validated PDF output
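The annotated command above is laid out for reading, so it cannot be pasted into a shell as is: each option sits on its own line with a trailing comment. The following is a minimal sketch of one way to keep the per-option comments in a script that actually runs. It uses only the options shown in the example, and the file names are the example's placeholders, not real files. The guard at the end is an assumption added for safety, so the script only prints the command when OCRmyPDF or the input file is missing.

```shell
#!/bin/sh
# Sketch: the annotated usage example as a runnable POSIX shell script.
# File names are the placeholders from the example, not real files.
input="input_scanned.pdf"
output="output_searchable.pdf"

set --                                  # start from an empty argument list
set -- "$@" -l eng+fra                  # OCR in English and French
set -- "$@" --rotate-pages              # fix misrotated pages
set -- "$@" --deskew                    # straighten crooked scans
set -- "$@" --title "My PDF"            # set the output title metadata
set -- "$@" --jobs 4                    # use up to four parallel workers
set -- "$@" --output-type pdfa          # request PDF/A output
set -- "$@" "$input" "$output"          # input first, output last

# Show the command that would run, with each argument in brackets.
printf 'ocrmypdf'; printf ' [%s]' "$@"; printf '\n'

# Only invoke OCRmyPDF when it is installed and the input file exists.
if command -v ocrmypdf >/dev/null 2>&1 && [ -f "$input" ]; then
    ocrmypdf "$@"
else
    echo "skipping run: ocrmypdf or $input not found" >&2
fi
```

Building the argument list with `set --` keeps arguments that contain spaces, such as the title, intact when they are passed on as `"$@"`.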
Main features:
Generates a searchable PDF/A file from a regular PDF
Places OCR text accurately below the image to ease copy / paste
Keeps the exact resolution of the original embedded images
When possible, inserts OCR information as a “lossless” operation without rendering vector information
Keeps file size about the same
If requested, deskews and/or cleans the image before performing OCR
Validates input and output files
Provides debug mode to enable easy verification of the OCR results
Processes pages in parallel when more than one CPU core is available
Uses Tesseract OCR engine
Supports more than 100 languages recognized by Tesseract
Battle-tested on thousands of PDFs, and backed by a test suite and continuous integration
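Because OCRmyPDF is a scriptable command line program, the optional deskew and clean steps from the list above can be applied to a whole folder of scans. The following is a rough sketch under stated assumptions: the `scans` directory, the `SCAN_DIR` variable, the `_ocr.pdf` naming scheme and the `ocr_output_name` helper are all hypothetical and chosen only for illustration. The `--clean` option depends on the external unpaper tool, so drop it if unpaper is not installed.

```shell
#!/bin/sh
# Hypothetical helper: derive an output name,
# for example scans/a.pdf becomes scans/a_ocr.pdf.
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical directory; override with: SCAN_DIR=/path sh batch.sh
scan_dir="${SCAN_DIR:-./scans}"

for pdf in "$scan_dir"/*.pdf; do
    [ -f "$pdf" ] || continue                     # no matches: the glob stays literal
    case "$pdf" in *_ocr.pdf) continue ;; esac    # skip outputs of an earlier run
    out=$(ocr_output_name "$pdf")
    if command -v ocrmypdf >/dev/null 2>&1; then
        # --deskew and --clean are the optional preprocessing steps;
        # --clean needs the external unpaper tool.
        ocrmypdf --deskew --clean "$pdf" "$out" || echo "failed: $pdf" >&2
    else
        echo "would run: ocrmypdf --deskew --clean $pdf $out"
    fi
done
```

Each file is processed in its own OCRmyPDF run, so a failure on one scan is reported and the loop moves on to the next.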