Release date: October 22, 2020
pdfOCR 1.0.2 is already the third release of our newest product.
It brings some important improvements which allow you to process documents more precisely. These are:
- Refinement of symbol positions based on the HOCR data, which fixes output for Thai and some CJK fonts. This is especially important for our pdfCalligraph customers.
  You can turn it on with:
  tesseract4OcrEngineProperties.setUseTxtToImproveHocrParsing(true);
- Configurable image preprocessing. This lets you smooth out fluctuations in a document's brightness, which gives better results for images taken with a camera.
  You can pass the parameters described at http://www.leptonica.org/binarization.html using tesseract4OcrEngineProperties.setImagePreprocessingOptions
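To see why the tile size matters for camera images, here is a minimal, self-contained Java sketch of tile-based thresholding. It is an illustration only, not the Leptonica or pdfOCR implementation: it uses the mean gray value of each tile as the threshold, leaves out threshold smoothing entirely, and the class and method names (TiledThresholdSketch, sampleImage, binarize, countInk) are made up for this example.

```java
final class TiledThresholdSketch {

    /** Hypothetical 8x2 grayscale image: bright left half, shadowed right half, one ink pixel in each. */
    static int[][] sampleImage() {
        int[][] gray = new int[2][8];
        for (int y = 0; y < 2; y++) {
            for (int x = 0; x < 8; x++) {
                gray[y][x] = (x < 4) ? 220 : 120; // background brightness
            }
        }
        gray[0][1] = 140; // ink on the bright side
        gray[0][5] = 40;  // ink on the shadowed side
        return gray;
    }

    /** Marks a pixel as ink when it is darker than the mean of the tile it falls in. */
    static boolean[][] binarize(int[][] gray, int tileW, int tileH) {
        int h = gray.length;
        int w = gray[0].length;
        int rows = (h + tileH - 1) / tileH;
        int cols = (w + tileW - 1) / tileW;

        // Accumulate the sum and pixel count of every tile.
        double[][] sum = new double[rows][cols];
        int[][] count = new int[rows][cols];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                sum[y / tileH][x / tileW] += gray[y][x];
                count[y / tileH][x / tileW]++;
            }
        }

        // Compare each pixel with the mean of its own tile.
        boolean[][] ink = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                double threshold = sum[y / tileH][x / tileW] / count[y / tileH][x / tileW];
                ink[y][x] = gray[y][x] < threshold;
            }
        }
        return ink;
    }

    /** Counts the pixels classified as ink. */
    static int countInk(boolean[][] ink) {
        int n = 0;
        for (boolean[] row : ink) {
            for (boolean b : row) {
                if (b) n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int[][] img = sampleImage();
        // One tile covering the whole image behaves like a single global threshold.
        System.out.println("global: " + countInk(binarize(img, 8, 2))); // prints "global: 9"
        // 4x2 tiles let each half of the image get its own threshold.
        System.out.println("tiled: " + countInk(binarize(img, 4, 2)));  // prints "tiled: 2"
    }
}
```

With a single global threshold, the whole shadowed half is misread as ink (9 pixels). With smaller tiles, only the two real ink pixels are detected. The same trade-off applies when choosing the tile size parameters for real scans, although the actual algorithm and its defaults are the ones described on the Leptonica page.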
Downloads
| | |
|---|---|
| iText pdfOCR – 1.0.2 (Java API) | N/A |
| iText pdfOCR – 1.0.2 (.NET API) | N/A |
Changelog
Improvements
- Combine HOCR and TXT outputs for more precise text recognition
- Add possibility to set image preprocessing properties (adaptive threshold tile size, threshold smoothing)
Installation Instructions
Examples (latest ones)
FAQ (latest ones)
- Which languages are supported in pdfOCR?
- What does TextPositioning in pdfOCR do?
- Could not find a glyph corresponding to Unicode character
- pdfOCR: If your scanned document has a mixture of sections with paragraphs and tables, what is a recommended strategy here?
- pdfOCR: Is handwriting recognition supported?