pdfOCR: PaddleOCR model support | iText Knowledge Base

The release of pdfOCR 5.0.0 introduced support for pretrained ONNX PaddleOCR and EasyOCR models, adding to the docTR models already supported.

The following code sample shows how to generate a searchable PDF by running OCR with a PaddleOCR model (converted to ONNX format) through pdfOCR’s ONNX-based OCR engine.

After loading the specified input image, the sample builds a detection predictor and a recognition predictor from the PaddleOCR ONNX model files (inference.onnx) and their accompanying configuration files (inference.yml), and then creates an output PDF.

Check the comments in the example for more details.

Compatible PaddleOCR/EasyOCR models already converted to ONNX format are available from our Hugging Face repository.

Java

##GITHUB:https://github.com/itext/itext-publications-examples-java/blob/develop/src/main/java/com/itextpdf/samples/sandbox/pdfocr/onnx/PdfOcrOnnxPaddleOcrExample.java##

C#

##GITHUB:https://github.com/itext/itext-publications-samples-dotnet/blob/develop/itext/itext.samples/itext/samples/sandbox/pdfocr/onnx/PdfOcrOnnxPaddleOcrExample.cs##
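Because each predictor needs both a model file and its configuration file, it can help to check that both are present before running OCR. The following is a minimal sketch in plain Java that does not use the pdfOCR API. It assumes that each model is stored in its own directory containing inference.onnx and inference.yml; the class and method names (PaddleModelDirCheck, missingFiles) are hypothetical and not part of iText.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of pdfOCR: verifies that a model directory
// contains the two files the sample expects.
final class PaddleModelDirCheck {
    static final String MODEL_FILE = "inference.onnx";
    static final String CONFIG_FILE = "inference.yml";

    // Returns the names of the expected files that are absent from modelDir.
    // An empty list means the directory looks complete.
    // Assumption: both files sit directly inside the same directory.
    static List<String> missingFiles(Path modelDir) {
        List<String> missing = new ArrayList<>();
        for (String name : new String[] {MODEL_FILE, CONFIG_FILE}) {
            if (!Files.isRegularFile(modelDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Usage: pass one or more model directories as command-line arguments.
    public static void main(String[] args) {
        for (String dir : args) {
            List<String> missing = missingFiles(Paths.get(dir));
            System.out.println(dir + ": "
                    + (missing.isEmpty() ? "ok" : "missing " + missing));
        }
    }
}
```

Running such a check on both the detection and the recognition model directories surfaces a missing configuration file early, rather than as a failure while the predictors are being built.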

Using language-specific models

If you need to recognize text in a specific language, use a dedicated language model for best results. A selection of converted ONNX models is available in our Hugging Face model repository.

We use the multi-language PP-OCRv5_mobile_rec_infer model in our samples. This default configuration of the recognition model can accurately identify five major language types:

Simplified Chinese
Pinyin
Traditional Chinese
English
Japanese

However, if you use this model for Hindi, the results will be incorrect. For example:

Expected Result: “मानक हनिदी Hindi”
Actual Result: “HIaTh a Hindi”

You should instead use the devanagari_PP-OCRv5_mobile_rec_infer model for Hindi, which gives more accurate results.
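The choice of recognition model can be made explicit in code. The sketch below is plain Java and does not use the pdfOCR API: it maps a language name to one of the two model names mentioned on this page and rejects anything else. The class name RecognitionModelChooser, the method modelFor, and the lower-case language keys are hypothetical choices made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of pdfOCR: picks a recognition model name
// based on the language of the text to be recognized.
final class RecognitionModelChooser {
    static final String DEFAULT_MODEL = "PP-OCRv5_mobile_rec_infer";
    static final String DEVANAGARI_MODEL = "devanagari_PP-OCRv5_mobile_rec_infer";

    // The five language types covered by the default multi-language model.
    private static final Set<String> DEFAULT_LANGUAGES = new HashSet<>(Arrays.asList(
            "simplified chinese", "pinyin", "traditional chinese", "english", "japanese"));

    // Returns the model name for the given language, ignoring case and
    // surrounding whitespace. Languages not mentioned on this page are
    // rejected instead of silently falling back to the default model.
    static String modelFor(String language) {
        String key = language.trim().toLowerCase(Locale.ROOT);
        if (DEFAULT_LANGUAGES.contains(key)) {
            return DEFAULT_MODEL;
        }
        if (key.equals("hindi")) {
            return DEVANAGARI_MODEL;
        }
        throw new IllegalArgumentException(
                "No recognition model known for language: " + language);
    }

    public static void main(String[] args) {
        for (String language : new String[] {"English", "Japanese", "Hindi"}) {
            System.out.println(language + " -> " + modelFor(language));
        }
    }
}
```

Failing on an unknown language is a deliberate choice here: as the Hindi example shows, running the default model on an unsupported script produces garbled text rather than an error.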

To check whether a specific language is supported, see the PaddleOCR documentation. The EasyOCR documentation site also maintains a list of supported languages.