The release of pdfOCR 5.0.0 introduced support for pretrained ONNX PaddleOCR and EasyOCR models, adding to the docTR models already supported.
The following code sample shows how to generate a searchable PDF by running OCR with a PaddleOCR model (converted to ONNX format) through pdfOCR’s ONNX-based OCR engine.
After loading the specified input image, the sample builds a detection predictor and a recognition predictor from the PaddleOCR ONNX model files (inference.onnx) and their accompanying configuration files (inference.yml), and then creates an output PDF.
Check the comments in the example for more details.
Compatible PaddleOCR/EasyOCR models already converted to ONNX format are available from our Hugging Face repository.
Java
##GITHUB:https://github.com/itext/itext-publications-examples-java/blob/develop/src/main/java/com/itextpdf/samples/sandbox/pdfocr/onnx/PdfOcrOnnxPaddleOcrExample.java##
C#
##GITHUB:https://github.com/itext/itext-publications-samples-dotnet/blob/develop/itext/itext.samples/itext/samples/sandbox/pdfocr/onnx/PdfOcrOnnxPaddleOcrExample.cs##
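Because each predictor is built from an inference.onnx file and its inference.yml configuration, it can help to confirm that both files are present before running OCR. The following is a minimal sketch using only the Java standard library. The class PaddleModelDirCheck and its method are hypothetical helpers, not part of the pdfOCR API, and the sketch assumes that each model's two files sit together in one directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of pdfOCR): checks that a converted PaddleOCR
// model directory contains the two files the sample builds its predictors from.
final class PaddleModelDirCheck {
    static final String MODEL_FILE = "inference.onnx";
    static final String CONFIG_FILE = "inference.yml";

    private PaddleModelDirCheck() {
    }

    // Returns the names of the expected files that are absent from modelDir.
    // An empty list means the directory looks complete.
    static List<String> missingFiles(Path modelDir) {
        List<String> missing = new ArrayList<>();
        for (String name : new String[] {MODEL_FILE, CONFIG_FILE}) {
            if (!Files.isRegularFile(modelDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the model directory is passed as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingFiles(dir);
        if (missing.isEmpty()) {
            System.out.println("Model directory looks complete: " + dir);
        } else {
            System.out.println("Missing in " + dir + ": " + missing);
        }
    }
}
```

Running the check once for the detection model directory and once for the recognition model directory, before the predictors are created, turns a missing download into a clear message instead of a failure during model loading.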
Using language-specific models
To recognize text in a specific language, use a dedicated language model for best results. You’ll find a selection of converted ONNX models in our Hugging Face model repository.
We use the multi-language PP-OCRv5_mobile_rec_infer model in our samples. This default configuration of the recognition model can accurately identify five major language types:
- Simplified Chinese
- Pinyin
- Traditional Chinese
- English
- Japanese
However, if you use this model for Hindi the results will be incorrect. For example:
- Expected Result: “मानक हनिदी Hindi”
- Actual Result: “HIaTh a Hindi”
You should instead use the devanagari_PP-OCRv5_mobile_rec_infer model for Hindi, which gives more accurate results.
To check if a specific language is supported, see the PaddleOCR documentation. The EasyOCR documentation site also maintains a list of supported languages.
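The model choice described above can be captured in a small lookup so that an unsupported language fails early instead of producing garbled text. This is a sketch only. RecognitionModelSelector is a hypothetical class, not part of pdfOCR, and the language tags used as keys are an assumption made for illustration. Only the two model names come from this article, and Pinyin is left out because it is a romanization rather than a language tag.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of pdfOCR): maps a language tag to the name of
// the recognition model this article recommends for it.
final class RecognitionModelSelector {
    static final String DEFAULT_MODEL = "PP-OCRv5_mobile_rec_infer";
    static final String DEVANAGARI_MODEL = "devanagari_PP-OCRv5_mobile_rec_infer";

    private static final Map<String, String> MODEL_BY_LANGUAGE = new HashMap<>();

    static {
        // Languages the article lists for the default multi-language model.
        MODEL_BY_LANGUAGE.put("zh", DEFAULT_MODEL); // Simplified and Traditional Chinese
        MODEL_BY_LANGUAGE.put("en", DEFAULT_MODEL);
        MODEL_BY_LANGUAGE.put("ja", DEFAULT_MODEL);
        // Hindi needs the dedicated Devanagari model.
        MODEL_BY_LANGUAGE.put("hi", DEVANAGARI_MODEL);
    }

    private RecognitionModelSelector() {
    }

    // Accepts tags such as "hi", "EN" or "zh-Hant"; only the primary language
    // subtag is used for the lookup.
    static String modelFor(String languageTag) {
        String language = Locale.forLanguageTag(languageTag).getLanguage();
        String model = MODEL_BY_LANGUAGE.get(language);
        if (model == null) {
            throw new IllegalArgumentException(
                    "No model mapped for '" + languageTag
                            + "'; check the PaddleOCR documentation for a suitable model");
        }
        return model;
    }

    public static void main(String[] args) {
        // Assumption: the language tag is passed as the first argument.
        String tag = args.length > 0 ? args[0] : "hi";
        System.out.println(tag + " -> " + modelFor(tag));
    }
}
```

Throwing for unmapped languages, rather than silently falling back to the default model, avoids the kind of incorrect Hindi output shown earlier. Add entries to the map as you download further language-specific models.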