Installing iText pdfOCR for Java developers
How to install pdfOCR Java version
Thank you for your interest in our OCR add-on, pdfOCR. We hope you will enjoy using our product and share your experiences with us and the iText community. We will walk you through the installation process, from downloading iText pdfOCR to adding the dependency to your Java build tool.
If you require any extra help, please have a look at our FAQs or the community discussion on Stack Overflow. If you are interested in getting support from our in-house developers and/or a license key for commercial iText products, you will need to acquire a commercial license.
Before you install
If you want to use pdfOCR for non-commercial purposes, make sure you have read and agreed to the AGPL license. All open-source downloads we offer come with the AGPL license model.
If you want to use pdfOCR for commercial purposes, make sure you have purchased a commercial license for iText Core. All closed-source downloads we offer come with our commercial license model.
For a closed-source pdfOCR installation, download and install the proper license key library; you can find the installation guide here (you will need at least version 3.1.1 of the library).
Check the compatibility matrix to ensure the version you specify when adding the add-on's dependency matches the version of iText Core you have a license for.
Download the modules (.jar) of iText Core/Community and pdfOCR (ZIP files) from Maven Central or the iText Artifactory Server.
Install iText Core or Community; you can find the installation guide here.
Important remark: the installation guide uses Maven as the build tool for Java.
You will need Tesseract's training data, which you can get here.
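Before wiring the training data into an OCR run, it can help to confirm that the directory you downloaded actually contains the languages you expect. The following is a minimal sketch using only the Java standard library; it relies on Tesseract's convention of naming training data files <language>.traineddata. The class name TessdataCheck, its methods, and the default directory name "tessdata" are illustrative assumptions, not part of pdfOCR.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of pdfOCR): lists the languages found in a tessdata directory. */
final class TessdataCheck {
    private static final String SUFFIX = ".traineddata";

    /** Returns "eng" for "eng.traineddata", or null if the name is not a training data file. */
    static String languageOf(String fileName) {
        if (fileName.length() > SUFFIX.length() && fileName.endsWith(SUFFIX)) {
            return fileName.substring(0, fileName.length() - SUFFIX.length());
        }
        return null;
    }

    /** Returns the sorted language codes in the directory; empty if it is missing or unreadable. */
    static List<String> listLanguages(Path tessdataDir) {
        List<String> languages = new ArrayList<>();
        if (!Files.isDirectory(tessdataDir)) {
            return languages;
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tessdataDir, "*" + SUFFIX)) {
            for (Path file : files) {
                String language = languageOf(file.getFileName().toString());
                if (language != null) {
                    languages.add(language);
                }
            }
        } catch (IOException e) {
            return Collections.emptyList();
        }
        Collections.sort(languages);
        return languages;
    }

    public static void main(String[] args) {
        // "tessdata" is an assumed default; pass your own directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        List<String> languages = listLanguages(dir);
        if (languages.isEmpty()) {
            System.out.println("No .traineddata files found in " + dir.toAbsolutePath());
        } else {
            System.out.println("Available languages: " + languages);
        }
    }
}
```

Running it with the path to your training data directory prints the language codes it found, which are the values you would then request from the OCR engine.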
Installation
Using the Central Repository
iText pdfOCR is available via Maven on the Central Repository. Add iText pdfOCR as a dependency to your pom.xml, using the dependency coordinates listed in step 2 of the iText Artifactory Server instructions below.
Using the iText Artifactory Server
iText pdfOCR is also available on the iText Artifactory server, where you can also find the license key library and the pdfOCR add-on. You need an additional license key if you want to use pdfOCR closed-source (for commercial purposes).
You can add this server as an additional Maven repository in the repositories section of your pom.xml or settings.xml, as described in the Maven documentation. Maven will then automatically query this repository for the add-on .jar files.
You can also browse the iText Artifactory server and download jars manually.
1. Add the repository to your pom.xml project file
<!-- All add-ons and iText Core -->
<repositories>
    <repository>
        <id>itext</id>
        <name>iText Repository - releases</name>
        <url>https://repo.itextsupport.com/releases</url>
    </repository>
</repositories>
2. Pick the OCR engine you want to use: either tesseract4 or ONNX.
a. If tesseract4 is your flavour:
<properties>
    <itext.pdfocr.version>$release-pdfOCR-variable</itext.pdfocr.version>
</properties>
<dependencies>
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>pdfocr-tesseract4</artifactId>
        <version>${itext.pdfocr.version}</version>
    </dependency>
</dependencies>
b. If ONNX is your flavour:
<properties>
    <itext.pdfocr.version>$release-pdfOCR-variable</itext.pdfocr.version>
</properties>
<dependencies>
    <!-- ONNX Abstract (required for both CPU and GPU execution) -->
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>pdfocr-onnx-abstract</artifactId>
        <version>${itext.pdfocr.version}</version>
    </dependency>
    <!-- ONNX CPU (for CPU inference only) -->
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>pdfocr-onnx-cpu</artifactId>
        <version>${itext.pdfocr.version}</version>
        <!-- Omit this dependency if using GPU -->
    </dependency>
    <!-- OnnxRuntime GPU to use with pdfocr-onnx-abstract. Source: https://mvnrepository.com/artifact/com.microsoft.onnxruntime/onnxruntime_gpu -->
    <!-- Note: onnxruntime.version is not set in the properties block above; define it yourself before using this dependency. -->
    <dependency>
        <groupId>com.microsoft.onnxruntime</groupId>
        <artifactId>onnxruntime_gpu</artifactId>
        <version>${onnxruntime.version}</version>
        <!-- Omit this dependency if using CPU -->
    </dependency>
</dependencies>
Memory usage
If memory usage is too high, you can increase the memory allocation pool size to prevent an OutOfMemoryError (OOM) by passing JVM arguments such as -Xmx12g -Dorg.bytedeco.javacpp.maxPhysicalBytes=8g when you run your application. The -Xmx option sets the largest size the heap can grow to; maxPhysicalBytes sets the maximum amount of memory that physicalBytes() may report before a call to System.gc() is forced.
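To confirm that these arguments actually reached the JVM, you can print the effective values at startup. The sketch below uses only the Java standard library. The class MemorySettingsCheck and its parseSize helper are hypothetical, and parseSize is a simplified parser (plain bytes or a k/m/g suffix) that is not guaranteed to match how JavaCPP itself interprets the property.

```java
import java.util.Locale;

/** Hypothetical helper (not part of pdfOCR or JavaCPP): reports the memory settings seen by the JVM. */
final class MemorySettingsCheck {

    /** Simplified size parser: plain bytes, or a k/m/g suffix using 1024-based multipliers. */
    static long parseSize(String value) {
        String text = value.trim().toLowerCase(Locale.ROOT);
        if (text.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        long multiplier = 1L;
        switch (text.charAt(text.length() - 1)) {
            case 'k': multiplier = 1L << 10; break;
            case 'm': multiplier = 1L << 20; break;
            case 'g': multiplier = 1L << 30; break;
            default: break;
        }
        String digits = multiplier == 1L ? text : text.substring(0, text.length() - 1);
        return Long.parseLong(digits) * multiplier;
    }

    public static void main(String[] args) {
        // maxMemory() reflects -Xmx; the JVM may report slightly less than the requested value.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxHeap >> 20) + " MiB");

        // Set on the command line with -Dorg.bytedeco.javacpp.maxPhysicalBytes=<size>.
        String limit = System.getProperty("org.bytedeco.javacpp.maxPhysicalBytes");
        if (limit == null) {
            System.out.println("maxPhysicalBytes: not set");
        } else {
            try {
                System.out.println("maxPhysicalBytes: " + limit + " (" + parseSize(limit) + " bytes)");
            } catch (NumberFormatException e) {
                System.out.println("maxPhysicalBytes: " + limit + " (not understood by this sketch)");
            }
        }
    }
}
```

For example, launching it with java -Xmx12g -Dorg.bytedeco.javacpp.maxPhysicalBytes=8g MemorySettingsCheck should show a heap limit close to 12 GiB and the 8g property value.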
Configuring GPU for ONNX Engine
Dependencies
Use pdfocr-onnx-abstract and ensure a compatible execution provider is available.
For GPU execution, the onnxruntime_gpu dependency is also required.
Do not include pdfocr-onnx-cpu if using GPU.
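One way to catch a CPU/GPU dependency mix-up early is to count how many copies of the ONNX Runtime Java classes are visible on the classpath. The sketch below assumes that both the CPU and GPU ONNX Runtime artifacts ship the class ai.onnxruntime.OrtEnvironment, and that pdfocr-onnx-cpu brings the CPU runtime in transitively; verify both against your own dependency tree. The class OnnxRuntimeClasspathCheck is hypothetical. It inspects the system class loader, so it suits a plain java -cp launch, and it cannot detect duplicates that were merged into a single fat jar.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical helper (not part of pdfOCR): detects duplicate ONNX Runtime jars on the classpath. */
final class OnnxRuntimeClasspathCheck {
    /** Resource path of a core ONNX Runtime Java class. */
    static final String ORT_CLASS_RESOURCE = "ai/onnxruntime/OrtEnvironment.class";

    /** Returns every classpath location that provides the given resource. */
    static List<URL> findCopies(String resourceName) {
        try {
            Enumeration<URL> urls = ClassLoader.getSystemResources(resourceName);
            return new ArrayList<>(Collections.list(urls));
        } catch (IOException e) {
            return new ArrayList<>();
        }
    }

    /** Turns the number of copies found into a short diagnosis. */
    static String describe(int copies) {
        if (copies == 0) {
            return "ONNX Runtime not found on the classpath";
        }
        if (copies == 1) {
            return "Exactly one ONNX Runtime found";
        }
        return "Multiple ONNX Runtime copies found; CPU and GPU runtimes may both be present";
    }

    public static void main(String[] args) {
        List<URL> copies = findCopies(ORT_CLASS_RESOURCE);
        System.out.println(describe(copies.size()));
        for (URL copy : copies) {
            System.out.println("  " + copy);
        }
    }
}
```

If more than one location is listed, inspect the output of mvn dependency:tree and remove either pdfocr-onnx-cpu or onnxruntime_gpu, depending on the execution mode you want.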
Don’t forget to download the ONNX model(s) you want to use (you will need to specify them with OnnxOcrEngine).
You can find a wide range of compatible PaddleOCR/EasyOCR models from the following Hugging Face repository:
For docTR, we currently recommend the following models:
Felix92/onnxtr-fast-tiny for detection
Felix92/doctr-dummy-torch-crnn-vgg16-bn for recognition
(you will have to download the model .onnx files and use them as described in the documentation)
iText pdfOCR Java on GitHub
The source code is available on GitHub.
You can download the modules (.jar) of iText pdfOCR in ZIP files from Maven Central: