We are proud to announce the first release of pdfOCR, the newest addition to our iText 7 Suite, which enables you to OCR your images into fully ISO-compliant PDF or PDF/A-3u files, making it possible to access and process the text they contain.

Given that we rely on the open-source Tesseract 4.x project to do the heavy lifting, we couldn't in good conscience keep this add-on closed, so it is open source as well (Java GitHub repository and .NET GitHub repository). Feel free to head on over there and check out (pun intended) how we brought Tesseract into our ecosystem.

You may also notice that we have split the project into two modules: an API module and an implementation module for Tesseract. In essence, this means that you can hook up other OCR engines to iText, and it also means that we're keeping the door open to offering our users more engines to choose from.
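
To make the idea behind that split a bit more concrete, here is a minimal, self-contained sketch of the pattern. Note that it does not use the real pdfOCR types: OcrEngine, FixedTextEngine, UpperCaseEngine and recognizeAll are hypothetical names invented for this illustration, and the actual API module has its own interfaces and method signatures. The point is only that the consuming code depends on an interface, so engines can be swapped freely.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class OcrApiSketch {

    /** Hypothetical stand-in for the API module: the only contract the consumer sees. */
    public interface OcrEngine {
        String recognize(String imageName);
    }

    /** Hypothetical stand-in for an implementation module (the role Tesseract plays). */
    public static class FixedTextEngine implements OcrEngine {
        private final String text;

        public FixedTextEngine(String text) {
            this.text = text;
        }

        @Override
        public String recognize(String imageName) {
            // A real engine would read the image; this one returns canned text.
            return text;
        }
    }

    /** A second hypothetical engine, plugged in without changing the consumer. */
    public static class UpperCaseEngine implements OcrEngine {
        private final OcrEngine delegate;

        public UpperCaseEngine(OcrEngine delegate) {
            this.delegate = delegate;
        }

        @Override
        public String recognize(String imageName) {
            return delegate.recognize(imageName).toUpperCase(Locale.ROOT);
        }
    }

    /** Consumer written purely against the interface, one output line per image. */
    public static String recognizeAll(OcrEngine engine, List<String> imageNames) {
        StringBuilder sb = new StringBuilder();
        for (String name : imageNames) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(name).append(": ").append(engine.recognize(name));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> images = Arrays.asList("invoice_front.jpg");
        System.out.println(recognizeAll(new FixedTextEngine("hello"), images));
        System.out.println(recognizeAll(new UpperCaseEngine(new FixedTextEngine("hello")), images));
    }
}
```

Swapping the engine passed to recognizeAll changes the recognized text without touching the consumer, which is roughly the freedom the API/implementation split is meant to give you.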

As it is a new product, we are still busy creating some documentation for it, although its usage is pretty straightforward... Don't believe us? Check this out: 

Java:

import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.pdfocr.OcrPdfCreator;
import com.itextpdf.pdfocr.tesseract4.Tesseract4LibOcrEngine;
import com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties;

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class JDoodle {
    // Images to run through OCR
    private static List<File> LIST_IMAGES_OCR = Arrays.asList(new File("invoice_front.jpg"));
    // Where the resulting PDF is written
    private static String OUTPUT_PDF = "/myfiles/hello.pdf";

    public static void main(String[] args) throws IOException {
        // Wire the Tesseract 4 engine into the PDF creator
        OcrPdfCreator ocrPdfCreator = new OcrPdfCreator(
                new Tesseract4LibOcrEngine(new Tesseract4OcrEngineProperties()));
        try (PdfWriter writer = new PdfWriter(OUTPUT_PDF)) {
            // OCR the images, then close the returned document to flush it to disk
            ocrPdfCreator.createPdf(LIST_IMAGES_OCR, writer).close();
        }
    }
}

.NET (C#):

using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;
using iText.Pdfocr;
using iText.Pdfocr.Tesseract4;

public class Program
{
    // Where the resulting PDF is written
    private static string OUTPUT_PDF = "/myfiles/hello.pdf";
    // Images to run through OCR
    private static IList<FileInfo> LIST_IMAGES_OCR = new List<FileInfo> { new FileInfo("invoice_front.jpg") };

    static void Main()
    {
        // Wire the Tesseract 4 engine into the PDF creator
        var ocrPdfCreator = new OcrPdfCreator(
            new Tesseract4LibOcrEngine(new Tesseract4OcrEngineProperties()));
        using (var writer = new PdfWriter(OUTPUT_PDF))
        {
            // OCR the images, then close the returned document to flush it to disk
            ocrPdfCreator.CreatePdf(LIST_IMAGES_OCR, writer).Close();
        }
    }
}

In any case, here's a list to get you started:

Installation Instructions

We reckon that pdfOCR goes very well with pdf2Data (so you can extract that freshly recognized information), pdfSweep (so you can redact the information that you definitely mustn't keep) and a pinch of pdfCalligraph (so that your advanced language scripts continue to look perfect). But this is just us throwing a crazy recipe together. You do you.