What is iText pdf2Data

Many of the PDF documents that businesses need to process, such as registration forms and invoices, follow a common structure. Take an invoice as an example: addresses, purchase order numbers, and similar elements tend to appear in the same place, and only the content, such as item descriptions, quantities, and costs, changes from invoice to invoice.

iText pdf2Data offers an easy way to extract data from such PDF documents by defining areas and rules in a template which correspond to the content you want to extract. The template can then be visually validated with other documents to confirm data is recognized correctly, before being parsed by the pdf2Data SDK to process all subsequent documents matching that template.
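
To make the idea concrete, here is a minimal conceptual sketch in plain Java. It is not the pdf2Data API: real templates are built by selecting areas and rules in the Template Editor, whereas this sketch assumes the page text has already been extracted and uses hypothetical regular-expression rules. The names `TemplateSketch`, `RULES`, `extract`, and the field names are all invented for illustration. It only shows the principle that a template is defined once and then applied to every document with the same layout.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual illustration only; none of these names belong to pdf2Data.
class TemplateSketch {
    // Hypothetical "template": field name -> rule that locates the value.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("invoiceNumber", Pattern.compile("Invoice No:\\s*(\\S+)"));
        RULES.put("total", Pattern.compile("Total:\\s*([0-9]+\\.[0-9]{2})"));
    }

    // Apply every rule to one document's text; fields without a match are omitted.
    static Map<String, String> extract(String documentText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(documentText);
            if (m.find()) {
                fields.put(rule.getKey(), m.group(1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Two documents with the same layout but different content.
        String first = "Invoice No: A-1001\nTotal: 250.00";
        String second = "Invoice No: A-1002\nTotal: 99.50";
        System.out.println(extract(first));
        System.out.println(extract(second));
    }
}
```

The same rules handle both sample documents because only the values differ, not their position or labels.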

iText pdf2Data frees employees from routine data extraction tasks, which helps the company speed up its processes and reduce costs.

You don’t need hundreds of samples or intensive supervision to train the recognition process. Content recognition is controlled by the template you configure, so no training is required before you can begin extracting data. You only need one example document to enable data extraction from all subsequent documents.

Modifying templates is quick and easy, and iText pdf2Data offers excellent language support.

You can start using iText pdf2Data for mass document processing in three steps:

  1. Integrate the pdf2Data SDK into your Java or .NET application.
  2. Using one sample PDF, create an extraction template.
  3. Process subsequent documents matching that template.

Only the first step requires your development team. Template creation and adjustment don't require technical knowledge, since pdf2Data provides an intuitive browser-based Template Editor.

Components

iText pdf2Data consists of three modules:

  1. pdf2Data Template Editor - a browser-based app for creating extraction templates.
  2. pdf2Data SDK - available for both the Java and .NET ecosystems; you use it inside your code to extract values in XML format according to the template.
  3. pdf2Data CLI - a command-line interface for pdf2Data, and the fastest way to test data extraction locally.
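
Since the SDK returns extracted values as XML, your application typically needs to read that XML back into its own data structures. The sketch below shows one way to do this with the standard Java DOM parser. The XML layout it assumes (a `result` root with `field` elements carrying a `name` attribute) is hypothetical and chosen only for illustration, as are the names `XmlResultSketch` and `readFields`; check the SDK documentation for the actual output schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustration only: the XML layout below is an assumption, not the pdf2Data schema.
class XmlResultSketch {
    // Collect every <field name="...">value</field> element into a map.
    static Map<String, String> readFields(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("field");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.put(field.getAttribute("name"), field.getTextContent().trim());
            }
            return fields;
        } catch (Exception e) {
            // Wrap parser errors so callers do not have to handle checked exceptions.
            throw new IllegalArgumentException("Could not parse extraction result", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical extraction result for a single invoice.
        String xml = "<result>"
                + "<field name=\"invoiceNumber\">A-1001</field>"
                + "<field name=\"total\">250.00</field>"
                + "</result>";
        System.out.println(readFields(xml));
    }
}
```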

Integrating the SDK is straightforward and usually only needs to be done once (please see the installation guide).

Adding support for a new document type to the parsing flow normally doesn't require any changes in your code, just the creation of a new extraction template.

Extract data from encrypted documents

iText pdf2Data is able to extract data from encrypted documents, provided that:

  • the document is password protected with a blank document-open (also known as 'user') password
  • the document's permissions allow text extraction
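
For background, the PDF specification stores a document's permissions as a 32-bit integer (the `P` entry of the encryption dictionary), in which bit 5 (value 16) permits copying or extracting text and graphics. The sketch below only illustrates how that bit can be tested. The class and method names are hypothetical, how you obtain the flag value depends on the PDF library you use, and the sketch makes no claim about how pdf2Data performs this check internally.

```java
// Illustration only; these names are not part of pdf2Data or iText.
class PermissionSketch {
    // Bit 5 of the permission flags (value 16): copy or extract text and graphics.
    static final int ALLOW_COPY = 1 << 4;

    // True when the permission flags allow text extraction.
    static boolean allowsTextExtraction(int permissionFlags) {
        return (permissionFlags & ALLOW_COPY) != 0;
    }

    public static void main(String[] args) {
        int permissive = -4;      // 0xFFFFFFFC: all permission bits set
        int restrictive = -3904;  // 0xFFFFF0C0: extraction bit cleared
        System.out.println(allowsTextExtraction(permissive));
        System.out.println(allowsTextExtraction(restrictive));
    }
}
```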

Encrypted documents cannot be used for template creation, as this requires the modification of the PDF document. You will need to decrypt such documents beforehand.

You can decrypt a PDF using iText 7 Core.

Next Steps

To see iText pdf2Data in action, follow our Extracting data from PDF invoices tutorial, which shows how to build and validate an extraction template for PDF invoices in the pdf2Data Editor.
To understand iText pdf2Data better, we recommend reading the sections Basic concepts of data extraction and Installation guidelines.