Basic concepts of data extraction

This section explains the core ideas behind iText pdf2Data. Understanding these ideas can help you design and build your applications more effectively.

If you can't wait to start creating your own template without any boring theory, you can jump straight into the Extracting data from invoice tutorial. However, if you'd like to learn some details about how iText pdf2Data does its magic, please read on.

How iText pdf2Data works

Extraction template

The PDF data extraction process starts with the creation of an extraction template. Each extraction template is a combination of the following two elements:

A sample PDF to build and test parsing on.
Several data fields which define the way data will be extracted.

Once an extraction template has been created, it serves as the basis for extracting data from future PDFs that match the template.

For iText pdf2Data, you normally create a data field for each value you want to extract. Extraction templates can be created using the browser-based pdf2Data 3.0 Editor.

Data field

A data field always has an associated region and a parsing pipeline (also called a parsing rule).

Region
This is a rectangular area you can define on the sample PDF file.
Parsing pipeline
This is created from the predefined pdf2Data selectors.
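
To make the relationship between these pieces concrete, here is a minimal sketch of a template modelled as plain data. It is an illustration only and does not reflect the actual pdf2Data template format or API: the class names, the field names, the coordinate convention, and the selector names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Region:
    """Rectangular area on a page of the sample PDF (hypothetical layout)."""
    page: int
    x: float
    y: float
    width: float
    height: float


@dataclass
class DataField:
    """One value to extract: a region plus an ordered parsing pipeline."""
    name: str
    region: Region
    pipeline: List[str]  # selector names, applied in order (hypothetical)


@dataclass
class ExtractionTemplate:
    """A sample PDF together with the data fields defined on it."""
    sample_pdf: str
    fields: List[DataField] = field(default_factory=list)


# Hypothetical template with one data field for an invoice number.
template = ExtractionTemplate(
    sample_pdf="sample-invoice.pdf",
    fields=[
        DataField(
            name="invoiceNumber",
            region=Region(page=1, x=400, y=700, width=150, height=20),
            pipeline=["text_in_region", "pattern"],
        )
    ],
)
```

In practice the pdf2Data 3.0 Editor creates and stores this information for you; the sketch only shows how the concepts nest.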

The data field's extracted value is the result of applying the parsing pipeline to a PDF document. During extraction, each selector receives data from the previous selector in the pipeline, converts it to the necessary format, filters it, and passes the filtered data to the next selector. The value extracted by a data field is the output of the last selector in the pipeline.
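
The chaining described above can be sketched in a few lines of Python. This is a conceptual illustration, not the pdf2Data implementation: the selectors here (`strip_whitespace`, `keep_matching`, `capture`) are hypothetical stand-ins for the predefined pdf2Data selectors, and the sample text is invented.

```python
import re
from typing import Callable, List

# A selector takes the chunks produced by the previous step and returns
# the chunks to hand to the next step (hypothetical simplification).
Selector = Callable[[List[str]], List[str]]


def strip_whitespace(chunks: List[str]) -> List[str]:
    """Convert step: normalise each chunk by trimming whitespace."""
    return [chunk.strip() for chunk in chunks]


def keep_matching(pattern: str) -> Selector:
    """Filter step: keep only the chunks that match the pattern."""
    regex = re.compile(pattern)
    return lambda chunks: [c for c in chunks if regex.search(c)]


def capture(pattern: str) -> Selector:
    """Extract step: replace each matching chunk with its first group."""
    regex = re.compile(pattern)

    def select(chunks: List[str]) -> List[str]:
        matches = (regex.search(c) for c in chunks)
        return [m.group(1) for m in matches if m]

    return select


def run_pipeline(chunks: List[str], pipeline: List[Selector]) -> List[str]:
    """Feed the output of each selector into the next one."""
    for selector in pipeline:
        chunks = selector(chunks)
    return chunks


# Invented text standing in for the content found inside a region.
region_text = ["  Invoice No: INV-0042 ", "Date: 2020-01-31", "Total: 99.50 EUR"]
pipeline = [
    strip_whitespace,
    keep_matching(r"^Invoice No"),
    capture(r"Invoice No:\s*(\S+)"),
]
result = run_pipeline(region_text, pipeline)
```

The output of the last selector, `result`, plays the role of the data field's extracted value.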

iText pdf2Data uses the content of the region to intelligently adjust and configure parsing pipelines, so the region you define affects the output.

Extracted values

To process PDF files with an extraction template, use either the CLI or the SDK module of pdf2Data. Both extract data from PDFs using the data fields defined in an extraction template, and provide all extracted values as a single XML file.

pdf2Data CLI is a command-line interface that requires no coding. However, for processing documents in bulk, you would likely prefer to use the pdf2Data SDK component within your application.

pdf2Data SDK is available as a Java or .NET library, which means you need to write code to use it. Happily, using the SDK is straightforward, and the integration normally needs to be done only once. Please see the installation guidelines.

Next Steps

To see iText pdf2Data in action, see our Build template for extracting data from PDF invoices tutorial. It shows how to build and validate an extraction template for PDF invoices using the pdf2Data 3.0 Editor.