Release iText pdf2Data 4.0

Introduction

iText pdf2Data is our user-friendly template-based data extraction solution.

It is a crucial part of your digital document workflow. iText pdf2Data helps you unlock and reuse data from PDF files accurately, across many different domains from logistics to finance.

At the same time, it makes it simple to define and manage extraction rules and templates, thanks to user-friendly components which are designed for non-technical users.

We are proud to introduce a new major release of pdf2Data, iText pdf2Data 4.0.

Breaking changes

New extraction template format (please see the migration guide)
pdf2Data Editor is no longer available as a WAR archive (please use the containerized version).

pdf2Data Manager

In this release, we are introducing a management component for iText pdf2Data – the pdf2Data Manager.

It not only acts as centralized storage for all your templates but also provides:

user access control
management of multiple workspaces
replacement of reference PDFs
easy extraction template creation from blueprints, and
parsing adjustments for existing templates

To allow iText pdf2Data to support all this, we are moving to a new, more flexible and reusable format for extraction templates. You won’t need to recreate your existing templates though, since the pdf2Data Manager also includes a converter tool that lets you import your legacy templates and convert them into the new format.

pdf2Data Editor

The new pdf2Data Manager is natively integrated with our existing pdf2Data Editor, which also gets some improvements.

As you might expect, user-friendliness and a great user experience are key for this component. Despite this, in previous versions you sometimes needed to use the expert mode and be familiar with the specific extraction language of pdf2Data to get the most out of it.

That's not the case anymore…

From now on, all extraction functionality is available in the UI. The expert mode still exists though, so you can continue to use it if you want, and it now benefits from the new, more convenient syntax.

pdf2Data SDK

The SDK is the key component that handles the job of extracting your data. It is usually hidden from end users, but not from developers.

We’ve been thinking for a while about API improvements that would let developers integrate the SDK into their workflows with less documentation to read. Since this release is a major one, we’ve introduced a number of API changes for the pdf2Data SDK. Overall, they make the API clearer and more consistent.

Extraction

Another key part of iText pdf2Data is the SDK’s extraction algorithms. These are custom-built to deal with document elements such as tables, paragraphs, dates, etc. We are working on adding to and improving these all the time, and this release is no exception.

In a nutshell:

Table extraction gained improved merging strategies for tables that span multiple pages.
Error messages became clearer, and therefore more useful for debugging.
The overall extraction process became more stable, reducing the chance of exceptions leading to problems.

Of course, the SDK also fully supports the new template format.

For more details, please see the migration guide.

Downloads and Links

iText pdf2Data 4.0 SDK	Java (Artifactory), .NET (Artifactory, NuGet)
iText pdf2Data 4.0 Editor	Pull image from Docker Hub (instructions)
iText pdf2Data 4.0 CLI	JAR file download

Improvements & New Features

pdf2Data Manager

Multiple workspaces
Authorization
Access control based on roles and workspaces
Creation of templates from blueprints
Search by template name and metadata
Replacement of the reference PDF in a template

pdf2Data Editor

Grouping selector
Multiline regular expressions
Improved result preview
Client-side validation of selectors

pdf2Data SDK

Improved extraction of large tables that span multiple pages
Standardization of the extraction API
Extended support for File and Stream classes in the API
Support for pdf2Data 4.0 template formats (.p2d, .p2dta)