Release iText pdf2Data 3.1.1

Introduction

We are proud to announce the release of iText pdf2Data 3.1.1, the latest version of our template-based data extraction solution. iText pdf2Data intelligently recognizes data inside structured and semi-structured PDF documents and extracts it in a structured format.

iText pdf2Data consists of two main components: the browser-based pdf2Data Editor, which enables the creation of extraction templates, and the pdf2Data SDK (available for Java, .NET, and as a command-line interface application), which you use to automatically extract data from PDF documents. This data can then be used in customer processes such as business analytics and reporting.

The main focus of this release is on the SDK side: adding JSON output support to simplify the reuse of extracted data, and improving the accuracy of our high-level extraction selectors.

What's new

JSON output

An important innovation in iText pdf2Data 3.1.1 is support for output in JSON format. Both the native SDK libraries and the CLI variant can now output extracted data in JSON as well as XML. This allows more convenient integration into microservice workflows and cloud-based solutions, where JSON is the de facto standard.

For anyone who prefers XML though, this output option is still available and can be used in the same way as before.
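As a rough illustration of why JSON output simplifies reuse, the sketch below loads a JSON document into a plain dictionary with Python's standard library. The actual pdf2Data JSON schema is not described in this announcement, so the sample document and its field names (`invoiceNumber`, `invoiceDate`, `total`) are hypothetical and only stand in for real extraction output.

```python
import json

# Hypothetical sample: the real pdf2Data JSON schema is not shown in this
# announcement, so these field names are assumptions for illustration only.
sample_output = """
{
  "invoiceNumber": "INV-0042",
  "invoiceDate": "2019-07-31",
  "total": "1250.00"
}
"""


def load_fields(raw_json):
    """Parse extraction output into a dict of field name -> value."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


fields = load_fields(sample_output)
print(fields["invoiceNumber"])
```

A downstream service can pass such a dictionary straight to its own reporting or analytics code without an XML parsing step.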

Improved data extraction

A key feature of iText pdf2Data is that, to ease the process of data extraction, it provides high-level selectors which your less-technical employees can use from the intuitive template editor. The accuracy of these selectors, and therefore of the extraction algorithms behind them, is vitally important for our customers.

In this release, we focused on tweaking two of them in particular: Date and Price. In addition to letting you manually configure these selectors to improve extraction, we have improved the validation of extracted values so that you get exactly what you expect in the XML or JSON output. You can now avoid getting outputs such as “32nd of July” from the Date selector or prices in euros when parsing US invoices.
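To make the “32nd of July” case concrete, here is a minimal sketch of strict calendar validation in Python. It is not the Date selector's implementation, which this announcement does not describe; the helper name `parse_strict_date` and the `DD/MM/YYYY` format are assumptions chosen for the example.

```python
from datetime import datetime


def parse_strict_date(text, fmt="%d/%m/%Y"):
    """Return a date if text is a real calendar date in fmt, else None.

    Hypothetical helper for illustration; not part of the pdf2Data SDK.
    """
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        # Raised for impossible values such as day 32 or 31 February.
        return None


print(parse_strict_date("31/07/2019"))  # a real date
print(parse_strict_date("32/07/2019"))  # rejected: July has 31 days
print(parse_strict_date("31/02/2019"))  # rejected: February has no day 31
```

Rejecting the candidate value outright, rather than coercing it to a nearby date, keeps implausible values out of the output.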

The improved table selector deserves special mention, since it is a favorite of many customers. iText pdf2Data features one of the best table extraction algorithms around, which is a significant reason our customers use it. We’re always working to raise the bar for recognizing and extracting tables in PDFs, though, and this release is no exception.

The table selector now:

Tolerates large line spacing
It no longer treats rows separated by large gaps as two separate tables, but instead merges all rows together.
Ignores watermarks, if there are any
Watermarks usually have different styling and don't respect the table structure, so they could previously break table recognition.
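The effect of line spacing on table detection can be pictured with a toy example. The sketch below is not iText's table algorithm, which is not described here. It only shows, under the assumption that rows are grouped by the vertical gap between them, how a fixed gap threshold splits one table in two while a more tolerant threshold keeps the rows together. The function name `group_rows` and the coordinates are made up.

```python
def group_rows(row_tops, max_gap):
    """Group row positions into tables; a gap above max_gap starts a new table.

    Toy illustration only; not the pdf2Data table selector.
    """
    tables = []
    for y in sorted(row_tops):
        if tables and y - tables[-1][-1] <= max_gap:
            tables[-1].append(y)  # close enough to the previous row
        else:
            tables.append([y])    # gap too large: start a new table
    return tables


rows = [100, 112, 124, 160, 172]      # one table with a large gap after 124
print(group_rows(rows, max_gap=15))   # [[100, 112, 124], [160, 172]]
print(group_rows(rows, max_gap=40))   # [[100, 112, 124, 160, 172]]
```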

Improved user experience

Users can now expect a better experience while creating extraction templates, as we've been making efforts to reduce the learning curve for new users. In addition to the improved high-level selectors, we’ve also revised the messaging in the pdf2Data Editor to provide users with clearer explanations and make it easier to begin data extraction.

And of course, we never stop improving pdf2Data documentation regardless of our release schedule!

What else?

We’ve fixed two bugs in this release: a null pointer exception for PDFs which contain unsupported color spaces, and an out-of-memory error which could occur when grouping lines with the Paragraph selector.

Downloads and Links

iText pdf2Data 3.1.1 SDK	Java (Artifactory), .NET (Artifactory, NuGet)
iText pdf2Data 3.1.1 Editor	Pull image from Docker Hub (instructions) or use the war
iText pdf2Data 3.1.1 CLI	Jar file download

New feature

JSON as a format for output data

Improvements

Table selector ignores watermarks
Table selector can work with various line spacings
Date selector will no longer extract invalid dates
Increased accuracy for price extraction (with the Price selector)
Template editor: Icons for inline help

Bug fixes

NPE: When a PDF contains an unsupported color space
OOM: When the Paragraph selector is used for grouping lines