Release pdf2Data 4.1 - pdf2Data Documentation

Introduction

pdf2Data is our user-friendly template-based data extraction solution.

It is a crucial part of your digital document workflow. pdf2Data helps you to unlock and reuse data from PDF files with perfect accuracy, across many different domains from logistics to finance.

At the same time, it makes it simple to define and manage extraction rules and templates, thanks to user-friendly components which are designed for non-technical users.

As you may know, pdf2Data 4.0 was released last year and introduced a new Manager component which targets the management and maintenance of templates. Template maintenance can be a significant pain point for anyone who uses multiple templates for parsing.

The Manager makes life easier for customers who need to maintain dozens, or even hundreds, of layouts for different documents.

We’ve been collecting feedback, feature requests, and bug reports from our customers, and looking for ways to raise the bar higher to perfect the pdf2Data management experience.

Now, we are happy to introduce pdf2Data 4.1, the first step on this path, and the first release of pdf2Data as part of the Apryse family.

pdf2Data Manager

We’ve continued to make improvements to the pdf2Data Manager to make it simpler to create, use and reuse extraction templates.

Version control

Extraction templates (reference file + parsing rules) now have version control. This means you can easily keep track of changes to both the data fields of an extraction template and its reference PDF.

From now on, you can work on templates collaboratively, create test revisions, roll back to an earlier version if something goes wrong, and use this feature in many other scenarios.

Copy template functionality

What if you need to create a new extraction template, which is slightly different from one you have already created? We’ve now got a shortcut for that.

You don’t need to create a brand-new template from scratch anymore. Now you can simply copy an existing one in a single click. Enter a name for the new template, and pdf2Data copies the parsing rules and reference file from the source template.

Performance

We’ve made some memory optimizations to increase performance. No more words, just test out the trial instance on pdf2Data.online and let us know if you think it could be faster.

pdf2Data Editor

Intuitive and user-friendly: these are two words we are happy to hear when we ask for opinions about the Editor UX.

It’s nice to be able to say we are hearing this quite often! However, we still see a lot of opportunities for further simplification of the extraction setup.

Compared to pdf2Data 4.0, you may not spot significant differences, since we focused mainly on streamlining preparation tasks. This groundwork makes it simple for developers to roll out the improvements you want during the next release cycle.

The most notable change is the “running text” parameter of the Paragraph selector, which makes processing of non-structured data easier. Running text can be used whenever you want to be sure the output won’t contain line break symbols, so that it is simple to search within or to process with NLP tools.
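To illustrate the effect described above, here is a minimal sketch of what "running text" output looks like compared with line-broken paragraph text. This is not the pdf2Data implementation; the function name `to_running_text` is hypothetical and the logic is only an assumption about the observable result (line breaks replaced by single spaces).

```python
import re


def to_running_text(paragraph: str) -> str:
    """Hypothetical helper: replace every line break with a single space.

    Any whitespace directly around a line break is dropped as well,
    so no double spaces are introduced at the join points.
    """
    return re.sub(r"\s*[\r\n]+\s*", " ", paragraph).strip()


extracted = "Invoice total\nis due within\r\n30 days."
running = to_running_text(extracted)
print(running)
```

With the line breaks gone, a plain substring search such as `"due within 30 days" in running` succeeds, which it would not on the original multi-line text.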

Besides that, we added the ability to specify the exact date format when the Date selector is used. This resolves issues with using a reference PDF where the date format is unclear. For example, for the date 01.01.2023 there are two possible correct date formats: either dd.mm.yyyy or mm.dd.yyyy.
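The ambiguity becomes visible as soon as the day and month differ. The sketch below uses Python's standard `datetime.strptime` purely as an illustration; the format strings are Python directives, not the syntax used by the pdf2Data Date selector, and the sample value `02.01.2023` is an assumed example.

```python
from datetime import datetime

raw = "02.01.2023"

# Same text, two equally valid readings depending on the declared format.
day_first = datetime.strptime(raw, "%d.%m.%Y").date()    # dd.mm.yyyy
month_first = datetime.strptime(raw, "%m.%d.%Y").date()  # mm.dd.yyyy

print(day_first.isoformat())    # 2023-01-02
print(month_first.isoformat())  # 2023-02-01
```

Declaring the format explicitly removes the guesswork: the extractor no longer has to infer the field order from a single reference document.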

Both of the above improvements affect the entire parsing workflow. Please note that your templates will be automatically converted into the new syntax when you open them in the new Editor component.

In addition, there are a few more visual changes when editing an extraction template:

The field name now pops up when you move the mouse over rectangles on the preview.
Better preview of extraction results.

pdf2Data SDK

The SDK is the key component that handles the job of extracting your data. It is usually hidden from users, but not from developers.

As of this release, the pdf2Data SDK is not only a native Java and .NET (C#) library but is also available as a Docker container with a REST API.

This means that it is now ready to be deployed in the cloud and used as a microservice, with very little effort from developers.

Please see all available REST commands described in the REST API guide on our Knowledge Base.

Trial experience

As before, all of the above changes are already available on pdf2Data.online, which always has the latest stable version deployed. You can request your free trial to test out the new functionality online, without needing to download or configure anything.

Important note:

Templates created with the pdf2Data 4.1 Editor (including those created on pdf2Data.online) cannot be used with the pdf2Data 4.0 SDK; you must update both components.

Downloads and Links

iText pdf2Data 4.1 SDK	Java (Artifactory), .NET (Artifactory, NuGet), REST service (deployment instructions)
iText pdf2Data 4.1 Manager & Editor	Deployment guide
iText pdf2Data 4.1 CLI	Jar file download
