Release iText pdf2Data 4.0
Introduction
iText pdf2Data is our user-friendly template-based data extraction solution.
It is a crucial part of your digital document workflow. iText pdf2Data helps you to unlock and reuse data from PDF files with perfect accuracy, across many different domains from logistics to finance.
At the same time, it makes it simple to define and manage extraction rules and templates, thanks to user-friendly components which are designed for non-technical users.
We are proud to introduce a new major release of pdf2Data, iText pdf2Data 4.0.
Breaking changes
- New extraction template format (please see the migration guide)
- pdf2Data Editor is no longer available as a WAR archive (please use the containerized version).
pdf2Data Manager
In this release, we are introducing a management component for iText pdf2Data – the pdf2Data Manager.
It not only acts as centralized storage for all your templates but also provides:
- user access control
- management of multiple workspaces
- replacement of reference PDFs
- easy extraction template creation from blueprints, and
- parsing adjustments for existing templates
To allow iText pdf2Data to support all this, we are moving to a new, more flexible and reusable format for extraction templates. You won’t need to recreate your existing templates though, since the pdf2Data Manager also includes a converter tool that lets you import your legacy templates and convert them into the new format.
pdf2Data Editor
The new pdf2Data Manager is natively integrated with our existing pdf2Data Editor, which also gets some improvements.
As you might expect, user-friendliness and a great user experience are key for this component. Despite this, in previous versions you sometimes needed to use the expert mode and be familiar with the specific extraction language of pdf2Data to get the most out of it.
That's not the case anymore…
From now on, all extraction functionality is available in the UI. The expert mode still exists though, so you can continue to use it if you want, and it now benefits from the new, more convenient syntax.
pdf2Data SDK
The SDK is the key component that handles the job of extracting your data. It is usually hidden from users, but not from developers.
We’ve been thinking for a while about API improvements that would let developers integrate the SDK into their workflows without having to read as much documentation. Since this is a major release, we’ve introduced a number of API changes for the pdf2Data SDK. Overall, they make the API clearer and more consistent.
Extraction
Another key part of iText pdf2Data is the SDK’s extraction algorithms. These are custom-built to deal with document elements such as tables, paragraphs, dates, etc. We are working on adding to and improving these all the time, and this release is no exception.
In a nutshell:
- Table extraction gained improved merging strategies for tables that span multiple pages.
- Error messages are clearer and therefore more useful for debugging.
- The overall extraction process is more stable, so exceptions are less likely to interrupt an extraction.
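To give a feel for what merging a multi-page table involves, here is a minimal sketch in plain Java. It is not the pdf2Data algorithm and does not use the pdf2Data SDK. The class `TableMergeSketch`, the method `mergeFragments`, and the rule "drop a repeated header row on continuation pages" are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not part of the pdf2Data SDK.
class TableMergeSketch {

    // Each fragment is the part of the table found on one page, as a list of rows.
    // Assumption: the first row of the first non-empty fragment is the header,
    // and later pages may or may not repeat it.
    static List<List<String>> mergeFragments(List<List<List<String>>> fragments) {
        List<List<String>> merged = new ArrayList<>();
        List<String> header = null;
        for (List<List<String>> fragment : fragments) {
            if (fragment.isEmpty()) {
                continue; // nothing was found on this page
            }
            int start = 0;
            if (header == null) {
                header = fragment.get(0); // remember the header, keep it in the output
            } else if (fragment.get(0).equals(header)) {
                start = 1; // skip the header repeated on a continuation page
            }
            merged.addAll(fragment.subList(start, fragment.size()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> page1 = List.of(List.of("Item", "Qty"), List.of("Bolt", "10"));
        List<List<String>> page2 = List.of(List.of("Item", "Qty"), List.of("Nut", "20"));
        List<List<String>> page3 = List.of(List.of("Washer", "5"));
        // Print the merged table, one row per line.
        mergeFragments(List.of(page1, page2, page3)).forEach(System.out::println);
    }
}
```

A real merging strategy also has to deal with rows split across a page break, changing column positions, and page footers, which this sketch deliberately ignores.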
Of course, the SDK also fully supports the new template format.
For more details, please see the migration guide.
Downloads and Links
Component | Download |
---|---|
iText pdf2Data 4.0 SDK | Java (Artifactory), .NET (Artifactory, NuGet) |
iText pdf2Data 4.0 Editor | Pull image from Docker Hub (instructions) |
iText pdf2Data 4.0 CLI | Jar file download |
Improvements & New Features
pdf2Data Manager
- Multiple workspaces
- Authorization
- Access control based on roles and workspaces
- Creation of templates from blueprints
- Search by template name and metadata
- A reference PDF in a template can be replaced
pdf2Data Editor
- Grouping selector
- Multiline regular expressions
- Improved result preview
- Client-side validation of selectors
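As a rough illustration of what a multiline regular expression can capture, the sketch below uses the standard `java.util.regex` package. It does not show the pdf2Data selector syntax. The class `MultilineRegexSketch`, the method `extractShipTo`, and the sample invoice text are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: plain java.util.regex, not pdf2Data selector syntax.
class MultilineRegexSketch {

    // MULTILINE makes ^ match at the start of every line.
    // DOTALL lets '.' match line breaks, so the capture group can span several lines.
    // The capture ends at the first blank line or at the end of the input.
    private static final Pattern SHIP_TO = Pattern.compile(
            "^Ship to:\\R(.*?)(?=\\R\\R|\\z)",
            Pattern.MULTILINE | Pattern.DOTALL);

    static Optional<String> extractShipTo(String text) {
        Matcher m = SHIP_TO.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "Invoice 1001\nShip to:\nACME Corp\n12 Main Street\nAntwerp\n\nTotal: 42.00";
        // Prints the address block that follows the "Ship to:" label.
        System.out.println(extractShipTo(page).orElse("<no match>"));
    }
}
```

A single-line pattern would stop at the first line break, whereas here the whole address block is returned as one value.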
pdf2Data SDK
- Improved extraction of large tables which span multiple pages
- Standardization of the extraction API
- Extended support for File and Stream classes in the API
- Support for pdf2Data 4.0 template formats (.p2d, .p2dta)