The iText pdf2Data 4.0 release brings many important improvements and features for data extraction (see the release notes for details). To make them possible, we had to introduce some breaking changes, of which the new template format and the pdf2Data Manager have the greatest impact on your workflows.
In this article we discuss these changes and how you can migrate from version 3.x to version 4.x.
Template Creation with pdf2Data Manager and pdf2Data Editor
From now on you can choose one of two components to create extraction templates: either the pdf2Data Editor or the pdf2Data Manager.
To install pdf2Data Manager, which already includes pdf2Data Editor, we recommend using our new deployment script. See more details on its usage here.
iText pdf2Data 4.0 doesn't support templates created in previous versions of pdf2Data (1.*-3.*). However, those templates can be easily converted:
- In pdf2Data Manager, you can use the import button.
- In pdf2Data Editor it is even easier: you can upload templates in the old format via the upload form on the start page, and they will be automatically converted into the new format.
Templates can be downloaded from the pdf2Data Editor in the new format only.
PDF parsing with pdf2Data SDK
After upgrading to iText pdf2Data SDK v4.x you can continue to use the API for processing PDFs with templates in the old format, but this API is deprecated and will be removed in the next release.
Therefore, we recommend that you start using the new template format now:
First of all, you need to migrate your templates.
You can also convert a template by importing it into the pdf2Data Manager (as mentioned above).
Note that a template can exist in two forms: unprocessed, which was PDF-based in v3, and processed, which was XML-based in v3.
This concept is preserved in iText pdf2Data 4.0, but to make it easier to distinguish between the two forms we introduced two different file types:
p2dta for unprocessed templates and p2d for processed ones. The second format is the one used by the SDK while parsing.
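Since the two template forms are told apart by file type, calling code can check the extension before handing a file to the SDK. The following is a minimal sketch in plain Java; `TemplateFiles`, `Kind` and `kindOf` are hypothetical names introduced for illustration and are not part of the pdf2Data SDK.

```java
import java.util.Locale;

/** Hypothetical helper (not part of the pdf2Data SDK). */
final class TemplateFiles {

    /** The two template forms described above, plus a fallback. */
    enum Kind { UNPROCESSED, PROCESSED, UNKNOWN }

    /** Classifies a template file by its extension only; the content is not inspected. */
    static Kind kindOf(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".p2dta")) {
            return Kind.UNPROCESSED; // unprocessed template archive
        }
        if (lower.endsWith(".p2d")) {
            return Kind.PROCESSED;   // processed template archive, used for parsing
        }
        return Kind.UNKNOWN;         // e.g. a legacy v3 PDF or XML template
    }

    public static void main(String[] args) {
        for (String name : new String[] {"invoice.p2dta", "invoice.p2d", "invoice.pdf"}) {
            System.out.println(name + " -> " + kindOf(name));
        }
    }
}
```

A check like this only guards against passing the wrong file type; it does not validate the archive itself.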
Convert unprocessed templates
Convert processed templates (XML)
Also note that the new SDK API that performs the actual extraction now works with processed templates only. So, if your source templates are in the unprocessed PDF format, you will need to process them first. For that we have one more utility:
Process unprocessed templates
For performance reasons, we recommend performing all conversions once, separately from your main production flow.
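One way to keep this work out of the production flow is a small one-time batch job that processes every unprocessed template in a folder and skips those that already have a processed counterpart. The sketch below uses only the Java standard library. The actual processing step is left as a hypothetical `TemplateProcessor` hook, because the pdf2Data utility call is not reproduced here; the class and method names are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical one-time batch job (not part of the pdf2Data SDK). */
final class OneTimeTemplateProcessing {

    /** Hypothetical hook: plug the real pdf2Data processing utility in here. */
    interface TemplateProcessor {
        void process(Path unprocessedP2dta, Path processedP2d) throws IOException;
    }

    /** Maps "name.p2dta" to a sibling file "name.p2d". */
    static Path targetFor(Path source) {
        String name = source.getFileName().toString();
        String base = name.substring(0, name.length() - ".p2dta".length());
        return source.resolveSibling(base + ".p2d");
    }

    /** Processes every .p2dta file in dir that has no .p2d yet; returns the targets written. */
    static List<Path> processAll(Path dir, TemplateProcessor processor) throws IOException {
        List<Path> sources;
        try (Stream<Path> files = Files.list(dir)) {
            sources = files
                    .filter(p -> p.getFileName().toString().endsWith(".p2dta"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<Path> written = new ArrayList<>();
        for (Path source : sources) {
            Path target = targetFor(source);
            if (Files.exists(target)) {
                continue; // already processed in an earlier run
            }
            processor.process(source, target);
            written.add(target);
        }
        return written;
    }
}
```

The production code then only needs to load the resulting .p2d files, so template processing cost is paid once rather than on every extraction run.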
The initialization of the Pdf2DataExtractor instance from a processed template should now be done with one function call:
The rest of the API remains more or less the same, except that we now recommend using File or stream objects instead of String paths as input:
Save to XML:
Save to JSON:
If for some reason you need a Template object (e.g. for processing grouped results programmatically), you can obtain it from the extractor via a getter:
- The help option is removed. Only the --help flags remain:
- License is not needed for preprocess.
- The --template option is replaced with --source. The input file must be in .p2dta format (unprocessed template archive).
- The --xml option is replaced with --destination. The output file will be in .p2d format (processed template archive).
- The template file passed as the --template argument must be of .p2d type (processed template archive).
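If the command line is invoked from scripts, the file-type rules above can be checked before the tool is launched. The sketch below is plain Java and does not call the pdf2Data CLI. It only inspects an argument list for the three flags named above and reports values with an unexpected extension; `CliArgumentCheck` and its methods are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical pre-flight check (not part of the pdf2Data CLI). */
final class CliArgumentCheck {

    /** Returns the extension a flag's value must have, or null for any other argument. */
    static String requiredExtension(String flag) {
        if ("--source".equals(flag)) return ".p2dta";    // unprocessed template archive
        if ("--destination".equals(flag)) return ".p2d"; // processed template archive
        if ("--template".equals(flag)) return ".p2d";    // processed template archive
        return null;
    }

    /** Lists every flag whose file argument is missing or has the wrong extension. */
    static List<String> problems(List<String> args) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < args.size(); i++) {
            String flag = args.get(i);
            String required = requiredExtension(flag);
            if (required == null) {
                continue; // not one of the flags checked here
            }
            if (i + 1 >= args.size()) {
                problems.add(flag + " is missing its file argument");
                continue;
            }
            String value = args.get(i + 1);
            if (!value.toLowerCase(Locale.ROOT).endsWith(required)) {
                problems.add(flag + " expects a " + required + " file, got " + value);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        problems(Arrays.asList(args)).forEach(System.out::println);
    }
}
```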