pdf2Data: Grouping of extracted values in XML
pdf2Data 2.1.9 introduces a new rule: groupByTb: dataField. This rule groups extracted values in the output XML file according to their y coordinate relative to the coordinates of the data field named in the rule.
Note that this article only applies to versions prior to the release of iText pdf2Data 3.1.1. If you are using that version or later, see this updated article.
For now, the creation of this rule is available only in the Expert mode of the pdf2Data template editor, which is actually not as difficult as it sounds.
The groupByTb rule is a flexible and powerful mechanism that can be applied in many use cases, one of which we will demonstrate here.
In the screenshot below you can see we have two different invoices:
Our goal here is to extract the invoice numbers and their related total amounts, so we will create two data fields (in the User mode): Total and InvoiceNumber.
Without the groupByTb operator you get the following output:
<elements>
<data name="InvoiceNumber">
<text>1</text>
<text>2</text>
</data>
<data name="Total">
<text>150</text>
<text>60</text>
</data>
</elements>
This, to be honest, is not that useful for further processing.
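A consumer of this output can only pair the values by position. The sketch below illustrates that with Python's standard xml.etree.ElementTree module; it is not part of pdf2Data, and the variable names (flat_xml, values, pairs) are made up for the example.

```python
import xml.etree.ElementTree as ET

# Ungrouped output, copied from the example above.
flat_xml = """<elements>
<data name="InvoiceNumber">
<text>1</text>
<text>2</text>
</data>
<data name="Total">
<text>150</text>
<text>60</text>
</data>
</elements>"""

root = ET.fromstring(flat_xml)

# Collect the values of each data field as a flat list.
values = {
    data.get("name"): [t.text for t in data.findall("text")]
    for data in root.findall("data")
}

# Pairing by index is an assumption on our side: nothing in the XML
# states that the first Total belongs to the first InvoiceNumber.
pairs = list(zip(values["InvoiceNumber"], values["Total"]))
print(pairs)
```

If one invoice had no total, the two lists would have different lengths and zip would quietly pair the wrong values or drop one.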
In our file, we can see that the word Invoice strictly separates the data field values, so we can use it as the value for our groupByTb operator.
First, though, we need to create an auxiliary data field, here named separator (we are using the Pattern selector here, but the field can be as complicated as necessary):
Then switch to Expert mode (the Expert button in the panel on the left).
For both of our fields, we need to add the line groupByTb: separator after all other instructions:
After that, in the User mode, the XML grouping selector will appear in the field's properties.
And the output will be as follows:
<elements>
<data grouped="true" name="separator">
<group>
<text>Invoice</text>
<data name="InvoiceNumber">
<text>1</text>
</data>
<data name="Total">
<text>150</text>
</data>
</group>
<group>
<text>Invoice</text>
<data name="InvoiceNumber">
<text>2</text>
</data>
<data name="Total">
<text>60</text>
</data>
</group>
</data>
</elements>
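With the grouped structure, each invoice can be read as one record. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. It assumes the output has exactly the shape shown above, and the names grouped_xml and invoices are illustrative only, not part of pdf2Data.

```python
import xml.etree.ElementTree as ET

# Grouped output, copied from the example above.
grouped_xml = """<elements>
<data grouped="true" name="separator">
<group>
<text>Invoice</text>
<data name="InvoiceNumber">
<text>1</text>
</data>
<data name="Total">
<text>150</text>
</data>
</group>
<group>
<text>Invoice</text>
<data name="InvoiceNumber">
<text>2</text>
</data>
<data name="Total">
<text>60</text>
</data>
</group>
</data>
</elements>"""

root = ET.fromstring(grouped_xml)

invoices = []
# Each <group> holds the values found under one occurrence of the separator.
for group in root.findall("./data[@grouped='true']/group"):
    # Map each nested data field name to its first extracted text value.
    record = {
        data.get("name"): data.findtext("text")
        for data in group.findall("data")
    }
    invoices.append(record)

print(invoices)
```

Because every record comes from a single group element, an invoice with a missing field simply yields a dictionary without that key, and the remaining invoices are unaffected.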