pdf2Data selectors
As mentioned in Basic concepts of extraction, selectors are the building blocks from which parsing rules are built. At present, there are approximately two dozen different selectors in iText pdf2Data.
Some selectors use the content of a data field's region to intelligently analyze content and define the data extraction criteria. These criteria are based on font properties (Font, Font size, Font style, Font Family), text format (Price, IBAN, VAT, Date, Time, Integer), content structure (Table), or text alignment (Align). The Pattern selector can also use the content of a region, but this isn't mandatory.
Some are completely independent of the region:
- Generic selectors either analyze the entire document content, such as Image, or can also accept a search area restricted by their predecessor: Font color, Regular expression, Paragraph.
- Picker, Line, Relative boundary are auxiliary selectors that will not work without input from another selector. They can be thought of as "second-step" selectors which refine the parsing rules.
- Page is a utility selector; it restricts the area in which the data field looks for the value to extract.
There are also a few that only extract data from within the region's boundaries: Boundary and Barcode.
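To make the idea of chaining concrete, here is a minimal conceptual sketch in Python. It is not the pdf2Data API: the `Chunk` type, the `page_selector`, `regex_selector` and `picker` helpers, and the `apply_rule` function are all hypothetical names invented for this illustration. It only models the general idea that each selector narrows the candidates passed on by its predecessor, and that a "second-step" selector has nothing to work with on its own.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Chunk:
    """A piece of text and its page number (hypothetical model, not a pdf2Data type)."""
    text: str
    page: int


# In this sketch, a selector is any function that narrows a list of candidates.
Selector = Callable[[List[Chunk]], List[Chunk]]


def page_selector(page: int) -> Selector:
    """Utility step: restrict the search area to a single page."""
    return lambda chunks: [c for c in chunks if c.page == page]


def regex_selector(pattern: str) -> Selector:
    """Generic step: keep only the chunks whose text matches the pattern."""
    compiled = re.compile(pattern)
    return lambda chunks: [c for c in chunks if compiled.search(c.text)]


def picker(index: int) -> Selector:
    """Second-step selector: picks one result out of its predecessor's output."""
    return lambda chunks: chunks[index:index + 1]


def apply_rule(chunks: List[Chunk], selectors: List[Selector]) -> List[Chunk]:
    """Apply the selectors in order; each one sees only what the previous one kept."""
    for selector in selectors:
        chunks = selector(chunks)
    return chunks


# Made-up sample content standing in for a parsed document.
document = [
    Chunk("Invoice 2024-001", 1),
    Chunk("Total: 120.50 EUR", 1),
    Chunk("Total: 99.00 EUR", 2),
]

# A parsing rule in this model is just an ordered list of selectors.
rule = [page_selector(2), regex_selector(r"Total: \d+\.\d{2}"), picker(0)]
result = apply_rule(document, rule)
```

In this toy model, `picker(0)` applied to an empty list returns an empty list, which mirrors the point that auxiliary selectors need input from another selector to produce anything.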
Expert mode
Besides the user-friendly selectors described above, iText pdf2Data also lets you improve your parsing pipeline in Expert mode. Expert mode extends the functionality of some selectors and lets you use specific commands for improving output that are not yet available in User mode.
While we hope most people never actually need to use iText pdf2Data in Expert mode, as the selectors should be good enough to meet most expectations, it can be useful for particularly tricky cases. You can of course refer to the guide on Expert mode for details on using it effectively.
List of selectors