The regular expression selector is the most powerful selector in iText pdf2Data's toolbox. Unsurprisingly then, it is also the least user-friendly selector.

It implements a standard regular expression search, so the user needs to know regular expression syntax.



This selector has only one mandatory parameter, Pattern, which contains the regular expression to search for in the PDF.
A regular expression may also contain groups, defined in parentheses. In that case, only the string captured by the group is extracted.

Example
Pattern: Invoice\s+(\d{3}) returns a 3-digit number that appears after the word "Invoice". The number must be separated from "Invoice" by one or more whitespace characters.
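The effect of the group can be tried outside pdf2Data with any regular expression engine. The following is a minimal sketch using Python's re module, not pdf2Data itself; the helper name extract and the sample strings are made up for illustration, and the pdf2Data engine may differ from Python's in minor syntax details.

```python
import re

# Same pattern as in the example above; the parentheses form a capturing group.
PATTERN = re.compile(r"Invoice\s+(\d{3})")

def extract(text):
    """Return the text captured by the group, or None if the pattern is not found."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

print(extract("Invoice   042 issued on 1 May"))  # 042
print(extract("Invoice042"))                     # None: no whitespace after "Invoice"
```

Only the digits are returned, not the word "Invoice", because the group limits what is extracted.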

Most of the data you need from a PDF can be extracted without this selector; see the tutorial for example usage. However, if you are passionate about regular expressions, the regular expression selector is all you need for data extraction.

The regular expression selector becomes even more flexible and powerful in the Expert mode. 

Expert mode keyword

regExp: numberOfLines=2, selectLine=2, checkLocation

There are three additional parameters that can help you with data extraction.

  • numberOfLines specifies how many regular expressions are defined (optional, default value 1).
  • selectLine specifies the index of the line that is extracted as the final output (optional; the default value is numberOfLines, so that the match from the last line is extracted).
  • checkLocation restricts the search to the text that lies between the left and right boundaries of the selector region (optional).

Output data format: 

lines

Example (Expert mode)

regExp
Invoice\s+(\d{3})

In this case, there are no parameters. The selector uses a single pattern for matching and returns the string captured by the group (\d{3}), that is, a 3-digit number.

That number must appear in the PDF after the word "Invoice", with at least one whitespace character between "Invoice" and the extracted number.

regExp:numberOfLines=2,checkLocation 
INVOICE\s+NUMBER
(\d+)

In this case, selectLine is not specified and is equal to 2 by default. The first regular expression locates the line that matches INVOICE\s+NUMBER (here \s+ means one or more whitespace characters). The second regular expression then searches for a group of digits below this line. This group of digits is the result of the data extraction process.
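The interplay of numberOfLines and selectLine can be sketched in a few lines of Python. This is an illustration, not the pdf2Data implementation: the function emulate_regexp_selector and the sample page are hypothetical, and the sketch assumes that the patterns must match consecutive lines of text.

```python
import re

def emulate_regexp_selector(lines, patterns, select_line=None):
    """Emulate a multi-line regExp selector on a list of text lines.

    Assumption: patterns[i] must match lines[start + i], i.e. consecutive lines.
    """
    if select_line is None:
        select_line = len(patterns)  # default is numberOfLines: the last line
    compiled = [re.compile(p) for p in patterns]
    for start in range(len(lines) - len(patterns) + 1):
        matches = [rx.search(lines[start + i]) for i, rx in enumerate(compiled)]
        if all(matches):
            chosen = matches[select_line - 1]  # selectLine is treated as 1-based
            # Return the group if the pattern defines one, else the whole match.
            return chosen.group(1) if chosen.re.groups else chosen.group(0)
    return None

page = ["ACME Ltd.", "INVOICE   NUMBER", "12345", "Total due: 99.00"]
patterns = [r"INVOICE\s+NUMBER", r"(\d+)"]

print(emulate_regexp_selector(page, patterns))                 # 12345
print(emulate_regexp_selector(page, patterns, select_line=1))  # INVOICE   NUMBER
```

With selectLine left at its default, the result comes from the second pattern; setting it to 1 returns the text matched by the first pattern instead.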

The checkLocation parameter configures the pattern selector to search for the invoice number (the group of digits matched by the second regular expression) only within the left and right boundaries of the selector region.

Note: When the checkLocation option is used together with a capturing group, only the string captured by the group has to lie within the left and right boundaries.
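The boundary check described in the note can be pictured with a small Python sketch. Everything here is hypothetical: the TextLine class, the find_in_region function, and the coordinates are invented for illustration, and the text is assumed to be monospaced so that character offsets map directly to x coordinates. pdf2Data's actual geometry handling is not shown in this document.

```python
import re
from dataclasses import dataclass

@dataclass
class TextLine:
    text: str
    x_start: float     # x coordinate of the first character (hypothetical units)
    char_width: float  # assumed constant to keep the sketch short

def find_in_region(lines, pattern, left, right):
    """Return the first match whose captured text lies between left and right."""
    rx = re.compile(pattern)
    group = 1 if rx.groups else 0  # check the group if there is one, else the whole match
    for line in lines:
        for m in rx.finditer(line.text):
            x0 = line.x_start + m.start(group) * line.char_width
            x1 = line.x_start + m.end(group) * line.char_width
            if left <= x0 and x1 <= right:
                return m.group(group)
    return None

lines = [
    TextLine("Invoice 111", x_start=0, char_width=10),    # digits span x = 80..110
    TextLine("Invoice 222", x_start=200, char_width=10),  # digits span x = 280..310
]

print(find_in_region(lines, r"Invoice\s+(\d{3})", left=250, right=400))  # 222
```

In the second line the word "Invoice" starts at x = 200, outside the region, yet the match is accepted because only the captured digits are checked against the boundaries.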
