The table selector handles table recognition and extraction. It covers all table extraction needs on its own, so you don't need to add any other selectors to the parsing flow for it to work.
Automatic table mode
In automatic mode, the table selector detects the table structure on its own, as long as a portion of the table lies inside the data field's region.
In automatic mode, three properties let you filter the extracted data:
- Select row
Defines the rows to be extracted based on their indices from top to bottom. Indices start at 1. If the table has a header row, it is indexed as well.
You can select a single number or a range. A negative number specifies a backwards index, where -1 is the last row of the table.
For instance, the 2:-2 range means that all table rows except the first and the last will be extracted.
- Select column
Defines the columns to be extracted based on their indices from left to right. Indices start at 1.
You can select a single column using its index or name (the same as its header), as well as a continuous range.
A negative number specifies a backwards index, where -1 is the rightmost column of the table.
For instance, the 2:-2 range means that all table columns except the leftmost and the rightmost will be extracted.
- Column for table building
Sometimes, when not all table rows have horizontal borders or there are empty cells present, rows may be detected inaccurately.
In this case, the selector uses the "column for table building" as the main reference for determining the table rows.
For this parameter, use the number of a column in which every cell is always filled.
In the case of invoices, for example, this can be the "Total" column.
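The index rules for "Select row" and "Select column" can be illustrated with a short Python sketch. This is not the selector's own implementation. The function names resolve_index and resolve_range are hypothetical, and the sketch relies only on what is stated above: indices start at 1, negative indices count from the end, and both ends of a range are included.

```python
from typing import List


def resolve_index(index: int, count: int) -> int:
    """Convert a 1-based or negative index into a 1-based position."""
    if index == 0 or abs(index) > count:
        raise ValueError(f"index {index} is out of range for {count} items")
    return index if index > 0 else count + index + 1


def resolve_range(spec: str, count: int) -> List[int]:
    """Expand a spec such as "3", "-1" or "2:-2" into 1-based positions."""
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
        start = resolve_index(int(start_text), count)
        end = resolve_index(int(end_text), count)
    else:
        start = end = resolve_index(int(spec), count)
    # Both ends are included, so "2:-2" drops only the first and last items.
    return list(range(start, end + 1))


# Hypothetical table of five rows, including the header row.
rows = ["header", "item A", "item B", "item C", "total"]
selected = [rows[i - 1] for i in resolve_range("2:-2", len(rows))]
print(selected)  # ['item A', 'item B', 'item C']
```

The same logic applies to columns, with positions counted from left to right instead of top to bottom.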
Advanced table mode
To extract tables with a complex structure, for example when table headers aren't static, it can be useful to switch to the advanced table mode by deselecting the "Automatic headers" option.
This mode allows you to set up headers for the columns you want.
This can be done semi-automatically, by selecting the document area that contains the headers, or fully manually, by typing the desired column names in the "Headers" parameter.
Headers should be typed one below the other, starting from the leftmost. If a header consists of two or more lines, these lines must be concatenated into one, separated by spaces.
You can also specify headers using regular expressions by clicking the icon in the top right corner of the Headers area.
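As an illustration of the two header rules (joining multi-line headers with spaces, and matching headers with regular expressions), here is a minimal Python sketch. The helpers normalize_header and find_column are hypothetical and are not part of the product. Whether the selector matches whole headers or ignores case is not stated here, so both behaviors are assumptions of this sketch.

```python
import re


def normalize_header(lines):
    """Join the lines of a multi-line header into one space-separated string."""
    return " ".join(part.strip() for part in lines if part.strip())


def find_column(header_pattern, detected_headers):
    """Return the 0-based position of the first header matching the pattern."""
    # Assumption: the pattern must match the whole header, ignoring case.
    regex = re.compile(header_pattern, re.IGNORECASE)
    for position, header in enumerate(detected_headers):
        if regex.fullmatch(header):
            return position
    return None


# Hypothetical headers as they appear in a document, one list of lines each.
detected = [
    normalize_header(["Item"]),
    normalize_header(["Unit", "price"]),
    normalize_header(["Total,", "EUR"]),
]
print(detected)                               # ['Item', 'Unit price', 'Total, EUR']
print(find_column(r"unit\s+price", detected))  # 1
print(find_column(r"Total.*", detected))       # 2
```

A pattern such as Total.* keeps working when the currency suffix of the header changes from one document to another, which is the typical reason to prefer a regular expression over a fixed header name.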
If a table spans more than one page, the selection algorithm also selects it as a single table, as long as all table columns have the same width on subsequent pages.
The repeated header and footer (if any) are filtered out from the final results, so that only the first header and the last footer are retained.
The multipage selection algorithm also detects and ignores any page headers or footers.
For better results in the case of multipage tables, we recommend using advanced table mode and specifying all table headers explicitly.
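The effect of filtering repeated headers and footers can be sketched in Python as follows. This only illustrates the described result and is not the selection algorithm itself. The function merge_pages is hypothetical, and it assumes that each page has already been turned into a list of rows and that repeated headers and footers are exact copies of the given header and footer rows.

```python
def merge_pages(pages, header, footer=None):
    """Merge per-page row lists, keeping one header at the top and one footer at the end."""
    body = []
    saw_header = False
    saw_footer = False
    for page_rows in pages:
        for row in page_rows:
            if row == header:
                saw_header = True   # header copies are dropped from the body
            elif footer is not None and row == footer:
                saw_footer = True   # footer copies are dropped from the body
            else:
                body.append(row)
    merged = ([header] if saw_header else []) + body
    if saw_footer:
        merged.append(footer)
    return merged


# Hypothetical two-page table with a repeated header and footer.
header = ["Item", "Total"]
footer = ["Prices in EUR", ""]
pages = [
    [header, ["A", "10"], ["B", "20"], footer],
    [header, ["C", "30"], footer],
]
for row in merge_pages(pages, header, footer):
    print(row)
```

With the sample data above, the merged table contains the header once, the three item rows in page order, and the footer once at the end.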
Output data format:
List of selectors