pdfOCR: If your scanned document has a mixture of sections with paragraphs and tables, what is a recommended strategy here?
This question relates to the TextPositioning property, which you can read about here.
When the content is mixed and it is unclear whether BY_WORDS or BY_LINES would work better, we recommend the BY_WORDS strategy. It still lets you group words into paragraphs afterwards without losing the boundaries of the individual words.
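To illustrate why word-level positions are the safer starting point, here is a minimal sketch of merging word boxes into one enclosing line or paragraph box. It is not pdfOCR code: the WordGrouping class, its Box type, and the merge method are hypothetical names used only for this example, and the coordinates are arbitrary units with y growing upward.

```java
import java.util.List;

// Illustrative only: these types are hypothetical and not part of pdfOCR.
class WordGrouping {

    /** An axis-aligned box with its recognized text. */
    static final class Box {
        final String text;
        final float left, bottom, right, top;

        Box(String text, float left, float bottom, float right, float top) {
            this.text = text;
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }
    }

    /** Merges word boxes (assumed to be in reading order) into one enclosing box. */
    static Box merge(List<Box> words) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("no words to merge");
        }
        StringBuilder text = new StringBuilder();
        float left = Float.POSITIVE_INFINITY, bottom = Float.POSITIVE_INFINITY;
        float right = Float.NEGATIVE_INFINITY, top = Float.NEGATIVE_INFINITY;
        for (Box w : words) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(w.text);
            // Grow the enclosing box so that it covers every word.
            left = Math.min(left, w.left);
            bottom = Math.min(bottom, w.bottom);
            right = Math.max(right, w.right);
            top = Math.max(top, w.top);
        }
        return new Box(text.toString(), left, bottom, right, top);
    }
}
```

Going the other way is not possible: once only a line-level box is stored, the positions of the individual words (for example, the cells of a table row) cannot be recovered from it.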
In addition, since pdfOCR 1.0.1, you can also use BY_WORDS_AND_LINES. This is similar to the BY_WORDS mode, but the top and bottom of each word's bounding box are inherited from the line.
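The effect described for BY_WORDS_AND_LINES can be pictured with a small sketch. This is only an assumption-based illustration of the geometry, not the library's implementation: LineAlignedWords, Box, and alignToLine are hypothetical names, and y is assumed to grow upward.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these types are hypothetical and not part of pdfOCR.
class LineAlignedWords {

    /** An axis-aligned bounding box. */
    static final class Box {
        final float left, bottom, right, top;

        Box(float left, float bottom, float right, float top) {
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }
    }

    /**
     * Keeps each word's left and right edges, but takes the bottom and top
     * edges from the line the words belong to.
     */
    static List<Box> alignToLine(List<Box> words, Box line) {
        List<Box> aligned = new ArrayList<>();
        for (Box w : words) {
            aligned.add(new Box(w.left, line.bottom, w.right, line.top));
        }
        return aligned;
    }
}
```

With this kind of adjustment, all words on a line share the same vertical extent while their horizontal boundaries stay separate.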