pdfOCR: If your scanned document has a mixture of sections with paragraphs and tables, what is a recommended strategy here?

This question relates to the TextPositioning property, which you can read about here.

When the content is mixed and it is unclear whether BY_WORDS or BY_LINE would work better, we recommend the BY_WORDS strategy: you can still group the words into paragraphs afterwards, and each word keeps its own boundaries.
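To illustrate why word-level positions remain useful for paragraph text, here is a minimal sketch of grouping word boxes back into lines while keeping every word's own boundaries. It does not use the pdfOCR API: `WordBox`, `WordGrouper`, and `groupIntoLines` are hypothetical names invented for this example, and the "vertical centre inside the first word of the line" rule is just one simple heuristic, assumed here for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical word box for illustration; this is not a pdfOCR class.
// Coordinates follow PDF user space, where y grows upward.
final class WordBox {
    final String text;
    final float left, bottom, right, top;

    WordBox(String text, float left, float bottom, float right, float top) {
        this.text = text;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }
}

final class WordGrouper {
    // Groups words into lines. A word joins the current line when its
    // vertical centre lies within the vertical extent of that line's first word.
    static List<List<WordBox>> groupIntoLines(List<WordBox> words) {
        List<WordBox> sorted = new ArrayList<>(words);
        // Visit words from the top of the page downward.
        sorted.sort(Comparator.comparingDouble((WordBox w) -> w.top).reversed());

        List<List<WordBox>> lines = new ArrayList<>();
        for (WordBox word : sorted) {
            float centreY = (word.bottom + word.top) / 2;
            List<WordBox> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current != null
                    && centreY >= current.get(0).bottom
                    && centreY <= current.get(0).top) {
                current.add(word);
            } else {
                List<WordBox> line = new ArrayList<>();
                line.add(word);
                lines.add(line);
            }
        }
        // Restore left-to-right reading order within each line.
        for (List<WordBox> line : lines) {
            line.sort(Comparator.comparingDouble((WordBox w) -> w.left));
        }
        return lines;
    }
}
```

Because each `WordBox` is carried through unchanged, the same word list can serve paragraph sections (after grouping) and table sections (where individual cell words matter). Consecutive lines could be merged into paragraphs in a similar way, for example by comparing the gap between lines.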

In addition, since pdfOCR 1.0.1, you can also use BY_WORDS_AND_LINES. This is similar to the BY_WORDS mode, but the top and bottom of the word bounding box are inherited from the line.
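The effect described for BY_WORDS_AND_LINES can be pictured with a small sketch: each word keeps its left and right edges but takes its top and bottom from the line. This is not pdfOCR code. `Box`, `LineAligner`, and `alignToLine` are hypothetical names, and the line box is approximated here as the union of the word boxes, which is an assumption; the library's own line box may be derived differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bounding box for illustration; this is not a pdfOCR class.
// Coordinates follow PDF user space, where y grows upward.
final class Box {
    final float left, bottom, right, top;

    Box(float left, float bottom, float right, float top) {
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }
}

final class LineAligner {
    // Returns new boxes for the words of one line: horizontal edges are kept,
    // vertical edges are replaced by those of the line (assumed here to be
    // the lowest bottom and highest top among the words).
    static List<Box> alignToLine(List<Box> wordsInLine) {
        float lineBottom = Float.POSITIVE_INFINITY;
        float lineTop = Float.NEGATIVE_INFINITY;
        for (Box word : wordsInLine) {
            lineBottom = Math.min(lineBottom, word.bottom);
            lineTop = Math.max(lineTop, word.top);
        }
        List<Box> aligned = new ArrayList<>();
        for (Box word : wordsInLine) {
            aligned.add(new Box(word.left, lineBottom, word.right, lineTop));
        }
        return aligned;
    }
}
```

With boxes aligned this way, all words on a line share one vertical extent, while the horizontal word boundaries stay available.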
