This question relates to the TextPositioning property, which you can read about here.

When you have mixed content and it is unclear whether BY_WORDS or BY_LINES would work better, we recommend the BY_WORDS strategy: you can still group the words into paragraphs afterwards without losing the individual word boundaries.

Since pdfOCR 1.0.1, you can also use BY_WORDS_AND_LINES. It is similar to the BY_WORDS mode, except that the top and bottom of each word's bounding box are inherited from the line that contains it.
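
To make the difference concrete, here is a small conceptual sketch in Python of what "top and bottom inherited from the line" means geometrically. It is not pdfOCR code and does not call the library: the `Box` class, the `by_words_and_lines` helper, and the sample coordinates are all hypothetical, and it assumes PDF-style coordinates where `top` is larger than `bottom`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box (not a pdfOCR type)."""
    left: float
    bottom: float
    right: float
    top: float


def by_words_and_lines(word: Box, line: Box) -> Box:
    """Keep the word's horizontal extent, take the vertical extent from the line."""
    return Box(left=word.left, bottom=line.bottom, right=word.right, top=line.top)


# Made-up sample data: one line and three words of different heights.
line = Box(left=50, bottom=700, right=300, top=720)
words = [
    Box(50, 704, 90, 716),
    Box(96, 700, 150, 720),
    Box(156, 706, 200, 714),
]

adjusted = [by_words_and_lines(w, line) for w in words]
for original, result in zip(words, adjusted):
    print(original, "->", result)
```

In this sketch every adjusted box keeps its own left and right edges, so the word boundaries survive, while all boxes on the line share the same top and bottom, which gives the text layer a uniform height per line.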