How to extract a page number from a PDF file?
In iText we tried PdfPageLabels.getPageLabels(reader)
but the behavior of this method is not uniform.
Posted on StackOverflow on Oct 31, 2014 by abhinav sharma
The reason why you don't find any software that is able to extract page numbers from a PDF is simple: the concept of a page number doesn't exist in PDF.
Allow me to predict your response.
"Wait a minute!" you say, "When I open a PDF in Adobe Reader, I can clearly see a page number in the document!"
Yes, you can see that page number with your eyes and your human intelligence, but to a machine that number is just some text drawn on a canvas. A machine consuming the document has no idea what all the glyphs, lines, and shapes on a page are about. Hence, software cannot give you the page number you see as a human. A machine doesn't know where to look!
If you know something about PDF, I can predict your next reply.
"Wait a minute!" you say, "What about Tagged PDF? Doesn't Tagged PDF mean that the semantics of a document are stored along with the representation?"
Yes, when a PDF is tagged, a snippet of text knows that it is part of a title, a paragraph, a list, and so on. But Tagged PDF is there to define the structure of the real content. Page numbers, however, are not part of the real content. They are marked as artifacts, along with headers, footers, and other items on a page that are not considered real content. There is no way to distinguish page numbers from those other artifacts.
"Then what are these page labels about?" you ask.
Well, page labels are optional. They are present in some PDFs that are well conceived, but they will be absent in a large majority of the PDFs you'll find in the wild.
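To make the idea of page labels a bit more concrete, here is a minimal sketch in plain Java of how labels are derived when they are present, and what is left when they are not. It follows the labeling ranges described in the PDF specification, where each range has a numbering style (D, R, r, A, a), an optional prefix, and a start value. The class PageLabelSketch, the Range type, and all method names are hypothetical names made up for this illustration. This is not iText's API, and it does not read a PDF. It only shows the bookkeeping, assuming the ranges are already sorted and the first one starts at page index 0.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names come from iText.
final class PageLabelSketch {

    /** One labeling range: applies from startPage until the next range begins. */
    static final class Range {
        final int startPage;   // 0-based index of the first page in the range
        final char style;      // 'D', 'R', 'r', 'A', 'a'; anything else means "no numeric part"
        final String prefix;   // text placed before the numeric part, may be empty
        final int firstNumber; // numeric value for the first page of the range (1 or higher)

        Range(int startPage, char style, String prefix, int firstNumber) {
            this.startPage = startPage;
            this.style = style;
            this.prefix = prefix;
            this.firstNumber = firstNumber;
        }
    }

    /** Returns one label per page, or null when the document defines no page labels. */
    static String[] labels(int pageCount, List<Range> ranges) {
        if (ranges == null || ranges.isEmpty()) {
            return null; // nothing to derive: the labels are simply absent
        }
        String[] out = new String[pageCount];
        for (int i = 0; i < ranges.size(); i++) {
            Range r = ranges.get(i);
            // A range ends where the next one starts, or at the last page.
            int end = (i + 1 < ranges.size()) ? ranges.get(i + 1).startPage : pageCount;
            for (int p = r.startPage; p < end && p < pageCount; p++) {
                out[p] = r.prefix + format(r.style, r.firstNumber + (p - r.startPage));
            }
        }
        return out;
    }

    /** Falls back to the physical page position (1-based) when no label exists. */
    static String labelOrPhysical(String[] labels, int pageIndex) {
        if (labels != null && labels[pageIndex] != null) {
            return labels[pageIndex];
        }
        return Integer.toString(pageIndex + 1);
    }

    static String format(char style, int n) {
        switch (style) {
            case 'D': return Integer.toString(n);
            case 'R': return roman(n);
            case 'r': return roman(n).toLowerCase(Locale.ROOT);
            case 'A': return letters(n);
            case 'a': return letters(n).toLowerCase(Locale.ROOT);
            default:  return "";
        }
    }

    static String roman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    /** A..Z for 1..26, then AA..ZZ for 27..52, and so on. */
    static String letters(int n) {
        char c = (char) ('A' + (n - 1) % 26);
        int repeat = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeat; i++) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With a lowercase roman range starting at the first page and a decimal range starting at the fourth, a six-page document gets the labels i, ii, iii, 1, 2, 3. With no ranges at all, labels(...) returns null, and the only number a program can honestly report is the physical position of the page in the file. As far as I know, that null result is also what you get from the PdfPageLabels.getPageLabels(reader) call in the question when a PDF has no page labels, which may explain why its behavior seems not to be uniform. Note that even when labels are present, nothing guarantees that they match the number printed on the page.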
That was the long answer. The short answer is simple: you are asking for something that is impossible in general, not only with iText, Tika, PDFBox, or any other tool you might try.