Why can't I extract text added using a Type3 font correctly from a PDF?
I have an Arabic PDF file that contains text in a Type3 font. When I extract the text using PDFBox, some characters are empty and their font is null. I want to know what the problem is.
protected void processTextPosition(TextPosition text) {
    String character = text.getCharacter(); // is empty
    String font = text.getFont().getBaseFont(); // is null
}
The stream produced with iText looks like this: ( dJ? v{d W?cG?)Tj
Why do I get the characters in this format?
The question marks appear in my stream as the control characters "SOH-STX-ETX-EOT", not as a single character. The characters inside the PDF are shown as 'd' and 'J'!
Posted on StackOverflow on Feb 9, 2014 by Ayman Younis
A Type 3 font is a user-defined font. For instance, a user can define that the character 'P' corresponds to the symbol for "The Artist Formerly Known As Prince", which is a glyph, but not a letter from any known alphabet:
The TAFKAP symbol
A glyph in a Type 3 font is just a series of lines and shapes, so a program such as iText or PDFBox has no way to determine which character was meant. Getting a question mark is to be expected.
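One way to picture this is the toy model below. It is only a sketch, not the iText or PDFBox API: the class name, the extract method and the sample codes are all made up for illustration. The string operand of Tj holds character codes, and each code merely selects a drawing procedure. Text can only come out if something maps those codes to Unicode (in a PDF that would typically be an optional ToUnicode table, which the file in the question apparently lacks).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of text extraction; NOT the iText or PDFBox API. */
public class Type3ExtractionSketch {

    /**
     * Maps each character code of a shown string to text.
     * toUnicode stands in for an optional code-to-Unicode table;
     * codes without an entry cannot be recovered and become "?".
     */
    public static String extract(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned character code
            text.append(toUnicode.getOrDefault(code, "?"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical string operand: codes 1 to 4, i.e. the control
        // characters SOH, STX, ETX and EOT.
        byte[] shown = {0x01, 0x02, 0x03, 0x04};

        // No mapping available: every code is just a drawing procedure.
        System.out.println(extract(shown, new HashMap<>()));

        // Hypothetical mapping for two of the codes
        // (U+0628 ARABIC LETTER BEH, U+0627 ARABIC LETTER ALEF).
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(0x01, "\u0628");
        mapping.put(0x02, "\u0627");
        System.out.println(extract(shown, mapping));
    }
}
```

With the empty table every code comes back as a question mark. With the partial table only the two mapped codes are recovered.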
A PDF usually contains Type 3 fonts for one of the following reasons:
- The font was used to introduce symbols that don't exist in any font.
- The font was used to obfuscate the content of the PDF so that its content can't be extracted.
- The PDF wasn't created in an elegant way.
If the Type 3 font was used for normal characters, you'll need to use OCR to convert the content to normal text.
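Before reaching for OCR, it can help to check whether a file uses Type 3 fonts at all. The sketch below is a crude heuristic using only the Java standard library, not a real PDF parser: it searches the raw bytes for a font dictionary entry of the form /Subtype /Type3. The class and method names are hypothetical, and the check will miss fonts whose dictionaries are stored inside compressed object streams, so a negative result proves nothing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

/** Crude heuristic, not a PDF parser: looks for "/Subtype /Type3" in raw bytes. */
public class Type3Probe {

    // Whitespace between the key and the name is optional in PDF syntax.
    private static final Pattern TYPE3 = Pattern.compile("/Subtype\\s*/Type3\\b");

    /** Returns true if the uncompressed part of the file declares a Type 3 font. */
    public static boolean mentionsType3(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return TYPE3.matcher(raw).find();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Type3Probe <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsType3(bytes)
                ? "Type 3 font declared; extracted text may be unreliable"
                : "No uncompressed Type 3 font declaration found");
    }
}
```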