iText

How to use a text extraction strategy after applying a location extraction strategy?

I used the following code to extract text from a particular region of a PDF.

Java
Rectangle rect = new Rectangle(0, 0, 250, 250);
RenderFilter filter = new RegionTextRenderFilter(rect);
FontBasedTextExtractionStrategy strategy = new FontBasedTextExtractionStrategy();
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter); // Throws an error.

I want to get the bold text present in that location. Would creating a new method or class called FontBasedTextExtractionStrategy instead of a simple TextExtractionStrategy help?

Posted on StackOverflow on Jul 1, 2014 by Raka

Please take a look at the ParseCustom example for iText 7. In this example, we create a custom TextRegionEventFilter (not an ITextExtractionStrategy):

https://github.com/itext/i7js-examples/blob/develop/src/main/java/com/itextpdf/samples/sandbox/parse/ParseCustom.java

C#
        protected class CustomFontFilter : TextRegionEventFilter
        {
            public CustomFontFilter(Rectangle filterRect):base(filterRect)
            {
            }

            public override bool Accept(IEventData data, EventType type)
            {
                if (type.Equals(EventType.RENDER_TEXT))
                {
                    TextRenderInfo renderInfo = (TextRenderInfo) data;
                    PdfFont font = renderInfo.GetFont();
                    if (null != font)
                    {
                        string fontname = font.GetFontProgram().GetFontNames().GetFontName();
                        return fontname.EndsWith("Bold") || fontname.EndsWith("Oblique");
                    }
                }

                return false;
            }
        }

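This filter accepts only text inside the given rectangle whose PostScript font name ends with Bold or Oblique. All other text, and every event that is not a text render event, is rejected.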
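The suffix test used by the filter can be tried on its own, without iText. The following is only a minimal sketch, assuming the PostScript font name is already available as a String; the class FontNameCheck and the method isBoldOrOblique are hypothetical names that are not part of iText.

```java
public class FontNameCheck {
    // Hypothetical helper: repeats the suffix test from CustomFontFilter.
    public static boolean isBoldOrOblique(String fontName) {
        if (fontName == null) {
            return false;
        }
        return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
    }

    public static void main(String[] args) {
        // A subset prefix such as "ABCDEF+" sits at the start of the name,
        // so it does not influence a suffix test.
        System.out.println(isBoldOrOblique("Helvetica-Bold"));
        System.out.println(isBoldOrOblique("ABCDEF+Helvetica-BoldOblique"));
        System.out.println(isBoldOrOblique("Helvetica"));
        // Names with a trailing vendor tag do not end with "Bold".
        System.out.println(isBoldOrOblique("Arial-BoldMT"));
    }
}
```

A name such as Arial-BoldMT does not end with Bold, so a suffix test misses it. If your documents use such fonts, a contains("Bold") test may be a better fit, at the cost of more false positives.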

This is how you use this filter:

Java
protected void manipulatePdf(byte[] bytes) throws IOException {
    PdfDocument pdfDoc = new PdfDocument(new PdfReader(new ByteArrayInputStream(bytes)));

    Rectangle rect = new Rectangle(36, 750, 523, 56);
    CustomFontFilter fontFilter = new CustomFontFilter(rect);
    FilteredEventListener listener = new FilteredEventListener();

    // Create a text extraction renderer
    LocationTextExtractionStrategy extractionStrategy = listener
            .attachEventListener(new LocationTextExtractionStrategy(), fontFilter);

    // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
    PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
    parser.processPageContent(pdfDoc.getFirstPage());

    // Get the resultant text after applying the custom filter
    String actualText = extractionStrategy.getResultantText();

    pdfDoc.close();
}

C#
public void manipulatePdf(byte[] bytes)
{
    PdfDocument pdf = new PdfDocument(new PdfReader(new MemoryStream(bytes)));
    Rectangle rect = new Rectangle(100, 100, 200, 200);
    CustomFontFilter fontFilter = new CustomFontFilter(rect);
    FilteredEventListener listener = new FilteredEventListener();

    // Create a text extraction renderer
    LocationTextExtractionStrategy strat = listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);

    // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.Reset()
    PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
    parser.ProcessPageContent(pdf.GetFirstPage());

    // Get the resultant text after applying the custom filter
    String actualText = strat.GetResultantText();
    Console.Out.WriteLine(actualText);
    pdf.Close();
}

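As you can see, we attach a LocationTextExtractionStrategy to a FilteredEventListener together with our self-made font filter, so the strategy only receives text that passes the filter. To extract the text, we call processPageContent() on a PdfCanvasProcessor and then read the result with getResultantText().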
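If the result is unexpectedly empty, it can help to check which part of the page the rectangle actually covers. In iText 7, Rectangle takes x, y, width and height, so Rectangle(36, 750, 523, 56) from the Java example runs from (36, 750) to (559, 806). The sketch below redoes that arithmetic in plain Java. It assumes a horizontal baseline and assumes that the region filter accepts a chunk when its baseline touches the rectangle; that is our reading of TextRegionEventFilter, not something the example states. RegionCheck and its methods are hypothetical names.

```java
public class RegionCheck {
    // Same numbers as the Java example: x, y, width, height in points.
    static final float X = 36, Y = 750, WIDTH = 523, HEIGHT = 56;

    // Right edge of the region.
    public static float right() {
        return X + WIDTH;
    }

    // Top edge of the region.
    public static float top() {
        return Y + HEIGHT;
    }

    // Assumption: the baseline is horizontal and runs from (x1, y) to (x2, y), x1 <= x2.
    public static boolean baselineInRegion(float x1, float x2, float y) {
        boolean yInside = y >= Y && y <= top();
        boolean xOverlaps = x2 >= X && x1 <= right();
        return yInside && xOverlaps;
    }

    public static void main(String[] args) {
        System.out.println(right() + ", " + top());
        System.out.println(baselineInRegion(50, 200, 780)); // near the top of the page
        System.out.println(baselineInRegion(50, 200, 400)); // middle of the page
    }
}
```

Note that in iText 5, com.itextpdf.text.Rectangle takes the lower-left and upper-right corners instead (llx, lly, urx, ury), so the same four numbers describe a different region when code is ported between the two versions.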
