How to use a text extraction strategy after applying a location extraction strategy?

I used the following code to extract text from a particular location in a PDF.

Rectangle rect = new Rectangle(0, 0, 250, 250);
RenderFilter filter = new RegionTextRenderFilter(rect);
fontBasedTextExtractionStrategy strategy = new fontBasedTextExtractionStrategy();
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter); // Throws an error.

I want to get the bold text present in that location. Would creating a new method or class called FontBasedTextExtractionStrategy instead of a simple TextExtractionStrategy help?

Posted on StackOverflow on Jul 1, 2014 by Raka

Please take a look at the ParseCustom example for iText 7. In this example, we don't write a custom ITextExtractionStrategy; instead, we create a custom filter by extending TextRegionEventFilter:

JAVA
protected class CustomFontFilter extends TextRegionEventFilter {
        public CustomFontFilter(Rectangle filterRect) {
            super(filterRect);
        }
 
        @Override
        public boolean accept(IEventData data, EventType type) {
            if (type.equals(EventType.RENDER_TEXT)) {
                TextRenderInfo renderInfo = (TextRenderInfo) data;
                PdfFont font = renderInfo.getFont();
                if (null != font) {
                    String fontName = font.getFontProgram().getFontNames().getFontName();
                    return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
                }
            }
 
            return false;
        }
    }
C#
        protected class CustomFontFilter : TextRegionEventFilter
        {
            public CustomFontFilter(Rectangle filterRect):base(filterRect)
            {
            }

            public override bool Accept(IEventData data, EventType type)
            {
                if (type.Equals(EventType.RENDER_TEXT))
                {
                    TextRenderInfo renderInfo = (TextRenderInfo) data;
                    PdfFont font = renderInfo.GetFont();
                    if (null != font)
                    {
                        string fontname = font.GetFontProgram().GetFontNames().GetFontName();
                        return fontname.EndsWith("Bold") || fontname.EndsWith("Oblique");
                    }
                }

                return false;
            }
        }

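This filter accepts only text whose PostScript font name ends with Bold or Oblique. Note that accept() as written never consults the rectangle passed to the constructor; to also restrict the result to that region, combine the font check with super.accept(data, type) (base.Accept(data, type) in C#).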
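The endsWith test is strict. A name such as Arial-BoldMT carries the style word in the middle, and subset fonts are prefixed with a six-letter tag such as ABCDEF+. The following is a minimal sketch of a more tolerant check in plain Java, with no iText dependency. The class FontStyleMatcher and the method looksBoldOrOblique are hypothetical names, and also treating "Italic" as a match is an assumption that goes beyond the original filter. It is a heuristic: a font can look bold without saying so in its name.

```java
import java.util.Locale;

public class FontStyleMatcher {

    // Hypothetical helper, not part of iText: guesses from a PostScript
    // font name whether the font is bold or slanted.
    public static boolean looksBoldOrOblique(String postScriptName) {
        if (postScriptName == null) {
            return false;
        }
        String name = postScriptName;
        // Drop a subset tag such as "ABCDEF+" (six uppercase letters and a plus sign),
        // so that the tag itself can never produce a match.
        if (name.matches("[A-Z]{6}\\+.*")) {
            name = name.substring(7);
        }
        String lower = name.toLowerCase(Locale.ROOT);
        // Assumption: "italic" is accepted in addition to the original "Bold"/"Oblique".
        return lower.contains("bold") || lower.contains("oblique") || lower.contains("italic");
    }

    public static void main(String[] args) {
        System.out.println(looksBoldOrOblique("Helvetica-Bold"));      // prints "true"
        System.out.println(looksBoldOrOblique("ABCDEF+Arial-BoldMT")); // prints "true"
        System.out.println(looksBoldOrOblique("Helvetica"));           // prints "false"
    }
}
```

Inside accept(), such a helper could replace the two endsWith calls, with the font name obtained exactly as in the filter above.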

This is how you use this filter:

JAVA
protected void manipulatePdf(byte[] bytes) throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(new ByteArrayInputStream(bytes)));
 
        Rectangle rect = new Rectangle(36, 750, 523, 56);
        CustomFontFilter fontFilter = new CustomFontFilter(rect);
        FilteredEventListener listener = new FilteredEventListener();
 
        // Create a text extraction renderer
        LocationTextExtractionStrategy extractionStrategy = listener
                .attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
 
        // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
        PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
        parser.processPageContent(pdfDoc.getFirstPage());
 
        // Get the resultant text after applying the custom filter
        String actualText = extractionStrategy.getResultantText();
 
        pdfDoc.close();
}
C#
        public void manipulatePdf(byte[] bytes)
        {
            PdfDocument pdf = new PdfDocument(new PdfReader(new MemoryStream(bytes)));
            Rectangle rect = new Rectangle(100, 100, 200, 200);
            CustomFontFilter fontFilter = new CustomFontFilter(rect);
            FilteredEventListener listener = new FilteredEventListener();

            // Create a text extraction renderer
            LocationTextExtractionStrategy strat = listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);

            // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.Reset()
            PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
            parser.ProcessPageContent(pdf.GetFirstPage());

            // Get the resultant text after applying the custom filter
            String actualText = strat.GetResultantText();
            Console.Out.WriteLine(actualText);
            pdf.Close();
        }

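As you can see, we attach a LocationTextExtractionStrategy to a FilteredEventListener together with our custom font-based filter, so the strategy only receives the text events that the filter accepts. To extract the text, we pass the page to processPageContent() and then read the result with getResultantText().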
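To make the flow of events easier to picture, here is a simplified model of the filter-then-delegate pattern in plain Java. It is only a sketch and not iText's implementation: TextChunk, CollectingStrategy, FilteredListener and extract are hypothetical stand-ins for the render event data, the extraction strategy, FilteredEventListener and the parsing step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class FilteredListenerSketch {

    // Hypothetical stand-in for a text render event: the text and its font name.
    public static class TextChunk {
        final String text;
        final String fontName;

        public TextChunk(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    // Stand-in for an extraction strategy: collects every chunk it receives.
    static class CollectingStrategy {
        private final StringBuilder result = new StringBuilder();

        void eventOccurred(TextChunk chunk) {
            result.append(chunk.text);
        }

        String getResultantText() {
            return result.toString();
        }
    }

    // Stand-in for a filtered listener: forwards a chunk to the strategy
    // only if the filter accepts it.
    static class FilteredListener {
        private final CollectingStrategy delegate;
        private final Predicate<TextChunk> filter;

        FilteredListener(CollectingStrategy delegate, Predicate<TextChunk> filter) {
            this.delegate = delegate;
            this.filter = filter;
        }

        void eventOccurred(TextChunk chunk) {
            if (filter.test(chunk)) {
                delegate.eventOccurred(chunk);
            }
        }
    }

    // Plays the role of the parser: feeds every chunk on the "page" to the listener.
    public static String extract(List<TextChunk> page, Predicate<TextChunk> filter) {
        CollectingStrategy strategy = new CollectingStrategy();
        FilteredListener listener = new FilteredListener(strategy, filter);
        for (TextChunk chunk : page) {
            listener.eventOccurred(chunk);
        }
        return strategy.getResultantText();
    }

    public static void main(String[] args) {
        List<TextChunk> page = Arrays.asList(
                new TextChunk("Title", "Helvetica-Bold"),
                new TextChunk("body text", "Helvetica"));
        // Keep only chunks whose font name ends with "Bold".
        System.out.println(extract(page, c -> c.fontName.endsWith("Bold"))); // prints "Title"
    }
}
```

The strategy itself stays unaware of fonts or regions; all selection logic lives in the filter, which is why the same LocationTextExtractionStrategy can be reused with different filters.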

Click this link if you want to see how to answer this question in iText 5.
