How to use a text extraction strategy after applying a location extraction strategy?
I used the following code to extract text from a particular location in a PDF.
Rectangle rect = new Rectangle(0, 0, 250, 250);
RenderFilter filter = new RegionTextRenderFilter(rect);
FontBasedTextExtractionStrategy strategy = new FontBasedTextExtractionStrategy();
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter); // Throws Error.
I want to get the bold text present in that location. Would creating a new method or class called FontBasedTextExtractionStrategy instead of a SimpleTextExtractionStrategy help?
Posted on StackOverflow on Jul 1, 2014 by Raka
Please take a look at the ParseCustom example for iText 7. In this example, we create a custom TextRegionEventFilter (not an ITextExtractionStrategy):
// Java version
protected class CustomFontFilter extends TextRegionEventFilter {
    public CustomFontFilter(Rectangle filterRect) {
        super(filterRect);
    }

    @Override
    public boolean accept(IEventData data, EventType type) {
        if (type.equals(EventType.RENDER_TEXT)) {
            TextRenderInfo renderInfo = (TextRenderInfo) data;
            PdfFont font = renderInfo.getFont();
            if (null != font) {
                String fontName = font.getFontProgram().getFontNames().getFontName();
                return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
            }
        }
        return false;
    }
}

// C# version
protected class CustomFontFilter : TextRegionEventFilter
{
    public CustomFontFilter(Rectangle filterRect) : base(filterRect)
    {
    }

    public override bool Accept(IEventData data, EventType type)
    {
        if (type.Equals(EventType.RENDER_TEXT))
        {
            TextRenderInfo renderInfo = (TextRenderInfo) data;
            PdfFont font = renderInfo.GetFont();
            if (null != font)
            {
                string fontname = font.GetFontProgram().GetFontNames().GetFontName();
                return fontname.EndsWith("Bold") || fontname.EndsWith("Oblique");
            }
        }
        return false;
    }
}
This filter accepts only text rendered in a font whose PostScript name ends with Bold or Oblique.
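A check with endsWith() can miss names where something follows the style word, such as Arial-BoldMT. The sketch below shows one possible, more tolerant name test. It is only a heuristic and not part of iText: the class FontNameCheck and its methods are hypothetical names, and it assumes that embedded subset fonts carry a tag of six uppercase letters followed by a plus sign in front of the base name.

```java
import java.util.Locale;

class FontNameCheck {
    // Removes a subset tag such as "ABCDEF+" from the front of a font name, if present.
    static String stripSubsetTag(String name) {
        if (name.length() > 7 && name.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = name.charAt(i);
                if (c < 'A' || c > 'Z') {
                    return name; // not a subset tag, keep the name unchanged
                }
            }
            return name.substring(7);
        }
        return name;
    }

    // Heuristic (assumption): the base name mentions "bold" or "oblique" anywhere.
    static boolean looksBoldOrOblique(String postScriptName) {
        if (postScriptName == null) {
            return false;
        }
        String base = stripSubsetTag(postScriptName).toLowerCase(Locale.ROOT);
        return base.contains("bold") || base.contains("oblique");
    }

    public static void main(String[] args) {
        String[] samples = {"Helvetica-Bold", "ABCDEF+Arial-BoldMT", "BOLDXY+Helvetica", "Courier"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksBoldOrOblique(s));
        }
    }
}
```

A method like looksBoldOrOblique() could be called from CustomFontFilter.accept() in place of the two endsWith() checks. Font names remain a convention rather than a guarantee, so a font with an unusual name can still be misclassified.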
This is how you use the CustomFontFilter:
// Java version
protected void manipulatePdf(byte[] bytes) throws IOException {
    PdfDocument pdfDoc = new PdfDocument(new PdfReader(new ByteArrayInputStream(bytes)));
    Rectangle rect = new Rectangle(36, 750, 523, 56);
    CustomFontFilter fontFilter = new CustomFontFilter(rect);
    FilteredEventListener listener = new FilteredEventListener();
    // Create a text extraction renderer
    LocationTextExtractionStrategy extractionStrategy = listener
            .attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
    // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
    PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
    parser.processPageContent(pdfDoc.getFirstPage());
    // Get the resultant text after applying the custom filter
    String actualText = extractionStrategy.getResultantText();
    pdfDoc.close();
}

// C# version
public void manipulatePdf(byte[] bytes)
{
    PdfDocument pdf = new PdfDocument(new PdfReader(new MemoryStream(bytes)));
    Rectangle rect = new Rectangle(100, 100, 200, 200);
    CustomFontFilter fontFilter = new CustomFontFilter(rect);
    FilteredEventListener listener = new FilteredEventListener();
    // Create a text extraction renderer
    LocationTextExtractionStrategy strat = listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);
    // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.Reset()
    PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
    parser.ProcessPageContent(pdf.GetFirstPage());
    // Get the resultant text after applying the custom filter
    String actualText = strat.GetResultantText();
    Console.Out.WriteLine(actualText);
    pdf.Close();
}
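As you can see, we attach a LocationTextExtractionStrategy to a FilteredEventListener together with our self-made font-based filter, so the strategy only receives the text that the filter accepts. To extract the text, we call processPageContent() (ProcessPageContent() in C#) and then read the result with getResultantText().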
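The question asks for text that is both inside a region and bold. As written, the accept() override shown earlier does not call super.accept(), so it appears that the rectangle passed to the constructor is never consulted. The sketch below is a library-independent model of combining the two conditions. It does not use iText: Chunk, insideRegion(), boldOrOblique() and extract() are hypothetical names, and a chunk is reduced to a single point, whereas a real region filter would test the whole text segment against the rectangle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class CombinedFilterSketch {
    // Hypothetical stand-in for one text render event.
    static final class Chunk {
        final float x;
        final float y;
        final String fontName;
        final String text;

        Chunk(float x, float y, String fontName, String text) {
            this.x = x;
            this.y = y;
            this.fontName = fontName;
            this.text = text;
        }
    }

    // Location condition: the chunk's point lies inside the rectangle (x, y, width, height).
    static Predicate<Chunk> insideRegion(float llx, float lly, float width, float height) {
        return c -> c.x >= llx && c.x <= llx + width && c.y >= lly && c.y <= lly + height;
    }

    // Font condition: same name test as in the CustomFontFilter above.
    static Predicate<Chunk> boldOrOblique() {
        return c -> c.fontName != null
                && (c.fontName.endsWith("Bold") || c.fontName.endsWith("Oblique"));
    }

    // Joins the text of all accepted chunks with single spaces.
    static String extract(List<Chunk> chunks, Predicate<Chunk> filter) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (filter.test(c)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    static String demo() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(40, 760, "Helvetica-Bold", "Title"));  // in region, bold
        chunks.add(new Chunk(40, 780, "Helvetica", "body"));        // in region, not bold
        chunks.add(new Chunk(40, 100, "Helvetica-Bold", "Footer")); // bold, outside region
        // Both conditions must hold for a chunk to be extracted.
        Predicate<Chunk> both = insideRegion(36, 750, 523, 56).and(boldOrOblique());
        return extract(chunks, both);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "Title"
    }
}
```

In the iText code above, the equivalent change would be to require the region test as well as the font test inside accept(), for example by also consulting super.accept(data, type). Treat that as a suggestion to verify against the iText version in use.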