How to extract embedded streams?
I have embedded a byte array into a PDF file, more specifically an AVI file in a RichMedia annotation. Now I am trying to extract that same array. How can I do this?
Posted on StackOverflow on May 17, 2015 by Itai Soudry
I have written a brute force method to extract all streams in a PDF and store them as a file without an extension (see Extracting objects from a PDF):
    public static final String DEST = "./target/test/resources/sandbox/parse/extract_streams%s";
    public static final String SRC = "./src/test/resources/pdfs/image.pdf";

    @BeforeClass
    public static void before() {
        new File(DEST).getParentFile().mkdirs();
    }

    public static void main(String[] args) throws IOException {
        before();
        new ExtractStreams().manipulatePdf();
    }

    @Test
    public void manipulatePdf() throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
        PdfObject obj;
        List<Integer> streamLengths = new ArrayList<>();
        for (int i = 1; i <= pdfDoc.getNumberOfPdfObjects(); i++) {
            obj = pdfDoc.getPdfObject(i);
            if (obj != null && obj.isStream()) {
                byte[] b;
                try {
                    b = ((PdfStream) obj).getBytes();
                } catch (PdfException exc) {
                    b = ((PdfStream) obj).getBytes(false);
                }
                System.out.println(b.length);
                FileOutputStream fos = new FileOutputStream(String.format(DEST, i));
                fos.write(b);
                streamLengths.add(b.length);
                fos.close();
            }
        }
        Assert.assertArrayEquals(new Integer[]{30965, 74}, streamLengths.toArray(new Integer[streamLengths.size()]));
        pdfDoc.close();
    }
Note that I get all PDF objects that are streams. I also use two different methods:
When I use ((PdfStream)obj).getBytes(), iText will look at the filter. For instance: page content streams consist of PDF syntax that is compressed using /FlateDecode. By using ((PdfStream)obj).getBytes(), you will get the uncompressed PDF syntax.
Not all filters are supported in iText. Take for instance /DCTDecode, which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use ((PdfStream)obj).getBytes(false), which returns the raw bytes exactly as they are stored in the file. This is also the method you need to get your AVI bytes from your PDF.
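To make the difference between decoded and raw stream bytes concrete, here is a minimal sketch that uses only the JDK and no iText at all. It assumes that the data of a /FlateDecode stream is ordinary zlib/deflate data. The class name FlateDemo and its two helper methods are made up for this illustration: deflate() plays the role of what is stored in the file (the raw bytes), and inflate() plays the role of the decoding step.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class for illustration only; not part of iText.
public class FlateDemo {

    // Compresses bytes with zlib/deflate, similar to how a /FlateDecode stream is stored.
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses zlib/deflate data; this is the "decoding" step.
    public static byte[] inflate(byte[] raw) {
        Inflater inflater = new Inflater();
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or unsupported deflate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] content = "BT /F1 12 Tf 36 788 Td (Hello World) Tj ET"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] raw = deflate(content);     // comparable to the raw stream bytes
        byte[] decoded = inflate(raw);     // comparable to the decoded stream bytes
        System.out.println("raw length: " + raw.length);
        System.out.println("decoded: " + new String(decoded, StandardCharsets.US_ASCII));
    }
}
```

For a stream whose filter you don't want to (or can't) decode, such as a JPEG stored with /DCTDecode, only the raw bytes are useful, because they already are the file you want to save.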
This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: How to change the zoom factor in link annotations?
You loop over the page dictionaries, then loop over the /Annots array of this dictionary (if it's present), but instead of checking for /Link annotations (which is what was asked in the question I refer to), you have to check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.
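Because the brute force method stores every stream without an extension, it can help to check which extracted byte array is actually the AVI file. The following is a minimal sketch that uses only the JDK. It relies on the fact that an AVI file is a RIFF container that starts with "RIFF", followed by a four-byte size and the form type "AVI " (with a trailing space). The class name AviSignature and the method looksLikeAvi() are made up for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class for illustration only; not part of iText.
public class AviSignature {

    // Returns true if the bytes start with a RIFF header whose form type is "AVI ".
    // Bytes 0-3: "RIFF", bytes 4-7: chunk size (ignored here), bytes 8-11: "AVI ".
    public static boolean looksLikeAvi(byte[] data) {
        if (data == null || data.length < 12) {
            return false;
        }
        String riff = new String(data, 0, 4, StandardCharsets.US_ASCII);
        String form = new String(data, 8, 4, StandardCharsets.US_ASCII);
        return riff.equals("RIFF") && form.equals("AVI ");
    }

    // Usage: pass the paths of the extracted stream files as arguments.
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] data = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": " + (looksLikeAvi(data) ? "AVI" : "not AVI"));
        }
    }
}
```

If none of the extracted files passes this check, the embedded stream may have been stored with a filter, in which case the decoded bytes rather than the raw bytes are the ones to test.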