We're releasing iText 5.5.5 on iText's 15th birthday, February 14, 2015. A look at the changelog shows that we're making progress in a couple of related areas. Recurring topics are text extraction, the compare tool and the clean-up functionality. The common denominator of these three evolutions is PDF parsing.
In text extraction, we have fixed a bug that was introduced very recently and that was reported by many different developers. This bug caused an ArrayIndexOutOfBoundsException when an empty String was encountered. We also improved the parsing of Type3 fonts and of Unicode fonts. Documents that used to result in nothing but white space are now converted to actual text. Of course, there will always be broken PDFs that can never be parsed correctly, no matter which tool you use.
As parsing improves, we can come up with better ways to compare PDFs. We have a large suite of tests that generate PDF documents and then compare the result with a reference PDF. This isn't trivial as two PDFs that are created using the same code are never identical by design (that's inherent to the PDF specification). There are different strategies one can use to compare such PDFs and we are constantly improving our compare tool by adding new strategies.
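To illustrate why the choice of strategy matters, take one CompareTool fix from the changelog: dictionaries are now compared over the union of their keys, not only over the keys of the reference (cmp) dictionary. The following is a minimal sketch of that idea using plain Java maps, not iText's actual CompareTool code. The class name DictDiff, its methods and the sample entries are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

class DictDiff {

    /** Keys whose values differ, looking only at the keys of the reference (cmp) map. */
    static List<String> diffByCmpKeys(Map<String, String> out, Map<String, String> cmp) {
        List<String> diffs = new ArrayList<>();
        for (String key : new TreeSet<>(cmp.keySet())) {
            if (!Objects.equals(out.get(key), cmp.get(key))) {
                diffs.add(key);
            }
        }
        return diffs;
    }

    /** Keys whose values differ, looking at the union of both key sets. */
    static List<String> diffByUnion(Map<String, String> out, Map<String, String> cmp) {
        TreeSet<String> keys = new TreeSet<>(cmp.keySet());
        keys.addAll(out.keySet());
        List<String> diffs = new ArrayList<>();
        for (String key : keys) {
            if (!Objects.equals(out.get(key), cmp.get(key))) {
                diffs.add(key);
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Hypothetical page dictionaries: the output has one entry the reference lacks.
        Map<String, String> cmp = new HashMap<>();
        cmp.put("/Type", "/Page");
        cmp.put("/Rotate", "0");

        Map<String, String> out = new HashMap<>(cmp);
        out.put("/Annots", "[12 0 R]");

        System.out.println("cmp keys only: " + diffByCmpKeys(out, cmp));
        System.out.println("union of keys: " + diffByUnion(out, cmp));
    }
}
```

In this sketch, the extra /Annots entry goes unnoticed when only the reference keys are visited. Walking the union of keys reports it.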
The clean-up functionality is a work in progress that was started in iText 5.5.4. At that time, the development was done in the Java version of iText only. We have now ported this functionality to C# and improved it. For instance, whereas we only allowed redaction of text in 5.5.4, we now also allow redaction of images. The work isn't finished yet: we still have several edge cases to fix and redaction of curves isn't covered yet.
XML Worker, XFA Worker and RUPS
Apart from these three major areas, we also fixed a number of issues that were reported. If you use the XMP functionality, it is important for you to upgrade to protect yourself against the XML External Entity (XXE) attacks that were reported on December 31, 2014. We also fixed some problems related to Tagged PDF and fonts.
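The XXE fix itself lives inside iText, so upgrading is the remedy. If you also parse XMP or other XML in your own code, the general defence looks roughly like the sketch below. It uses the JDK's built-in parser and is not iText's actual patch. The class name XxeGuard and its methods are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class XxeGuard {

    /** Returns a factory that refuses any DOCTYPE, so no entity can be declared at all. */
    static DocumentBuilderFactory hardenedFactory() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Parser does not support this feature", e);
        }
        factory.setExpandEntityReferences(false);
        return factory;
    }

    /** True if the hardened parser accepts the XML, false if it rejects it. */
    static boolean accepts(String xml) {
        try {
            DocumentBuilder builder = hardenedFactory().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // includes the fatal error raised when a DOCTYPE is found
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String plain = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"/>";
        // Illustrative hostile input: an external entity pointing at a local file.
        String hostile = "<!DOCTYPE x [<!ENTITY e SYSTEM \"file:///etc/passwd\">]><x>&e;</x>";
        System.out.println("plain accepted:   " + accepts(plain));
        System.out.println("hostile accepted: " + accepts(hostile));
    }
}
```

Rejecting the DOCTYPE outright is the bluntest option. It is suitable when the XML you expect never needs a DTD, which is assumed here.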
Finally, let's take a look at the other iText projects:
- We won't make a RUPS release, because no new functionality was added to RUPS.
- The new XML Worker version brings only a handful of improvements (that might be important if you use tables).
- As usual, users provided us with a new series of XFA forms with special features that we integrated into a new XFA Worker release.
That's all for iText 5.5.5. Let's start working on iText 5.5.6!
iText 5.5.5 Core
Bugs and improvements
- Text extraction. Handled a specific case where a font's charSpace width is compensated with negative character spacing and results in a 0 width, causing LocationTextExtractionStrategy.getResultantText() to assume that there's a space after every character.
- Clean-up functionality. Fixed incorrect handling of the " operator.
- Clean-up functionality. Added possibility to recover text by character widths
- Tagged PDF. Fixed NPE when modifying content of Tagged PDF document.
- Security issue. Protecting against XXE attacks.
- XML Worker. Fixed some div width calculation issues.
- XML Worker. Support for Div text-align.
- XML Worker. Arabic content was losing HTML styles after converting from HTML to PDF.
- Word hyphenation. Fixed the detection of word boundaries: digits were not taken into account, so the word "att5ention" was split into "att" and "ention", and the hyphenation event was called only for the "ention" part instead of for the whole word "att5ention".
- AcroForms. Fix regenerating AcroFields appearances for check boxes (Duplicate appearance in flattened check box).
- Barcodes. Fix to Barcode128 CodeSet parameter backwards compatibility.
- Barcodes. Edited the BarcodeQRCode constructor description to inform users that UTF-8 encoding can be used (it is not guaranteed, however, that all the decoders will decode such barcode correctly because UTF-8 is not supported by the specification).
- Barcodes. BarcodePDF417: add placeBarcode method for placing a barcode right on PdfContentByte.
- CompareTool. Generate a more verbose report on differences.
- CompareTool. New configuration: setCompareByContentErrorsLimit method for setting the maximum number of comparison errors.
- CompareTool. New setGenerateCompareByContentXmlReport method for generating an XML report on differences.
- CompareTool. Include StructTreeRoot in compareByContent.
- CompareTool. Add offsets to the item path when the comparison of strings or streams fails.
- CompareTool. Fixed false positives: dictionaries are now compared over the union of their keys, not only over the keys of the cmp dictionary.
- CompareTool. Include comparison of dictionaries in the comparison of streams.
- CompareTool. Comparison of /OCProperties entry in catalog is called in compareByContent.
- Text extraction. Support for Identity CMap in DocumentFont (metrics were not filled properly).
- CompareTool. Fixed the compare console commands for paths containing spaces.
- Clean-up functionality. Added processing for partial glyph covering.
- Clean-up functionality. Added processing for image covering.
- Clean-up functionality. Words were shifted after cleaning up.
- Clean-up functionality. Fixed incorrect storing of graphic state parameters.
- Clean-up functionality. Fixed issues with Form XObjects when removing content.
- Clean-up functionality. Fixed NullPointerException caused by incorrect image redaction: the redacted image was written to a content stream as a new image, but the old one was deleted, causing an exception when the image was used elsewhere in the same PDF.
- Clean-up functionality. Smasks were destroyed if we redacted a masked image.
- Text extraction. Introducing a fillDiffMap() method in DocumentFont, so that the functionality to extract the Differences array that is used for Type1/TrueType fonts can also be used for Type3 fonts.
- AcroForms. Apparently, there are forms where the encoding isn't stored in the font, but in the resource dictionary. The ISO specification doesn't mention that this is possible, but this commit looks for such an encoding if none is specified for the font.
- Text extraction. Support for the Identity CMap.
- Fix. When a table is added to a ColumnText object, the original table instance is altered (e.g. a fixed height of a cell is introduced). This causes problems if you first add a table in simulation mode, and then try adding the table for real.
- Fonts. When a Type 1 font is not embedded, we should not subset it. If we do, we risk adding 0 as the width of glyphs that are not used in an appearance, but that could be used in the context of a field (such as an option in a Choice field). Adding the correct Widths array will result in a file size that is substantially higher, but only in cases where we use an encoding that is different from the standard encoding.
- AcroForms. If an annotation isn't really an annotation, but for instance a field that doesn't have a widget annotation, the /P entry shouldn't appear in the field dictionary.
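The word hyphenation entry in the list above is easier to picture with a small example. The sketch below is not iText's hyphenation code. It only shows how the choice of "word characters" changes the boundaries that are found. The class name WordBoundaries and the digitsInWords flag are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class WordBoundaries {

    /** Splits text into words; digits count as word characters only if digitsInWords is true. */
    static List<String> words(String text, boolean digitsInWords) {
        List<String> result = new ArrayList<>();
        int start = -1; // index where the current word began, or -1 if outside a word
        for (int i = 0; i <= text.length(); i++) {
            boolean inWord = i < text.length() && isWordChar(text.charAt(i), digitsInWords);
            if (inWord && start < 0) {
                start = i;
            } else if (!inWord && start >= 0) {
                result.add(text.substring(start, i));
                start = -1;
            }
        }
        return result;
    }

    static boolean isWordChar(char c, boolean digitsInWords) {
        return Character.isLetter(c) || (digitsInWords && Character.isDigit(c));
    }

    public static void main(String[] args) {
        String text = "pay att5ention";
        System.out.println("letters only:       " + words(text, false));
        System.out.println("letters and digits: " + words(text, true));
    }
}
```

With letters only, "att5ention" falls apart into "att" and "ention", which mirrors the reported behavior. Once digits count as word characters, the whole word is available for hyphenation.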
Changes made by Michaël Demey
- Fonts. Updated the documentation to reflect the experimental status of support for Devanagari.
- Text extraction. Implemented an extra check to avoid parsing the empty String (causing an ArrayIndexOutOfBoundsException).
- Tagged PDF. setAccessibleAttribute() did not have the desired effect when used with PdfName.ID. A tagged PDF with ID entries must also have an IDTree.
- Tagged PDF. Provided a better solution for table summary (in PDF-UA).
- Tagged PDF. An element in Tagged PDF must be able to add the Title (PdfName.T) directly to its root. Also updated documentation for a few PdfName literals, because they were confusing: LANG vs LANGUAGE, ALT vs ALTERNATE, etc. See SUP-802 for the trigger.
- Fix. PDF/A-1: fix outputintent RGB check.
- Fonts. The GSUB table is accessed directly by a RandomAccessFileOrArray.
- Fonts. GSUB Lookup type 1 Format 2.
XML Worker 5.5.5
- Added support for shorthand border properties in tables: border-bottom, etc.
- Fixed some table width calculation issues.
- Fix for table colspan and fixed widths error.
- Support for Div text-align
- Arabic content was losing HTML styles after converting from HTML to PDF
- Added better support for run direction (RTL or LTR) in nested tables, and via a CSS property instead of a tag attribute.
XFA Worker 5.5.5
(This is a closed source project on top of iText and XML Worker.)
- Solved missing image problem (due to wrong subform positioning) and missing page problem.
- Added support for shorthand border properties in tables: border-bottom, etc.
- Fixed a problem where an embedded font was removed by XFA Worker
- Form package parsing, pre-implementation of form package handling (not yet active; research only)
- Invalid XFA color causes problem during flattening process
- Fix: auto-sized text wasn't flattened.
- Added hyphenation support.
- Fixed data binding issues: check for possibility to bind siblings before duplicating parent subform
- Support for PDF417 barcodes.
- Deal with tabulation chunks in barcodes.
- Fixed calculate script evaluation: do not update rawValue if the result is a function.
- Process Linethrough font tag attribute in fields.
- Improve 1D barcode properties and sizing support.
- Fix bug with barcode fields overflowing.
- Fixed NullPointerException in getSignatureFields() in case of flattening from XDP package.
- Fixed paragraph margins for right-to-left text elements.
- Dealing with infinite event executions in case of recursive subform instantiation.
- Enable resolving of $record entries in SomExpressions.
- Added search for base fonts if font directory is specified in XFAFontSettings.
- Fix: Flattening of signature element didn't work in case of unnamed subforms.
- White space contained in a spacerun-style element wasn't preserved correctly.
- Fixed problem with excess white space.