Legacy notice!

iText 5 is the previous major version of iText's leading PDF SDK. iText 5 has reached end of life (EOL) and is no longer developed. Switch your project to iText 7 to benefit from the latest developments.

I'm facing a problem when trying to export a Vietnamese document as PDF using iText.

I put the Vietnamese words in an .xml file like this:

<td fontfamily="Helvetica" fontstyle="0" fontsize="9" align="0" colspan="48" lineoccupied="1">
    T\u1ED5 ch\u1EE9c tham gia
</td>

I convert this into Unicode and export the String to PDF using UTF-8 encoding, but the program fails to display the Vietnamese characters '\u1ED5' and '\u1EE9', and the output becomes "T chc tham gia".

Posted on StackOverflow on Feb 28, 2014 by NTLC

There are several XML Worker examples involving Asian languages on the official iText web site. They parse an XHTML file containing Chinese characters, but it should be easy to adapt them to Vietnamese examples.

You can find the HTML files we're going to parse here:

Both files contain the following text:

?? (Broken Sword), ???? (Flying Snow), ?? (Moon), ?? (the King), and ?? (Sky).

In the first case, a font is defined using CSS:

??

In the second case, no specific font is defined:

?? (Broken Sword), ???? (Flying Snow), ?? (Moon), ?? (the King), and ?? (Sky).

These files contain UTF-8 characters, so we're going to parse them like this:

XMLWorkerHelper.getInstance().parseXHtml(writer, document,
            new FileInputStream(HTML), Charset.forName("UTF-8"));
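
The charset argument matters because the same bytes produce different text under different encodings. Here is a minimal sketch that does not use iText at all (the class and method names are illustrative, not part of XML Worker). It decodes the UTF-8 bytes of the Vietnamese phrase from the question once with the right charset and once with the wrong one:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Decodes raw bytes into text using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "T\u1ED5 ch\u1EE9c tham gia";
        // The bytes as they would be stored in a UTF-8 encoded file.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // prints "true"
        System.out.println(wrong.equals(original)); // prints "false"
    }
}
```

Decoding with the wrong charset turns each multi-byte character into several unrelated characters, which is why the charset is passed explicitly when parsing.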

The first thing you need is a font that supports Vietnamese characters. That's something iText can't help you with. In your XML file, you've defined Helvetica, but that's a standard Type 1 font that is never embedded when using iText and that doesn't know how to draw Vietnamese glyphs. That's never going to work.
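
To see why the output in the question becomes "T chc tham gia", here is a plain-Java sketch that assumes Helvetica is used with the WinAnsi (Cp1252) encoding and that characters outside that encoding are silently dropped. It illustrates the effect only and is not iText's actual code path; the class and method names are made up for this example:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class WinAnsiCheck {

    // Keeps only the characters that the given charset can encode,
    // mimicking a font encoding that silently drops everything else.
    static String keepEncodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (encoder.canEncode(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: WinAnsi corresponds to the windows-1252 charset.
        Charset winAnsi = Charset.forName("windows-1252");
        String text = "T\u1ED5 ch\u1EE9c tham gia";
        System.out.println(keepEncodable(text, winAnsi)); // prints "T chc tham gia"
    }
}
```

The two characters '\u1ED5' and '\u1EE9' have no slot in that encoding, so no amount of UTF-8 handling on the input side can bring them back; a different font and encoding are needed.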

The first example D07_ParseHtmlAsian will automatically search for a font named MS Mincho. If it finds that font (for instance because you have msmincho.ttc in your Windows fonts directory), the font will show up in your PDF. See hero.pdf. If it doesn't find a font with that name, then the glyphs won't be visible, because you didn't provide any font program for those glyphs.
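
If you want to know up front whether a font file such as msmincho.ttc is present, a small check like the following may help. It is a sketch in plain Java, not something the iText examples do themselves; the class name and the candidate directories are assumptions you should adapt to your system:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class FontFileLocator {

    // Returns the first regular file with the given name found in the candidate directories.
    static Optional<Path> find(String fileName, List<Path> directories) {
        for (Path dir : directories) {
            Path candidate = dir.resolve(fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical candidate directories; adjust them for your environment.
        List<Path> dirs = Arrays.asList(
                Paths.get("C:/Windows/Fonts"),
                Paths.get("resources/fonts"));
        Optional<Path> font = find("msmincho.ttc", dirs);
        System.out.println(font.isPresent()
                ? "Found: " + font.get()
                : "Not found, register a substitute font");
    }
}
```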

The second example D07bis_ParseHtmlAsian offers a workaround in case you don't have MS Mincho anywhere. In that case, you have to use an XMLWorkerFontProvider and register a font that can be used instead of MS Mincho. For instance, we use a font stored in the file cfmingeb.ttf and assign it the alias MS Mincho:

XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.register("resources/fonts/cfmingeb.ttf", "MS Mincho");

The resulting file asian.pdf is slightly different from what we expect, but now we can at least see the Chinese glyphs.

In the third example, D07tris_ParseHtmlAsian, the HTML file doesn't tell us anything about the font that needs to be used. We'll define the font using CSS like this:

CSSResolver cssResolver = new StyleAttrCSSResolver();
CssFile cssFile = XMLWorkerHelper.getCSS(
    new ByteArrayInputStream("body {font-family:tsc fming s tt}".getBytes()));
cssResolver.addCss(cssFile);

Now, all the text in the body will use the font TSC FMing S TT (stored in the file cfmingeb.ttf). You can see the difference in the resulting PDF asian2.pdf.