Creating Well-Tagged PDF Documents with pdfHTML

The PDF Association recently published its new Well-Tagged PDF (WTPDF) specification, which aims to define how to represent reusable and accessible electronic documents in PDF 2.0 files across a wide spectrum of possible use-cases. The concept of Tagged PDF was introduced with the PDF 1.4 specification, and WTPDF builds on that concept by utilizing PDF 2.0-specific features.

WTPDF also forms the basis of the PDF/A-4 and PDF/UA-2 standards for archiving and universal accessibility, and you can find more information on these standards and their relationship to WTPDF and PDF 2.0 in our dedicated article on the iText site.

As participants in the PDF Association and ISO PDF committee working groups that develop these standards, Apryse has contributed an example document to demonstrate the iText PDF SDK’s conformance to the WTPDF specification. You can find this example PDF on the PDF Association’s WTPDF page. As a bonus, our example also conforms to the PDF/A-4 and PDF/UA-2 standards, and was generated from an HTML template using the pdfHTML add-on for iText Core.

To find out how the example document was created, read on.

Why Convert From HTML?

Two of the recent additions to iText Core’s arsenal of supported PDF standards were PDF/A-4 (ISO 19005-4) in version 8.0.2 and PDF/UA-2 (ISO 14289-2) in version 8.0.3. This means you can create such documents from scratch just using iText’s high-level document creation APIs. Indeed, we have up-to-date code examples demonstrating the use of the PdfADocument and PdfUADocument sub-classes of PdfDocument in our Java and .NET Sandbox repositories on GitHub.

However, as noted in Chapter 3 and Chapter 4 of our pdfHTML tutorial, you can make your life easier when creating PDF/A and PDF/UA documents by using HTML/XML templates. pdfHTML reuses the semantic and structural information in your HTML/XML and CSS and maps it to the appropriate iText objects and styles, which can save a great deal of time otherwise spent manually tagging PDF content to achieve conformance with the required standards.
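
To make the mapping idea concrete, here is a minimal Java sketch of the kind of correspondence involved between HTML elements and PDF standard structure types. It is not pdfHTML's internal mapping table: the class name HtmlRoleSketch, the selection of elements, and the Div fallback are all assumptions made purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: NOT pdfHTML's internal mapping table. */
final class HtmlRoleSketch {

    // HTML element name -> PDF standard structure type
    static final Map<String, String> ROLES = new LinkedHashMap<>();

    static {
        ROLES.put("h1", "H1");
        ROLES.put("h2", "H2");
        ROLES.put("h3", "H3");
        ROLES.put("p", "P");
        ROLES.put("ul", "L");
        ROLES.put("li", "LI");
        ROLES.put("table", "Table");
        ROLES.put("tr", "TR");
        ROLES.put("td", "TD");
        ROLES.put("img", "Figure");
        ROLES.put("a", "Link");
    }

    /**
     * Returns the structure type for an HTML element name.
     * Unknown elements fall back to "Div" (an assumption of this sketch).
     */
    static String roleFor(String htmlTag) {
        return ROLES.getOrDefault(htmlTag.toLowerCase(Locale.ROOT), "Div");
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"h1", "p", "ul", "img", "section"}) {
            System.out.println(tag + " -> " + roleFor(tag));
        }
    }
}
```

With hand-written layout code, each of these roles would have to be assigned element by element; starting from semantic HTML means the structure is already stated in the source.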

In our article https://itextpdf.com/blog/technical-notes/easier-pdfa-pdfhtml, we explained in detail the advantages of pdfHTML for creating archivable PDF/A-3B documents. Many of the concepts discussed there are applicable to generating documents conforming to other PDF/A conformance levels, Tagged PDF and PDF/UA-1.

For WTPDF and its related PDF/A-4 and PDF/UA-2 standards, achieving conformant output is a little different since these are purely PDF 2.0 standards, meaning we can take advantage of PDF 2.0-specific features and upgrades to Tagged PDF. As noted by the PDF Association in their article, the re-development of the Tagged PDF section was one of the most significant upgrades in PDF 2.0.

Below, you can find the code and associated resources we used to create our WTPDF example document. As mentioned earlier, the generated PDF also conforms to both PDF/A-4 and PDF/UA-2. Feel free to adapt and experiment with the code for your own purposes.

Generate a WTPDF File with pdfHTML

We’ll first need some suitable HTML to use as a template, so what could be more appropriate than the PDF Association’s own article on their efforts in advancing accessibility?

Here’s how it looks on the web:

Source HTML

HTML

<html lang="en-US">
<head>
    <meta charset="UTF-8">
    <title>The PDF Association’s work to advance accessibility – PDF Association</title>
    <meta content="width=device-width, initial-scale=1" name="viewport">
    <style>
        p {
            font-size: 15px!important;
        }
        a {
            font-size: 15px;
            color: #d03f4e;
        }
        li{
            font-size: 15px;
        }
        li::marker{
            color: #d03f4e;
        }

    </style>
</head>

<!-- Feature limit: we don't support ::marker -->
<body>
<div style="margin-left: 60px;margin-right: 60px">
    <table style="width: 100%">
        <tr>
            <td>
                <h1 style="margin-top:0px;font-size: x-large">The PDF Association’s work to advance accessibility</h1>
            </td>
            <td>
                <img alt="A word-cloud of terms related to assistive technology." height="100"
                     src="assistive-tech-word-cloud-300x150.png">
            </td>
        </tr>
    </table>
    <p style="font-size: small">
        A major focus of the PDF Association’s work is to increase awareness and adoption of
        standards and best practices for accessibility.
    </p>
    <hr style="margin-bottom:15px;margin-top:15px;">
    <p>
        <strong style="font-weight: bolder">About the author:</strong> The PDF Association staff delivers a
        vendor-neutral platform for PDF’s stakeholders, facilitating the
        development of open specifications and ISO standards for PDF technology.
        Staff members include: Alexandra Oettler&nbsp;(Editor), Betsy Fanning ...
        <a aria-label="Read more about PDF Association staff" class="read-more"
           href="https://pdfa.org/people/pdf-association/"
           title="PDF Association staff">Read more</a>
    </p>


    <table>
        <tr>
            <td>
                <img alt="PDF Association staff"
                     decoding="async"
                     src="PDF-Association-Logo-sq-150x150.png"
                     style="  object-fit: cover;width:120px;height:120px;border-radius: 60px 60px 60px 60px;-webkit-box-shadow: 1px -1px 6px 0px #aaa;box-shadow:  1px -1px 6px 0px #aaa;">
            </td>
            <td>
                <div style="width: 300px"></div>
            </td>
            <td>
                <p style="width: 100%;text-align: right">
                    January 19, 2024
                </p>
            </td>
        </tr>
    </table>
    <hr>
</div>


<div  style="margin-right: 40px;margin-left: 40px;"><p>This article surveys the PDF Association’s focus on
    accessibility, from advancing accessible PDF to promoting accessibility in ISO standards
    documents, and ensuring broad access and opportunity for all within its own
    operations.</p>
    <p>Founded in 2006 as the “PDF/A Competence Center”, since 2010 the organization has
        grown to encompass the PDF format as a whole. Today, the PDF Association provides a
        vendor-neutral meeting-place for PDF technology stakeholders while working to
        increase awareness and adoption of ISO-standardized PDF technology, including for
        accessibility.</p>
    <h2>ISO 14289-1 (PDF/UA-1), the ISO standard for accessible PDF</h2>
    <p>In 2012, after eight years of development led by PDF Association members, ISO
        published <a href="https://pdfa.org/resource/iso-14289-pdfua/">ISO 14289-1</a>,
        better known as PDF/UA (Universal Accessibility), to define requirements for the use
        of PDF’s Tagged PDF feature as defined for PDF 1.7 (2008). PDF/UA-1 was published in
        the early stages of the development of PDF 2.0, which began in 2009, and the lessons
        gained in creating PDF/UA-1 were put to use in the redevelopment of Tagged PDF in
        PDF 2.0.</p>
    <p>The PDF Association caused PDF/UA-1 to be the first ISO standard worldwide, on any
        subject, to itself meet standards for accessibility. As a tagged and validated PDF
        file, ISO 14289-1 thus conformed, not only with WCAG 2.0, but also with itself. 😉</p>
    <h2>Support for NVDA</h2>
    <p>Also in 2012, the PDF Association celebrated the publication of PDF/UA by <a
            href="https://pdfa.org/nvda-goes-pdfua-the-pdf-association-steps-up-to-fund-development-of-the-worlds-first-pdfua-conforming-screen-reader/">supporting
        development of NVDA</a>, the free and open source screen reader.</p>
    <h2>Best practice, test-suites and more</h2>
    <p>Since the publication of PDF/UA-1 the <a
            href="https://pdfa.org/community/pdf-ua-technical-working-group/">PDF/UA TWG</a>
        and <a href="https://pdfa.org/community/pdf-accessibility-liaison-working-group/">PDF
            Accessibility LWG</a> have developed a variety of resources to assist
        developers, end users and other stakeholders interested in driving accessibility in
        PDF content. The results of these efforts include:</p>
    <ul>
        <li><a href="https://pdfa.org/resource/the-matterhorn-protocol/">The Matterhorn
            Protocol</a>, now at version 1.1, is a set of tests covering all of PDF/UA-1’s
            requirements.
        </li>
        <li><a href="https://pdfa.org/resource/tagged-pdf-best-practice-guide-syntax/">The
            Tagged PDF Best Practice Guide: Syntax</a>, provides advice beyond the technical
            requirements defined in PDF/UA.
        </li>
        <li><a href="https://pdfa.org/resource/pdfua-reference-suite/">The PDF/UA Reference
            Suite</a>, a set of “real-world” PDF files usable as a demonstration of how PDF
            files should be tagged.
        </li>
        <li><a href="https://pdfa.org/resource/iso-ts-32005-hierarchical-inclusion-rules/">Hierarchical
            inclusion rules for PDF 1.7 and PDF 2.0 structure elements</a> used by <a
                href="https://pdfa.org/resource/iso-32005/">ISO 32005</a>.
        </li>
        <li>The <a href="https://pdfa.org/glossary-of-accessibility-terminology-in-pdf/">glossary
            of accessibility terminology in PDF</a></li>
    </ul>
    <h2>Tagged PDF redefined in ISO 32000-2 (PDF 2.0)</h2>
    <p>Led by the then-chairman of the PDF Association, callas software’s <a
            href="https://pdfa.org/people/olaf-drummer/">Olaf Drümmer</a>, one of PDF 2.0’s
        most significant upgrades was the re-development of the section defining Tagged PDF,
        setting the stage for PDF/UA-2.</p>
    <h2><a id="using-tagged-pdf"></a>Well-Tagged PDF (WTPDF)</h2>
    <p>PDF accessibility is one of several use-cases for “reuse” of tagged PDF. Other
        use-cases include expression as HTML, copy and paste functionality, content
        extraction for use by search engines, and more.</p>
    <p>Following publication of the 2nd edition of PDF/UA-1 in 2014, the PDF Association’s
        <a href="https://pdfa.org/community/pdf-ua-technical-working-group/">PDF/UA
            Technical Working Group</a>, later joined by the <a
                href="https://pdfa.org/community/pdf-reuse-twg/">PDF Reuse TWG</a>, began to
        develop a new specification for tagged PDF based on PDF 2.0.</p>
    <p>Intended for publication by the PDF Association, "<a href="https://pdfa.org/wtpdf/">Well-Tagged
        PDF (WTPDF): Using Tagged PDF for Accessibility and Reuse in PDF 2.0</a>" was
        developed in full alignment with ISO TC 171 SC 2 WG 9, the working group developing
        ISO 14289-2, to ensure its compatibility with the forthcoming ISO standard for
        accessible PDF 2.0 files.</p>
    <h2>ISO 14289-2 (PDF/UA-2)</h2>
    <p>As mentioned above, the text of PDF/UA-2 (to be published in Q1 of 2024) was
        developed by PDF Association working-groups working in coordination with ISO’s TC
        171 SC 2 WG 9. Comments from the ISO committee members were fed back into the PDF
        Association's development process to ensure 100% alignment between the new ISO
        standard and the PDF Association’s specification for Well-Tagged PDF.</p>
    <p>Building on PDF 2.0, PDF/UA-2 is a dramatic improvement on PDF/UA-1. For the first
        time it includes comprehensive provisions for annotations and structure element
        attributes, both of which are mostly absent in PDF/UA-1. PDF/UA-2 also leverages PDF
        2.0 in many other ways, from the new Namespaces feature that allows for integration
        of PDF 1.7 and PDF 2.0 structure elements in the same document to MathML, the new
        Artifact structure element type, and much more.</p>
    <h2>Encouraging accessible ISO standards</h2>
    <p>Since 2014 the PDF Association’s PDF/UA Technical Working Group has operated in
        close coordination with the ISO working group responsible for PDF/UA, TC 171 SC 2 WG
        9. Starting in 2021, members of these groups came together to conduct an assessment
        of ISO’s products and procedures from the accessibility perspective. This work
        culminated in 2022 with delivery to ISO of a report by the TC 171 SC 2 Chair
        Advisory Group (CAG) identifying areas of concern and making recommendations for
        enhancements. The PDF Association continues working actively with ISO to identify
        and mitigate accessibility issues with ISO’s document production and committee
        workflows.</p>
    <h2>Working Groups with an accessibility focus</h2>
    <p>As of early 2024 the PDF Association operates 20 active Working Groups. Of these, 4
        are directly engaged in advancing accessibility in PDF technology while 2 other
        groups are dedicated to subjects that track closely with accessibility.</p>
    <h3>PDF Association WGs dedicated to accessibility</h3>
    <p><a href="https://pdfa.org/community/pdf-ua-technical-working-group/">PDF/UA TWG </a>Together
        with the PDF Reuse TWG, this WG led development of the specification for Well-Tagged
        PDF, which is mirrored in the file format requirements in ISO 14289-2
        (PDF/UA-2).</p>
    <p><a href="https://pdfa.org/community/pdf-accessibility-liaison-working-group/">PDF
        Accessibility LWG </a>This WG develops techniques for accessible PDF, with its first
        set of “fundamental” techniques slated for publication in Q1 2024. In October 2023
        the working group published an <a
                href="https://pdfa.org/pdf-techniques-for-accessibility-a-new-model/">example</a>
        of what is to come.</p>
    <p><a href="https://pdfa.org/community/pdf-ua-processor-lwg/">PDF/UA Processor LWG </a>This
        WG focuses on recommendations and requirements for software engaged in processing
        PDF/UA files. As of 2024 the group is focused on examining accessibility API role
        mappings for HTML elements and WAI-ARIA / DPub attributes with the objective of
        mapping these features to their functional equivalents in PDF. <a
                href="https://pdfa.org/bridging-pdf-and-web-accessibility/">Read more</a>
        about their progress.</p>
    <p><a href="https://pdfa.org/community/latex-project-lwg/">LaTeX Project LWG </a>The <a
            class="extlink https" href="https://www.latex-project.org/" rel="noopener"
            target="_blank">LaTeX Project<sup></sup></a> is working to enhance the
        LaTeX typesetting system used by academic and technical authors worldwide to deliver
        complete support for the creation of structured document formats, in particular,
        Tagged PDF and PDF/UA. This LWG provides a workspace for LaTeX developers and PDF
        experts to share their expertise to advance both LaTeX and Tagged PDF.</p>
    <h3>PDF Association WGs related to accessibility</h3>
    <p><a href="https://pdfa.org/community/pdf-reuse-twg/">PDF Reuse TWG </a>In addition to
        its collaboration with the PDF/UA TWG on the development of Well-Tagged PDF and
        PDF/UA-2, this WG focuses on general reuse of Tagged PDF to deliver advanced support
        for conversion to HTML, copy and paste applications and other instances of content
        reuse.</p>
    <p><a href="https://pdfa.org/community/deriving-html-from-pdf-twg/">Deriving HTML from
        PDF TWG </a>Author of the usage specification “<a
            href="https://pdfa.org/resource/deriving-html-from-pdf/">Deriving HTML from
        PDF</a>” published in 2019, this working group focuses on leveraging Tagged PDF for
        this common content reuse case.</p>
    <h2>An ongoing commitment to accessibility</h2>
    <p>PDF’s core value proposition implies extreme flexibility in the representation of
        content. Accordingly, although accessible PDF has been possible for a long time,
        achieving it still offers challenges, depending largely on document complexity.
        Accordingly, the PDF Association will remain committed to developing resources that
        help all stakeholders, software developers, institutions and end users alike, to
        develop software to support accessible PDF, set policies for publication and
        acquisition, and author PDF files that meet accessibility standards.</p>
</div>
</body>
</html>

Now we’ll need to gather the required resources to create a WTPDF-conformant file. You can find these in the attached zip file under the Resources heading below. However, we recommend checking out the wtpdf-demo repository on our GitHub to ensure you have all the required dependencies.
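
Before running the conversion, it can help to confirm that every file the example reads is actually in place. The sketch below is a hypothetical helper (ResourceCheck is not part of the demo repository). The file names are those referenced by the HTML template above and by the App.java listing in this article, and the default folder is assumed to be ./src/main/resources/.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical preflight check; not part of the wtpdf-demo repository. */
final class ResourceCheck {

    // Files referenced by article.html and by App.java in this article.
    static final String[] REQUIRED = {
        "article.html",
        "assistive-tech-word-cloud-300x150.png",
        "PDF-Association-Logo-sq-150x150.png",
        "simplePdfUA2.xmp",
        "sRGB Color Space Profile.icm",
        "NotoSans-Regular.ttf",
        "NotoEmoji-Regular.ttf"
    };

    /** Returns the names of required files that are not present in the folder. */
    static List<String> missing(Path folder) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(folder.resolve(name))) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Path folder = Paths.get(args.length > 0 ? args[0] : "./src/main/resources/");
        List<String> missing = missing(folder);
        if (missing.isEmpty()) {
            System.out.println("All resources found in " + folder);
        } else {
            System.out.println("Missing from " + folder + ": " + missing);
        }
    }
}
```

Failing early with a list of missing files tends to be easier to diagnose than an exception raised partway through the conversion.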

Example Code

Let’s take a look at the App.java code to understand what it is actually doing:

JAVA

package com.itextpdf.wtpdfsample;

import com.itextpdf.html2pdf.ConverterProperties;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.html2pdf.attach.ITagWorker;
import com.itextpdf.html2pdf.attach.ProcessorContext;
import com.itextpdf.html2pdf.attach.impl.DefaultTagWorkerFactory;
import com.itextpdf.html2pdf.attach.impl.tags.HTagWorker;
import com.itextpdf.html2pdf.resolver.font.DefaultFontProvider;
import com.itextpdf.kernel.pdf.PdfAConformanceLevel;
import com.itextpdf.kernel.pdf.PdfDocumentInfo;
import com.itextpdf.kernel.pdf.PdfOutputIntent;
import com.itextpdf.kernel.pdf.PdfString;
import com.itextpdf.kernel.pdf.PdfVersion;
import com.itextpdf.kernel.pdf.PdfViewerPreferences;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.WriterProperties;
import com.itextpdf.kernel.xmp.XMPException;
import com.itextpdf.kernel.xmp.XMPMeta;
import com.itextpdf.kernel.xmp.XMPMetaFactory;
import com.itextpdf.layout.IPropertyContainer;
import com.itextpdf.layout.element.Div;
import com.itextpdf.layout.element.IElement;
import com.itextpdf.layout.element.Paragraph;
import com.itextpdf.pdfa.PdfADocument;
import com.itextpdf.styledxmlparser.node.IElementNode;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

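/**
 * Converts article.html into a tagged PDF 2.0 document that targets
 * WTPDF, PDF/A-4 and PDF/UA-2 conformance, using pdfHTML.
 */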
public class App {

    private static final String SOURCE_FOLDER = "./src/main/resources/";
    // HTML defines heading levels h1 to h6 only
    private static final Set<String> H_TAGS = new HashSet<>(Arrays.asList("h1", "h2", "h3", "h4", "h5", "h6"));

    public static void main(String[] args) throws IOException, XMPException {
        String outFile = "wtpdf.pdf";
        PdfOutputIntent outputIntent = new PdfOutputIntent(
                "Custom",
                "",
                "http://www.color.org",
                "sRGB IEC61966-2.1",
                Files.newInputStream(Paths.get(SOURCE_FOLDER + "sRGB Color Space Profile.icm")));


        WriterProperties writerProperties = new WriterProperties().setPdfVersion(PdfVersion.PDF_2_0);
        PdfADocument pdfDocument = new PdfADocument(new PdfWriter(outFile, writerProperties), PdfAConformanceLevel.PDF_A_4, outputIntent);

        // Custom tag worker factory: headings get special handling so that the
        // paragraphs inside them are not tagged as separate elements.
        DefaultTagWorkerFactory factory = new DefaultTagWorkerFactory() {
            @Override
            public ITagWorker getCustomTagWorker(IElementNode tag, ProcessorContext context) {
                if (H_TAGS.contains(tag.name())) {
                    return new HTagWorker(tag, context) {
                        @Override
                        public IPropertyContainer getElementResult() {
                            IPropertyContainer elementResult = super.getElementResult();
                            if (elementResult instanceof Div) {
                                // Give child paragraphs a neutral role so that only
                                // the heading itself appears in the structure tree.
                                for (IElement child : ((Div) elementResult).getChildren()) {
                                    if (child instanceof Paragraph) {
                                        ((Paragraph) child).setNeutralRole();
                                    }
                                }
                            }
                            return elementResult;
                        }
                    };
                }
                // Fall back to the default behaviour for all other tags
                return super.getCustomTagWorker(tag, context);
            }
        };

        // Set up the general requirements for a WTPDF document
        byte[] bytes = Files.readAllBytes(Paths.get(SOURCE_FOLDER + "simplePdfUA2.xmp"));
        XMPMeta xmpMeta = XMPMetaFactory.parse(new ByteArrayInputStream(bytes));
        pdfDocument.setXmpMetadata(xmpMeta);
        pdfDocument.setTagged();
        pdfDocument.getCatalog().setViewerPreferences(new PdfViewerPreferences().setDisplayDocTitle(true));
        pdfDocument.getCatalog().setLang(new PdfString("en-US"));
        PdfDocumentInfo info = pdfDocument.getDocumentInfo();
        info.setTitle("Well tagged PDF document");

        // Use custom font provider as we only want embedded fonts
        DefaultFontProvider fontProvider = new DefaultFontProvider(false, false, false);
        fontProvider.addFont(SOURCE_FOLDER + "NotoSans-Regular.ttf");
        fontProvider.addFont(SOURCE_FOLDER + "NotoEmoji-Regular.ttf");

        ConverterProperties converterProperties = new ConverterProperties()
                .setBaseUri(SOURCE_FOLDER)
                .setTagWorkerFactory(factory)
                .setFontProvider(fontProvider);


        File file = new File(SOURCE_FOLDER + "article.html");
        try (FileInputStream str = new FileInputStream(file)) {
            HtmlConverter.convertToPdf(str, pdfDocument, converterProperties);
        }
        pdfDocument.close();
        System.out.println("WTPDF created");
    }
}

You’ll notice we select PdfVersion.PDF_2_0 in the WriterProperties, since WTPDF requires PDF 2.0 conformance.

Much like the PDF/A-3B example from the article mentioned earlier, we must supply the fonts to be embedded. You’ll note that since our source document uses an emoji, we need to embed NotoEmoji-Regular as well as the NotoSans-Regular font.
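
One rough way to spot characters that may need an additional font is to scan the template for code points outside the Basic Multilingual Plane, which is where most emoji live. The sketch below is a heuristic under that assumption and is not part of the demo code: SupplementaryScan is a hypothetical helper, and a clean result does not guarantee that your chosen fonts cover every character in the document.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: lists code points above U+FFFF found in a text. */
final class SupplementaryScan {

    /** Returns the distinct supplementary code points, formatted as U+XXXX. */
    static Set<String> supplementaryCodePoints(String text) {
        Set<String> found = new LinkedHashSet<>();
        text.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // Sentence from the sample article; the winking face is written
        // as a surrogate pair so the source file stays ASCII-only.
        String sample = "not only with WCAG 2.0, but also with itself. \uD83D\uDE09";
        System.out.println(supplementaryCodePoints(sample)); // prints [U+1F609]
    }
}
```

Running a check like this over article.html would flag the winking face, which is the reason the example registers NotoEmoji-Regular alongside NotoSans-Regular.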

We must also set the output intent and provide an appropriate color profile, as well as the necessary XMP metadata to embed into the file. For more information on XMP, see https://kb.itextpdf.com/itext/how-to-add-metadata-to-a-pdf-using-pdfhtml.
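
For orientation, the following sketch builds a minimal XMP packet carrying the PDF/A-4 and PDF/UA-2 identification properties and checks that it is well-formed XML. It is an illustration under stated assumptions, not a copy of the simplePdfUA2.xmp file from the resources, which may contain different or additional properties. XmpSketch and its methods are hypothetical names, and a well-formedness check says nothing about conformance.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

/** Hypothetical helper: builds a minimal XMP packet for illustration. */
final class XmpSketch {

    /** Escapes the characters that are special in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds an XMP packet with a title and the PDF/A and PDF/UA identifiers. */
    static String minimalXmp(String title) {
        return String.join("\n",
            "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>",
            "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">",
            " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">",
            "  <rdf:Description rdf:about=\"\"",
            "    xmlns:dc=\"http://purl.org/dc/elements/1.1/\"",
            "    xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\"",
            "    xmlns:pdfuaid=\"http://www.aiim.org/pdfua/ns/id/\">",
            "   <dc:title><rdf:Alt>",
            "    <rdf:li xml:lang=\"x-default\">" + escape(title) + "</rdf:li>",
            "   </rdf:Alt></dc:title>",
            "   <pdfaid:part>4</pdfaid:part>",
            "   <pdfaid:rev>2020</pdfaid:rev>",
            "   <pdfuaid:part>2</pdfuaid:part>",
            "   <pdfuaid:rev>2024</pdfuaid:rev>",
            "  </rdf:Description>",
            " </rdf:RDF>",
            "</x:xmpmeta>",
            "<?xpacket end=\"w\"?>");
    }

    /** Returns true if the string parses as namespace-aware XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String xmp = minimalXmp("Well tagged PDF document");
        System.out.println(xmp);
        System.out.println("Well-formed: " + isWellFormed(xmp));
    }
}
```

Keeping the metadata in a separate .xmp file, as the example does, lets you edit it without touching the Java code.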

We set the PdfAConformanceLevel to PDF/A-4 and, since we want a Tagged PDF, call pdfDocument.setTagged() to add a hierarchical structure tree to the PDF. To conform to PDF/UA-2 as well, we must set the document’s title and language, along with the viewer preference that displays the document title.
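
The example hard-codes the language and title. If you would rather derive them from the template, a sketch along the following lines could read the lang attribute and the title element from the HTML. HtmlMetadataSketch is a hypothetical helper, and the regular expressions are a simplification that assumes double-quoted attributes and a plain title element; a real HTML parser would be more robust.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls the language and title out of an HTML string. */
final class HtmlMetadataSketch {

    // Simplified patterns; they assume double-quoted attributes.
    private static final Pattern LANG = Pattern.compile(
            "<html[^>]*\\slang=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the value of the lang attribute on the html element, if any. */
    static Optional<String> lang(String html) {
        Matcher m = LANG.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Returns the trimmed text of the title element, if any. */
    static Optional<String> title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html lang=\"en-US\"><head><title>Sample article</title></head></html>";
        System.out.println(lang(html).orElse("(no lang)"));
        System.out.println(title(html).orElse("(no title)"));
    }
}
```

The extracted values could then be passed to the same setLang and setTitle calls used in App.java, so the PDF metadata stays in step with the template.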

Validating the Results

And that’s it! However, we should validate our document to make sure it passes the required checks for PDF/A-4 and PDF/UA-2. To verify our output we can run our created document through the veraPDF Conformance Checker, which in the current release (1.26.2) supports both PDF/A-4 and PDF/UA-2 validation:

As you can see, our WTPDF document passes all the checks for PDF/A-4 and PDF/UA-2.

Resources

GitHub Repository

wtpdf-resources.zip

Results

apryse-itext-wtpdf.pdf