Adding Bookmarks / Table of Contents to a pdfHTML conversion document
Introduction
In many PDF documents, a Table of Contents and bookmarks are essential for navigating large documents. With iText and pdfHTML you can automate the addition of a Table of Contents and bookmarks during HTML to PDF conversion, making documents more user-friendly.
Key Concepts
Bookmarks: Clickable entries in the PDF's outline panel, usually pointing to headings or sections.
Table of Contents: A list of headings with clickable links that refer to the corresponding sections in the document.
We will use HtmlConverter to convert HTML into a PDF and PdfOutline to create bookmarks for easy navigation. We also generate a Table of Contents (ToC) by parsing the HTML, adding internal links to specific sections, and displaying their page numbers.
Explanation
HTML Parsing & Table of Contents creation:
We begin by parsing the HTML file using Jsoup, then create a container for the Table of Contents (ToC) inside the body of the HTML. This lets us insert a Table of Contents container dynamically, without editing the original HTML file. The container will hold the Table of Contents entries that link to the headings of the document.
Java ☕ :
JAVA
Document htmlDoc = Jsoup.parse(new File(SRC), "UTF-8");
// This is our Table of Contents aggregating element
Element tocElement = htmlDoc.body().prependElement("div");
tocElement.append("<b>Table of contents</b>");
.NET :
C#
Document htmlDoc = Jsoup.Parse(new FileInfo(SRC), "UTF-8");
// This is our Table of Contents aggregating element
Element tocElement = htmlDoc.Body().PrependElement("div");
tocElement.Append("<b>Table of contents</b>");
Building a Table of Contents (ToC) and adding IDs
We now iterate over all the <h2> elements in the document. For each heading, we assign a unique ID if it does not already have one. This ensures that the Table of Contents (ToC) can create clickable links that refer to these headings by their IDs.
ID Assignment: Each heading gets a unique ID, which allows us to reference it in the PDF and link to it from the Table of Contents (ToC).
CSS Styling: The dynamically generated CSS uses target-counter to display page numbers alongside the ToC entries; the numbers are resolved when the PDF is generated.
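To make the generated CSS and link markup concrete, the following minimal sketch builds them for a single heading using plain string concatenation, with no iText dependency. The class name TocCssSketch, the helper method names and the id "intro" are illustrative assumptions; only the rule and link text mirror what the sample code in this article produces.

```java
// Illustrative sketch: TocCssSketch and its method names are hypothetical helpers,
// not part of iText or pdfHTML.
class TocCssSketch {

    // Builds the CSS rule that shows the page number of the element with the given id
    // inside the matching ToC row (the row carries a data-toc-id attribute).
    static String pageNumberRule(String id) {
        return "*[data-toc-id=\"" + id + "\"] .toc-page-ref::after "
                + "{ content: target-counter(#" + id + ", page) }";
    }

    // Builds the link placed in the right-hand ToC cell; the empty <span> is the
    // placeholder that the CSS rule above fills with the page number.
    static String pageRefLink(String id) {
        return "<a href=\"#" + id + "\"><span class=\"toc-page-ref\"></span></a>";
    }

    public static void main(String[] args) {
        String id = "intro"; // hypothetical heading id
        System.out.println(pageNumberRule(id));
        System.out.println(pageRefLink(id));
    }
}
```

For the id "intro", the first line printed is the rule `*[data-toc-id="intro"] .toc-page-ref::after { content: target-counter(#intro, page) }`, and the second is the anchor that points to `#intro`.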
Java Code Sample :
JAVA
// We are going to build a complex CSS
StringBuilder tocStyles = new StringBuilder().append("<style>");
try (PdfDocument pdfDocument = new PdfDocument(new PdfWriter(DEST))) {
    PdfOutline bookmarks = pdfDocument.getOutlines(false);
    Elements tocElements = htmlDoc.select("h2");
    for (Element elem : tocElements) {
        // Here we create an anchor to be able to refer to this element when generating page numbers and links
        String id = elem.attr("id");
        if (id == null || id.isEmpty()) {
            id = generateId();
            elem.attr("id", id);
        }
        // CSS selector to show page numbers for a TOC entry
        tocStyles.append("*[data-toc-id=\"").append(id)
                .append("\"] .toc-page-ref::after { content: target-counter(#").append(id).append(", page) }");
        // Generating TOC entry as a small table to align page numbers on the right
        Element tocEntry = tocElement.appendElement("table");
        tocEntry.attr("style", "width: 100%");
        Element tocEntryRow = tocEntry.appendElement("tr");
        tocEntryRow.attr("data-toc-id", id);
        Element tocEntryTitle = tocEntryRow.appendElement("td");
        tocEntryTitle.appendText(elem.text());
        Element tocEntryPageRef = tocEntryRow.appendElement("td");
        tocEntryPageRef.attr("style", "text-align: right");
        // <span> is a placeholder element where target page number will be inserted
        // It is wrapped by an <a> tag to create links pointing to the element in our document
        tocEntryPageRef.append("<a href=\"#" + id + "\"><span class=\"toc-page-ref\"></span></a>");
    }
    // The try block stays open here; it is closed after the conversion step
.NET Code Sample
C#
// We are going to build a complex CSS
StringBuilder tocStyles = new StringBuilder().Append("<style>");
using (PdfDocument pdfDocument = new PdfDocument(new PdfWriter(DEST)))
{
    PdfOutline bookmarks = pdfDocument.GetOutlines(false);
    Elements tocElements = htmlDoc.Select("h2");
    foreach (Element elem in tocElements)
    {
        // Here we create an anchor to be able to refer to this element when generating page numbers and links
        String id = elem.Attr("id");
        if (string.IsNullOrEmpty(id))
        {
            id = generateId();
            elem.Attr("id", id);
        }
        // CSS selector to show page numbers for a TOC entry
        tocStyles.Append("*[data-toc-id=\"").Append(id)
            .Append("\"] .toc-page-ref::after { content: target-counter(#").Append(id).Append(", page) }");
        // Generating TOC entry as a small table to align page numbers on the right
        Element tocEntry = tocElement.AppendElement("table");
        tocEntry.Attr("style", "width: 100%");
        Element tocEntryRow = tocEntry.AppendElement("tr");
        tocEntryRow.Attr("data-toc-id", id);
        Element tocEntryTitle = tocEntryRow.AppendElement("td");
        tocEntryTitle.AppendText(elem.Text());
        Element tocEntryPageRef = tocEntryRow.AppendElement("td");
        tocEntryPageRef.Attr("style", "text-align: right");
        // <span> is a placeholder element where target page number will be inserted
        // It is wrapped by an <a> tag to create links pointing to the element in our document
        tocEntryPageRef.Append("<a href=\"#" + id + "\"><span class=\"toc-page-ref\"></span></a>");
    }
    // The using block stays open here; it is closed after the conversion step
Adding Bookmarks for Navigation
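For each heading, we create a corresponding bookmark in the PDF document. In the full sample code, this happens inside the same loop that builds the ToC entries.
PdfOutline: This class lets us create a new outline (bookmark) from the text of the heading, elem.text(), by calling bookmarks.addOutline(elem.text()).
PdfAction.createGoTo(id): This links the bookmark to the corresponding heading identified by its id; the action is attached with bookmark.addAction(...). When the user clicks a bookmark in the PDF, they are taken to the relevant section.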
Finalizing HTML with CSS and converting it to PDF
After generating the Table of Contents (ToC) and bookmarks, we append the CSS styles into the HTML <head> and convert the modified HTML to a PDF. This conversion creates the final document with all the required formatting, Table of Contents (ToC) and interactive elements.
Java ☕ :
JAVA
tocStyles.append("</style>");
htmlDoc.head().append(tocStyles.toString());
String html = htmlDoc.outerHtml();
ConverterProperties converterProperties = new ConverterProperties().setImmediateFlush(false);
HtmlConverter.convertToDocument(html, pdfDocument, converterProperties).close();
.NET :
C#
tocStyles.Append("</style>");
htmlDoc.Head().Append(tocStyles.ToString());
String html = htmlDoc.OuterHtml();
ConverterProperties converterProperties = new ConverterProperties().SetImmediateFlush(false);
HtmlConverter.ConvertToDocument(html, pdfDocument, converterProperties).Close();
Appending CSS: The dynamically generated CSS is appended to the <head> of the HTML document.
htmlDoc.outerHtml(): Converts the modified HTML document back to a string, now with the Table of Contents, styles and IDs.
HtmlConverter.convertToDocument(): Converts the final HTML (which includes our ToC and styles) into a PDF document.
Together, these steps ensure that the headings in the document are uniquely identifiable, enabling proper links in the Table of Contents (ToC) and bookmarks.
Sample Image for Table of Contents
To summarize: once we have dynamically inserted the Table of Contents container into an HTML document using Jsoup, we can programmatically assign IDs to the headings (random IDs in general; this example uses predefined ones) and generate bookmarks for navigation. We then convert the modified HTML into a PDF document that includes the Table of Contents and interactive elements such as bookmarks. This process automates the creation of structured, navigable PDFs with the pdfHTML library.
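The sample code in this article uses a fixed list of predefined IDs. As a minimal sketch of the random alternative, the class below generates IDs from java.util.UUID. The class name HeadingIdGenerator and the "toc-" prefix are illustrative assumptions, not part of iText; the prefix is there so that an ID never starts with a digit, which would not be a valid CSS ID selector.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Illustrative sketch: HeadingIdGenerator is a hypothetical helper, not an iText class.
class HeadingIdGenerator {

    // Remembers the ids already handed out so that each heading gets a distinct one.
    private final Set<String> used = new HashSet<>();

    // Returns a new id such as "toc-3f2b..."; retries in the unlikely case of a repeat.
    String nextId() {
        String id;
        do {
            id = "toc-" + UUID.randomUUID();
        } while (!used.add(id));
        return id;
    }

    public static void main(String[] args) {
        HeadingIdGenerator generator = new HeadingIdGenerator();
        System.out.println(generator.nextId());
        System.out.println(generator.nextId());
    }
}
```

A generator like this could stand in for the generateId() method of the sample, under the assumption that reproducible IDs are not needed, for example outside of tests.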
Resources :
☕ Java Sample Code :
Example Code in Java using iText 9.1.0 and pdfHTML 6.1.0
JAVA
import com.itextpdf.html2pdf.ConverterProperties;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.kernel.pdf.*;
import com.itextpdf.kernel.pdf.action.PdfAction;
import com.itextpdf.styledxmlparser.jsoup.Jsoup;
import com.itextpdf.styledxmlparser.jsoup.nodes.Document;
import com.itextpdf.styledxmlparser.jsoup.nodes.Element;
import com.itextpdf.styledxmlparser.jsoup.select.Elements;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class Bookmark {
    public static final String DEST = "sample_output.pdf";
    public static final String SRC = "original_file.html";
    private static final List<String> IDS = new ArrayList<>();

    static {
        IDS.add("random_id_1");
        IDS.add("random_id_2");
        IDS.add("random_id_3");
        IDS.add("random_id_4");
        IDS.add("random_id_5");
        IDS.add("random_id_6");
    }

    public static void main(String[] args) throws Exception {
        new Bookmark().manipulatePdf();
    }

    public void manipulatePdf() throws Exception {
        Document htmlDoc = Jsoup.parse(new File(SRC), "UTF-8");
        // This is our Table of Contents aggregating element
        Element tocElement = htmlDoc.body().prependElement("div");
        tocElement.append("<b>Table of contents</b>");
        // We are going to build a complex CSS
        StringBuilder tocStyles = new StringBuilder().append("<style>");
        try (PdfDocument pdfDocument = new PdfDocument(new PdfWriter(DEST))) {
            PdfOutline bookmarks = pdfDocument.getOutlines(false);
            Elements tocElements = htmlDoc.select("h2");
            for (Element elem : tocElements) {
                // Here we create an anchor to be able to refer to this element when generating page numbers and links
                String id = elem.attr("id");
                if (id == null || id.isEmpty()) {
                    id = generateId();
                    elem.attr("id", id);
                }
                // CSS selector to show page numbers for a TOC entry
                tocStyles.append("*[data-toc-id=\"").append(id)
                        .append("\"] .toc-page-ref::after { content: target-counter(#").append(id).append(", page) }");
                // Generating TOC entry as a small table to align page numbers on the right
                Element tocEntry = tocElement.appendElement("table");
                tocEntry.attr("style", "width: 100%");
                Element tocEntryRow = tocEntry.appendElement("tr");
                tocEntryRow.attr("data-toc-id", id);
                Element tocEntryTitle = tocEntryRow.appendElement("td");
                tocEntryTitle.appendText(elem.text());
                Element tocEntryPageRef = tocEntryRow.appendElement("td");
                tocEntryPageRef.attr("style", "text-align: right");
                // <span> is a placeholder element where target page number will be inserted
                // It is wrapped by an <a> tag to create links pointing to the element in our document
                tocEntryPageRef.append("<a href=\"#" + id + "\"><span class=\"toc-page-ref\"></span></a>");
                // Add bookmark
                PdfOutline bookmark = bookmarks.addOutline(elem.text());
                bookmark.addAction(PdfAction.createGoTo(id));
            }
            tocStyles.append("</style>");
            htmlDoc.head().append(tocStyles.toString());
            String html = htmlDoc.outerHtml();
            ConverterProperties converterProperties = new ConverterProperties().setImmediateFlush(false);
            HtmlConverter.convertToDocument(html, pdfDocument, converterProperties).close();
        }
    }

    private static String generateId() {
        // Usually random id can be generated, but for the purpose of testing we will use predefined ids.
        return IDS.isEmpty() ? null : IDS.remove(0);
    }
}
💠 .NET Sample Code :
Example Code in .NET using iText 9.1.0 and pdfHTML 6.1.0
C#
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using iText.Html2pdf;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Action;
using iText.StyledXmlParser.Jsoup;
using iText.StyledXmlParser.Jsoup.Nodes;
using iText.StyledXmlParser.Jsoup.Select;

public class DynamicallyAddToCAndBookmarksHtml
{
    static string DEST = @"..\..\..\Bookmarks\lib\sample-output.pdf";
    static string SRC = @"..\..\..\Bookmarks\lib\original_file.html";
    private static readonly IList<string> IDS = new List<string>();

    static DynamicallyAddToCAndBookmarksHtml()
    {
        IDS.Add("random_id_1");
        IDS.Add("random_id_2");
        IDS.Add("random_id_3");
        IDS.Add("random_id_4");
        IDS.Add("random_id_5");
        IDS.Add("random_id_6");
    }

    public static void Main(string[] args)
    {
        FileInfo file = new FileInfo(DEST);
        file.Directory.Create();
        Console.WriteLine(SRC);
        new DynamicallyAddToCAndBookmarksHtml().ManipulatePdf();
    }

    public void ManipulatePdf()
    {
        Document htmlDoc = Jsoup.Parse(new FileInfo(SRC), "UTF-8");
        // This is our Table of Contents aggregating element
        Element tocElement = htmlDoc.Body().PrependElement("div");
        tocElement.Append("<b>Table of contents</b>");
        // We are going to build a complex CSS
        StringBuilder tocStyles = new StringBuilder().Append("<style>");
        using (PdfDocument pdfDocument = new PdfDocument(new PdfWriter(DEST)))
        {
            PdfOutline bookmarks = pdfDocument.GetOutlines(false);
            Elements tocElements = htmlDoc.Select("h2");
            foreach (Element elem in tocElements)
            {
                // Here we create an anchor to be able to refer to this element when generating page numbers and links
                String id = elem.Attr("id");
                if (string.IsNullOrEmpty(id))
                {
                    id = generateId();
                    elem.Attr("id", id);
                }
                // CSS selector to show page numbers for a TOC entry
                tocStyles.Append("*[data-toc-id=\"").Append(id)
                    .Append("\"] .toc-page-ref::after { content: target-counter(#").Append(id).Append(", page) }");
                // Generating TOC entry as a small table to align page numbers on the right
                Element tocEntry = tocElement.AppendElement("table");
                tocEntry.Attr("style", "width: 100%");
                Element tocEntryRow = tocEntry.AppendElement("tr");
                tocEntryRow.Attr("data-toc-id", id);
                Element tocEntryTitle = tocEntryRow.AppendElement("td");
                tocEntryTitle.AppendText(elem.Text());
                Element tocEntryPageRef = tocEntryRow.AppendElement("td");
                tocEntryPageRef.Attr("style", "text-align: right");
                // <span> is a placeholder element where target page number will be inserted
                // It is wrapped by an <a> tag to create links pointing to the element in our document
                tocEntryPageRef.Append("<a href=\"#" + id + "\"><span class=\"toc-page-ref\"></span></a>");
                // Add bookmark
                PdfOutline bookmark = bookmarks.AddOutline(elem.Text());
                bookmark.AddAction(PdfAction.CreateGoTo(id));
            }
            tocStyles.Append("</style>");
            htmlDoc.Head().Append(tocStyles.ToString());
            String html = htmlDoc.OuterHtml();
            ConverterProperties converterProperties = new ConverterProperties().SetImmediateFlush(false);
            HtmlConverter.ConvertToDocument(html, pdfDocument, converterProperties).Close();
        }
    }
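    private static String generateId()
    {
        // Usually random id can be generated, but for the purpose of testing we will use predefined ids.
        // Return null when the list is exhausted (as the Java sample does) instead of
        // calling RemoveAt(0) on an empty list, which would throw.
        if (IDS.Count == 0)
        {
            return null;
        }
        string id = IDS[0];
        IDS.RemoveAt(0);
        return id;
    }
}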