PDF and Digital Signatures

Digital signature technology for PDF, as described in ETSI's PDF Advanced Electronic Signatures (PAdES) standard and in the ISO-32000 standards, was created to meet specific requirements.

Requirements that PDF digital signatures need to meet

The purpose of the PDF digital signatures functionality is to ensure:

Integrity: we want assurance that the content of the document hasn't been changed somewhere in the workflow,
Authenticity: we want assurance about the origin of the document, e.g. about the identity of the author or the instance that issued the document,
Non-repudiation: we want assurance that the issuing party can't deny its authorship,
The time of signing: we want assurance about the date and time a document was signed.

ISO-32000-2 also introduces a concept that was first published in PAdES-4:

Long-term validation (LTV): we want the assurance that the integrity, authenticity, non-repudiation, and the time of signing can still be validated in the long term.

Let's find out how this works, step by step.

Creating a digital signature for PDF

In figure 1, we can see how the bytes are organized in a PDF file.

Figure 1: Digital Signature in PDF

When we sign a PDF document, we take all the bytes of the file, except for the area where the digital signature will be stored. We create a message digest using a cryptographic hash function from these bytes. We sign this hash value (depending on the exact signature type, together with additional attributes) using a private key, and we store that signed data along with some extra, unsigned information (the signer certificate, ...).
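The digest step can be sketched in a few lines of Python. This is a minimal illustration, not a PDF implementation: the byte strings, the 32-byte placeholder, and the function name `digest_excluding_gap` are assumptions made for the example, and SHA-256 is just one possible hash function.

```python
import hashlib


def digest_excluding_gap(pdf_bytes: bytes, gap_start: int, gap_end: int) -> bytes:
    """Hash every byte except pdf_bytes[gap_start:gap_end], the area reserved for the signature."""
    h = hashlib.sha256()
    h.update(pdf_bytes[:gap_start])  # bytes before the signature area
    h.update(pdf_bytes[gap_end:])    # bytes after the signature area
    return h.digest()


# Hypothetical stand-in for a PDF file with a reserved signature area.
before = b"%PDF-1.7 (document content) /Contents <"
placeholder = b"0" * 32
after = b"> (rest of the file) %%EOF"
document = before + placeholder + after

gap_start = len(before)
gap_end = gap_start + len(placeholder)
digest = digest_excluding_gap(document, gap_start, gap_end)

# Writing the signature into the reserved area does not change the digest,
# because that area was never part of the hashed bytes.
signed = before + b"A" * 32 + after
print(digest_excluding_gap(signed, gap_start, gap_end) == digest)
```

Because the reserved area is skipped, the signature can be written into the file after the digest has been computed without invalidating that digest.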

Signing is based on Public Key Infrastructure (PKI). The signer owns a key pair, consisting of a public key and a private key. We talk about encryption when someone uses the public key to encrypt a message: only the party who possesses the corresponding private key can decrypt it. We talk about signing when someone uses the private key to encrypt a message digest: everyone who has access to the public key can decrypt the result and compare it with a digest they compute themselves, but only the owner of the private key could have produced it.
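To make the roles of the two keys concrete, here is a toy sketch using textbook RSA with tiny primes. It is an assumption-laden illustration only: the primes, the helper names (`toy_digest`, `sign`, `verify`), and the reduction of the digest modulo `n` are choices made for this example, and real signatures use large keys and padding schemes. It needs Python 3.8 or later for the modular inverse.

```python
import hashlib

# Toy key pair built from two small primes (insecure, for illustration only).
p, q = 61, 53
n = p * q                 # public modulus: 3233
phi = (p - 1) * (q - 1)   # 3120
e = 17                    # public exponent
d = pow(e, -1, phi)       # private exponent: 2753


def toy_digest(message: bytes) -> int:
    """SHA-256 digest reduced modulo n so that it fits the toy key."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n


def sign(message: bytes, private_exponent: int, modulus: int) -> int:
    """Only the holder of the private exponent can compute this value."""
    return pow(toy_digest(message), private_exponent, modulus)


def verify(message: bytes, signature: int, public_exponent: int, modulus: int) -> bool:
    """Anyone who has the public key can check the signature."""
    return pow(signature, public_exponent, modulus) == toy_digest(message)


message = b"I agree to the terms of this document."
signature = sign(message, d, n)

print(verify(message, signature, e, n))            # True
print(verify(message, (signature + 1) % n, e, n))  # False: the signature was altered
```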

A document can be signed more than once, by different signers. This needs to happen sequentially; otherwise, every new signer would invalidate the previous signers' signatures. This is shown in figure 2.

Figure 2: Sequential Signatures

This document has three revisions. Revision 1 is signed by signer 1. Revision 2 is signed by signers 1 and 2. Revision 3 is signed by signers 1, 2, and 3. Every new signer signs all the previous signatures.
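The following sketch models why appending revisions keeps earlier signatures intact. It is a deliberate simplification under stated assumptions: a "signature" here is only a SHA-256 digest (no private key is involved), the signature value is simply appended to the file, and the names `RevisionSignature`, `sign_revision`, and `still_valid` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RevisionSignature:
    signer: str
    covered_length: int  # number of leading bytes this signature covers
    digest: bytes


def sign_revision(document: bytes, new_content: bytes, signer: str) -> Tuple[bytes, RevisionSignature]:
    """Append a revision and 'sign' everything up to and including it."""
    covered = document + new_content
    digest = hashlib.sha256(covered).digest()
    sig = RevisionSignature(signer, len(covered), digest)
    # Store the signature value in the file so that later signers cover it too.
    return covered + digest.hex().encode("ascii"), sig


def still_valid(document: bytes, sig: RevisionSignature) -> bool:
    """A signature stays valid as long as the bytes it covered are unchanged."""
    return hashlib.sha256(document[:sig.covered_length]).digest() == sig.digest


doc = b""
doc, sig1 = sign_revision(doc, b"revision 1", "signer 1")
doc, sig2 = sign_revision(doc, b"revision 2", "signer 2")
doc, sig3 = sign_revision(doc, b"revision 3", "signer 3")

# Appending revisions never touches the bytes covered by earlier signatures.
print([still_valid(doc, s) for s in (sig1, sig2, sig3)])  # [True, True, True]
```

Changing a byte inside revision 2 would leave the first signature valid in this model, but it would invalidate the second and third signatures, since both cover that byte.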

The minimum information that needs to be stored inside the signature consists of:

The signed message digest, and
The signer's certificate (containing the public key that corresponds with the private key that was used for signing).

Best practices also require the presence of:

The rest of the certificates in the certificate chain (leading up to the root certificate),
Revocation information, in the form of a Certificate Revocation List (CRL) or an Online Certificate Status Protocol (OCSP) response,
A timestamp.

We'll need all of these elements to verify the signature.

Are our initial requirements met?

When you receive a digitally signed PDF, you can verify the integrity of the document by hashing the bytes of the PDF (excluding the bytes of the signature itself), testing whether it matches the hash in the signed data, and checking the signed data cryptographically using the public key.

The identity of the signer is stored in the signer's certificate. If the certificate is self-signed, there is no way to verify the authenticity, or to get any assurance of non-repudiation. To solve this problem, the signer needs to involve a certificate authority (CA) that will vouch for the information about the signer's identity. The signer's certificate will be on one end of the certificate chain; the root certificate of the CA will be on the other end. When verifying a signature, we'll check the revocation information provided by the CA and all the trusted parties in between. The signer won't be able to deny having signed the document without proving that the private key was compromised, in which case the signer should have revoked the certificate. This meets the requirements of authenticity and non-repudiation.
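The chain walk described above can be sketched as follows. This is a simplified model under stated assumptions: the `Certificate` class, the `validate_chain` function, and all names in the example are hypothetical, and the sketch only checks issuer links, revocation flags, and the trust anchor. A real validator also checks the cryptographic signature on every certificate, the validity periods, and the actual CRL or OCSP data.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Certificate:
    subject: str
    issuer: str
    revoked: bool = False  # stand-in for a CRL or OCSP lookup


def validate_chain(chain: List[Certificate], trusted_roots: Set[str]) -> bool:
    """chain[0] is the signer's certificate, chain[-1] is the root certificate."""
    if not chain:
        return False
    # Every certificate must be issued by the next one in the chain.
    for cert, parent in zip(chain, chain[1:]):
        if cert.issuer != parent.subject:
            return False
    # No certificate in the chain may be revoked.
    if any(cert.revoked for cert in chain):
        return False
    # The chain must end in a self-signed root that we already trust.
    root = chain[-1]
    return root.subject == root.issuer and root.subject in trusted_roots


root = Certificate("Example Root CA", "Example Root CA")
intermediate = Certificate("Example Intermediate CA", "Example Root CA")
alice = Certificate("Alice", "Example Intermediate CA")
self_signed = Certificate("Mallory", "Mallory")
trusted = {"Example Root CA"}

print(validate_chain([alice, intermediate, root], trusted))  # True
print(validate_chain([self_signed], trusted))                # False: no trusted CA vouches for it
```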

Best practices also imply that you involve a Timestamp Authority (TSA). A TSA service accepts your signed hash and signs it with its own private key, adding a timestamp that can be trusted. This meets the requirement regarding the time of signing.

This timestamp is stored inside the digital signature, but PAdES-4 also introduced the concepts of a Document Security Store (DSS) and Document Timestamp (DTS) signatures. If a signed document is missing Validation Related Information (VRI), this information can be added to the PDF in a DSS. See figure 3 where we add missing certificates and missing revocation information.

Figure 3: Adding a Document Security Store (DSS)

Digital signatures expire, either because certificates expire, or because the algorithms are proven to be flawed, in which case the signature loses its trustworthiness.

For instance: SHA-1 was once considered a safe cryptographic hash function, but it was deprecated by NIST in 2011 because it was considered broken in theory. In 2017, researchers from CWI and Google succeeded in breaking SHA-1 in practice: they produced two different PDF files with the same SHA-1 hash. This proved the theoretical flaw, because it means that the content of a PDF file can be altered without breaking a signature that relies on SHA-1.

With DTS signatures, we can extend the life of the digital signatures inside a document beyond their expiration. To achieve this, the most recent (valid!) VRI should be retrieved and added in a DSS, after which the document should be signed with a DTS. This signature should be created by a TSA using the most recent hashing and encryption algorithms; see figure 4.

Figure 4: Adding a Document Timestamp Signature

Obviously, the DTS also expires after a certain time, hence the operation needs to be repeated to further extend the life of the signatures in the document. This is shown in figure 5.

Figure 5: Extending the Life of a Digital Signature
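The renewal logic can be sketched as a simple date check. This is only a model of the reasoning, under stated assumptions: the `Layer` class, the `can_validate` function, and all dates are hypothetical, and each layer (the original signature or a later DTS) is reduced to the date it was added and the date it expires.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Layer:
    label: str
    added_on: date
    expires_on: date


def can_validate(layers: List[Layer], today: date) -> bool:
    """Each layer must be added while the previous one is still valid,
    and the newest layer must still be valid today."""
    if not layers:
        return False
    for older, newer in zip(layers, layers[1:]):
        if not (older.added_on <= newer.added_on <= older.expires_on):
            return False
    newest = layers[-1]
    return newest.added_on <= today <= newest.expires_on


signature = Layer("signature", date(2020, 1, 1), date(2023, 1, 1))
dts_1 = Layer("DTS 1", date(2022, 6, 1), date(2030, 6, 1))
dts_2 = Layer("DTS 2", date(2029, 6, 1), date(2040, 6, 1))
late_dts = Layer("late DTS", date(2024, 1, 1), date(2032, 1, 1))

print(can_validate([signature], date(2025, 1, 1)))                # False: expired
print(can_validate([signature, dts_1], date(2025, 1, 1)))         # True
print(can_validate([signature, dts_1, dts_2], date(2035, 1, 1)))  # True
print(can_validate([signature, late_dts], date(2025, 1, 1)))      # False: added too late
```

The last case shows why the operation has to be repeated in time: a DTS that is added after the previous layer has expired can no longer extend its life.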

This is a condensed description of how digital signatures work in PDF. This functionality works, and there's no reason why it won't continue to work in the future. Nevertheless, you might have detected some potential problems.

Potential problems inherent to digital signatures and PDF

The adoption of digital signatures in PDF was slow, for one or more of the following reasons:

If you want people to trust the signatures in your document, you need to involve some central authorities, such as CAs and TSAs. It would be great if we could think of a system that doesn't require any central authorities.
Documents that need to be signed by different people can only be signed sequentially. Think of a conference call that requires a dozen people to sign an NDA before the call can take place. Person A receives the NDA document, signs it, and sends it to person B. Person B receives it, signs it, and sends it to person C. And so on. Now suppose that person D is on vacation the week before the conference call. In that case, all the people who come after person D can't sign until person D returns. There's a chance that person E, F, G, H,... won't be able to add their signature in time for the call. (This example is based on a true story.)
As documents are signed sequentially, person A will own a copy with a single signature, person B will own a copy with two signatures, and so on. If 12 people have to sign, person L will own a copy of the document with 12 signatures. It is then the responsibility of person L to send the document that contains all the signatures to all the previous signers. Unfortunately, this doesn't always happen. Often, people early in the chain only have copies that are missing one or more signatures. (This example is based on a recurring true story.)
Sometimes it's hard to know if you're signing the correct version of a document. Think of an agreement that was drafted in a process that involved tough negotiations. During the negotiations, different versions of the agreement were sent back and forth. There's a risk that eventually the wrong version of the agreement is signed, because different people created a confusingly high number of versions with confusing version numbers. (As you may have guessed, this example is also based on a true story.)
LTV is a cumbersome process that heavily relies on central authorities. Suppose that you've kept a signature alive for a century, and you discover that you can't create a new DSS because of a sudden, perpetual unavailability of a TSA.

PDF documents that aren't signed often have the problem that you don't know if they can be trusted. Once the document is consumed offline, there is no green bar on top of the viewer that says: this document was served to you using the HTTPS protocol, and by the way: you can trust the domain from which this page is served.

If you have a URL to a PDF document, e.g. the address of the PDF reference, you can encounter the problem that the URL doesn't work anymore. You get a 404 error message, because the document was moved to another location. We call this link rot, and whoever needed to download the PDF reference might have experienced this problem first-hand because the PDF reference has changed address many times. There are centralized services such as doi.org that provide a Digital Object Identifier (DOI) that allows you to retrieve one or more currently active URLs to a resource. DOI allows you to register and maintain the location of a document for a fee. Wouldn't it be great if we could reduce that fee, and decentralize the service?

We can solve all of these problems, and probably even some problems we didn't even know we had, by introducing blockchain into the world of PDF documents.