Basic Concepts (Glossary)
If you are new to blockchain and PDF, you'll find this short glossary useful.
Concepts related to Blockchain:
Distributed database: we talk about a distributed database when the storage devices aren't attached to one central processing unit, but are spread across a network. Some examples include:
- NoSQL databases, with well-known implementations such as MongoDB and CouchDB,
- Hadoop, which is an open source framework for storing data and running applications on clusters of hardware devices,
- Distributed Ledger Technology,
- ...
Node: a node is a connection point in a distributed network that can receive, create, store or send data from and to other nodes in that network.
Ledger: a ledger is a collection of permanent, final, definitive records of transactions.
Ledger record: a ledger record is an entry in the ledger containing information about one or more transactions.
Distributed ledger technology (DLT): DLT is a type of distributed database technology with the following characteristics:
- The records can be replicated over multiple nodes in a network (decentralized environment),
- New records can be added by each node, upon consensus reached by other nodes (ranging from one specific authoritative node to potentially every node),
- Existing records can be validated for integrity, authenticity, and non-repudiation,
- Existing records can't be removed, nor can their order be changed,
- The different nodes can act as independent participants that don't necessarily need to trust each other.
Combined, these characteristics make DLT a great way to keep a ledger of records in a trustless environment.
Blockchain: blockchain is a type of DLT in which records are organized in blocks that are appended to a single chain using cryptography and distributed consensus. Each block contains a timestamp and a link to a previous block. This ensures that data in any given block can't be altered retroactively without the alteration of all subsequent blocks. This approach makes blockchain technology a good choice for the recording of events, records management, provenance tracking, and document lifecycle management.
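The linking of blocks can be illustrated with a minimal sketch. It assumes SHA-256 as the hash function and JSON as the serialization; the names `Block`, `append_block`, and `is_valid` are hypothetical and don't belong to any particular blockchain implementation. Distributed consensus is deliberately left out: the sketch only shows why changing one block breaks the links of the blocks that follow it.

```python
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    index: int
    timestamp: float
    records: tuple
    previous_hash: str  # the "link" to the previous block

    def hash(self) -> str:
        # Hash a deterministic serialization of all fields of the block.
        payload = json.dumps(
            {
                "index": self.index,
                "timestamp": self.timestamp,
                "records": list(self.records),
                "previous_hash": self.previous_hash,
            },
            sort_keys=True,
        ).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_block(chain, records):
    """Return a new chain with one more block linked to the current last block."""
    previous_hash = chain[-1].hash() if chain else "0" * 64
    block = Block(len(chain), time.time(), tuple(records), previous_hash)
    return chain + [block]


def is_valid(chain):
    """Check that every block still points at the hash of its predecessor."""
    return all(
        current.previous_hash == previous.hash()
        for previous, current in zip(chain, chain[1:])
    )


chain = []
for records in (["record A"], ["record B"], ["record C"]):
    chain = append_block(chain, records)
print(is_valid(chain))  # True

# Altering the middle block changes its hash, so the next block's link breaks.
forged = Block(chain[1].index, chain[1].timestamp, ("forged",), chain[1].previous_hash)
print(is_valid([chain[0], forged, chain[2]]))  # False
```

To hide the change, the forger would have to recompute the link in every subsequent block, which is exactly what the consensus mechanism of a real system is meant to prevent.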
Centralized, decentralized, or distributed storage: there are blockchain systems where a single instance of the ledger is stored on a central server that acts as the broker of the data. More commonly, the data lives on different nodes. With decentralized storage, a copy of the ledger is stored on specific "super-nodes". With distributed storage, the ledger is replicated on every node.
Permissionless or permissioned blockchain: a permissionless blockchain is a DLT system where no authorization or authentication is needed, and nodes and users are unknown. In a permissioned blockchain, nodes must have a member identity; authorization and authentication are mandatory.
Public or private blockchain: in a public blockchain, any node can join to read blocks and records, append records, and participate in the consensus mechanism. In a private blockchain, only nodes that have been granted authority have that access.
Centralized, decentralized, or distributed ledger control: in the case of centralized control, one authority, e.g. a central server, decides on the validation of a new block of records. With decentralized control, a central authority delegates the validation of new blocks to a limited number of nodes. With distributed control, all the nodes work together using a consensus mechanism to validate a new block.
Consensus mechanism: a consensus mechanism is the procedure by which the nodes reach agreement on the validity and consistency of the records and blocks that are being added to the blockchain. The consensus mechanism also guarantees the order of the records in a distributed ledger. A consensus mechanism can be implemented in many different ways (e.g. Bitcoin requires a proof-of-work), but describing them would go beyond the scope of this ref card.
Although blockchain isn't a synonym for DLT, the industry started using blockchain as the common name for all kinds of distributed ledger technologies, probably because blockchain sounds easier than DLT and is a catchier word to market.
You'll read articles criticizing blockchain (or DLT), arguing that it's slow. Authors of such articles often refer to Bitcoin, which is a public, decentralized system that uses proof-of-work. They overlook that our examples are completely agnostic of the type of blockchain that is used. A permissioned blockchain with centralized storage and centralized ledger control without a consensus mechanism is much faster than a public proof-of-work system.
Blockchain shouldn't be seen as a competitor to traditional database systems. Traditional databases will probably be faster and easier to use than distributed ledger technology. Moreover, blockchain shouldn't be used to store data that you don't want to expose to the outside world. The most important advantage of blockchain over other database systems is the immutability of the records. Those records should be used to store metadata, such as the message digest of the bytes of a PDF document.
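As an illustration of that last point, the following sketch computes a message digest over the bytes of a document with Python's standard `hashlib` module. The function name `pdf_digest`, the choice of SHA-256, and the shape of the `record` dictionary are assumptions made for this example; the sample bytes stand in for the content of a real PDF file.

```python
import hashlib


def pdf_digest(pdf_bytes: bytes, algorithm: str = "sha256") -> str:
    """Return the hex-encoded message digest of the raw bytes of a document."""
    return hashlib.new(algorithm, pdf_bytes).hexdigest()


# Placeholder content; in practice you would read the bytes of the PDF file,
# e.g. with open(path, "rb").read().
sample = b"%PDF-1.7\n% placeholder bytes, not a complete PDF\n"

# Hypothetical ledger record: only metadata is stored, never the document itself.
record = {"algorithm": "sha256", "digest": pdf_digest(sample)}
print(record)
```

Only the digest ends up in the ledger, so the document itself is never exposed, while anyone who holds the file can recompute the digest and compare it with the immutable record.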
Concepts related to PDF
Document: a document is a piece of written, printed, or electronic matter that provides information or evidence or that serves as an official record.
Portable Document Format: the Portable Document Format (PDF) is a file format for capturing and sending electronic documents described in a series of standards, such as:
- ISO 32000-2: the most recent core PDF specification,
- ISO 19005-2 and 19005-3: the most recent standards for the long-term preservation of PDF documents, aka PDF/A,
- ISO 14289-1: the standard for universal accessibility in the context of PDF, aka PDF/UA,
- ETSI TS 102 778: a series of standards for PDF Advanced Electronic Signatures (PAdES),
- ZUGFeRD: a standard for invoices that combines PDF/A-3 and the UN/CEFACT Cross Industry Invoice (CII) standard.
PDF ID: every PDF can be identified by an ID that consists of a pair of identifiers. The first identifier is created at the time a new PDF file is created, and it's permanent in the sense that it won't change when the PDF is updated. The second identifier is initially identical to the first identifier, but it changes each time the document is updated. The ID of a PDF needs to be unique. Two PDF documents with the same ID should be exact copies of each other; both files should contain the exact same bytes in the exact same order. If the two identifiers of a document's ID pair are identical, then you know that the document is a first version. If two different PDF documents share the same first identifier but have different second identifiers, then you know that both documents are somehow related to each other.
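The rules for interpreting an ID pair can be written down as a small sketch. It assumes the two identifiers have already been extracted from the PDF as byte strings; the names `PdfId`, `is_first_version`, and `relationship` are hypothetical and aren't part of any PDF library.

```python
from typing import NamedTuple


class PdfId(NamedTuple):
    original: bytes  # first identifier: fixed when the file is created
    current: bytes   # second identifier: changes with every update


def is_first_version(pdf_id: PdfId) -> bool:
    """A document is a first version when both identifiers are identical."""
    return pdf_id.original == pdf_id.current


def relationship(a: PdfId, b: PdfId) -> str:
    """Classify two documents by comparing their ID pairs."""
    if a == b:
        return "identical"   # should be exact byte-for-byte copies
    if a.original == b.original:
        return "related"     # same origin, different versions
    return "unrelated"


first = PdfId(b"\x01\x02", b"\x01\x02")
updated = PdfId(b"\x01\x02", b"\x0a\x0b")
other = PdfId(b"\xff\xee", b"\xff\xee")

print(is_first_version(first))       # True
print(is_first_version(updated))     # False
print(relationship(first, updated))  # related
print(relationship(first, other))    # unrelated
```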
These concepts will be important once you start reading about the mechanisms iText has patented.
About our Blockchain patents
The mechanisms outlined in our patents can be used for any type of file (documents in any format, raw data in JSON or XML format, and so on), but we've chosen to focus on PDF because the industry has accepted the format as being reliable, trustworthy, and ubiquitous. Furthermore, PDF has some unique features. The concept of the PDF ID is one example. The fact that PDF can be used as a container for different types of media (XML, video, audio, ...) is another.