1.1. Chapter

When talking about blockchain we are talking about a system.

Note

What is a system? Merriam-Webster defines system as “interacting or interdependent group of items forming a unified group”. This isn’t the best, or most concrete, definition so for this textbook we define system as a group of computers, nodes, trying to work on a singular application. This singular application can be as big or as small as it wants.

System architecture is a whole course in itself, at Cornell its CS4410, so we will not be going into any sort of depth on the topic. All we need to know is that systems can be structured in three basic ways:

  • Centralized: The whole network revolves and depends on a ‘leader’ node (i.e. Facebook).

  • Decentralized: The whole network revolves and depends on a group of ‘leader’ nodes (i.e. Tor).

  • Distributed: A network composed of equal peer nodes that are variously connected between each other. There is no ‘leader’ node (i.e. I.o.T).

Tip

The definitions of decentralized and distributed are often used interchangeably. You might find blockchain being called distributed on one website and decentralized on the next. Formally the two terms are different but for alot of online sources on the blockchain, including this textbook, they mean the same thing which is that the system isn’t centralized.

The three types all have their uses and there is no one type that is better overall. Saying that, we have to concede that an overwhelming majority of consumer-facing applications are built on centralized systems.

Why? Centralized systems are easy to control and run, so they are often the first choice for already centralized companies. But centralized systems aren’t perfect.

  • Centralized systems are fragile in that if the ‘leader’ node goes down so does the whole system.

  • Censorship is far easier to implement on a centralized system than on one of the others.

  • Centralized systems have limited privacy.

A good example of how centralized systems can go wrong was Facebook’s 2021 outage caused by a maintenance failure on their centralized servers.

Distributed and decentralized systems largely address the above, so we might think they are the solution. But they, too, have problems. A Byzantine problem.

1.1.1. The Byzantine General’s Problem

../../_images/Byzantine_Generals.png

Formally, the Byzantine General’s Problem (BGP) is a game theory problem plaguing distributed systems. The problem goes as follows:

A group of generals besieged the city of Byzantium. To take the city they have to attack all at once BUT:

  • The Generals can only communicate through unreliable messengers who might have been infiltrated by Byzantium.

  • Some generals may be traitors/liars.

There are many adaptations of this problem but the issue is the same: How do you work with a group of peers you don’t trust over communication that is unreliable.

This problem affects distributed systems because of the equality of nodes. There is no gatekeeper, anyone can join the system and communicate with whom ever they want. This means your peers may be malicious actors trying to subvert the system, or they may be honest nodes that just have very bad internet, so they get only some of the critical messages needed to support the system.

Although BGP is at heart a theoretical problem it has a number of real-world applications. These include:

  • Communication between nuclear submarines

  • An international team of developers trying to build a product (i.e. Whatsapp)

  • The Internet

One of the largest distributed/decentralized systems is after all the internet. Anyone can host a website and anyone can visit. This made managing payments very hard. Some websites use credit cards, but these can be misused by malicious websites, while others just trusted the customer would pay when the product would physically be delivered. In short BGP meant there was no clean solution to online payments.

1.1.2. Blockchain as a solution

On Halloween in 2008 a solution to the payment problem was announced under the whitepaper Bitcoin: A peer-to-peer electronic cash system. Although the author never used the term, the whitepaper created Blockchain, a technology and data-structure that solved BGP.

1.1.2.1. Storing the Data

For the remainder of this chapter we will be looking at a subset of the solution: storing the information the generals send each other. The following chapters will complete the solution.

We need a data-structure that can store information among a trustless system. This data-structure needs to address the following:

  • How the actual data will be structured.

  • How will data be added/edited.

  • How will we make sure the data is valid.

  • How will we make sure everyone has the same set of data.

Note

By trustless we don’t mean a system with 0 trust but rather a system you can trust even if you don’t trust the participants. This might seem paradoxical but its actually one of the wonders of the blockchain and in this chapter we will take a peak into how this is possible.

1.1.2.2. Conventional Methods

Traditionally, when we want to store big sets of data we look at databases. Databases are an organized collection of data often arranged into a set of tables. To look at whether we can purely use databases lets try to answer our 4 questions:

  • How will the data be structured? - In groups of inter-related tables.

  • Adding/editing the data? - As we are working with a distributed system which anyone can join the answer is anyone can add and edit data.

  • Validity of data? - As anyone can add/edit data we don’t really have a way to efficiently check if the added/edited data is correct.

  • Same version of data? - We have to centralize the system to the same copy of the database for everyone, a concurrency nightmare not to mention that it stops being a decentralized system.

Note

We intentionally put purely databases above. This is because databases are in many ways an inherent way to store data and many blockchain implementations do actually store their local version of the blockchain in a database. The above was meant to showcase that we can’t only depend on conventional methods, that is we need something extra, something new.

Clearly using just databases isn’t enough. Not having a gatekeeper is looking like a problem. By going over the questions we realize that conventional methods don’t quite cut it. Our ideal data-structure would have the following properties:

  • Is ordered.

  • Has clear rules on how data is added and who can do it.

  • Data in the structure can be easily verified by anyone

  • Is immutable, this means that once data is added it can’t be deleted or edited.

  • Can be easily shared among nodes in the system.

  • Anyone can contribute to it but can only submit valid data.

1.1.2.3. Blockchain Implementation (BADS)

The blockchain meets the requirements above through the following data-structure:

  • Data is ordered into blocks and structured by Merkle Trees.

  • Blocks are united through a linked list.

  • Blocks are added on a regular basis to the linked list and must be valid to be added.

  • Everyone has an identical version of the above and can propose new blocks that should be added.

This composite data structure is the Blockchain-as-a-data-structure (BADS)

Tip

We just used alot of new definitions like linked lists, merkle trees and blocks. If this looks overwhelming don’t worry we will cover all of them in the rest of the chapter. Once you have read what all of these terms mean it could be useful to go over the above again.

Linked Lists are pieces of data that point to each other thereby forming a chain. They are an incredibly important data structure in all of computer science, as anyone who has taken Cornell’s CS2110 or ever prepared for a S.W.E interview knows. A piece of data, also called a node but not to be confused with nodes in a system, either points to another node or to nothing, but importantly a node can only point to one other thing and is pointed to by a maximum of one other node. This means two different nodes cannot point to the same node.

Linked Lists have the following benefits:

  • Clear process to add data, you just point your new data to the most recent data already in the linked list.

  • Immutable. To replace one node is to replace all the nodes that point to it either directly or indirectly.

As such linked lists have solved several of our desired B.A.D.S qualities. Namely: An ordered data structure,Clear process to add data,Immutable.

Note

Linked Lists, in their basic form, have ‘soft’ immutability. By this we mean it is a nuisance to delete data, but it can still be done, and editing is unaffected. But, as we will see, Blockchain’s linked list has ‘strong’ immutability. By principle, Blockchain linked lists are add-only, so if you are to act honestly you will only every add nodes, not edit or replace them.

../../_images/Linked_Lists.png

Blocks are the nodes in BADS’s linked list. Blocks are composed of a header and data. The data the block contains must be validated by data in preceding blocks. Blocks point to each other by storing the hash, a sort of data signature, of the previous block, there is no ‘physical’ link between the nodes.

Blocks and the linked list give the technology its name as: Block + Chain.

The former part coming from nodes and the latter part coming from the fact that linked lists ‘chain’ data together. We explored the expanding tail of the blockchain linked list but what about its head? What is the ‘first’ block and what does it point to? In blockchain, we call this block the genesis block.

Genesis Block is the oldest block in the linked list and the only one that points to nothing. It comes hard-coded in most blockchain implementations and is the only block that is inherently true. This means our trust in the blockchain is partly based on our trust that the data in the genesis block is valid. Genesis blocks often don’t contain alot of information and are often used for grand statements, like Bitcoin’s.

The Times 03/Jan/2009 Chancellor on brink of second bailout for banks – Bitcoin Genesis Block

Blocks and the linked list address another desired characteristic of BADS: sharing the data structure is efficient. To update the data structure we only need to share the latest block, each node can then add this block to its own private and much larger blockchain. The initial blockchain ‘download’ is still slow, but now it only has to be done once. We still need to guarantee that every node gets the same block and that its linked list is in the correct order. This is called reaching consensus and Blockchain achieves it through several consensus algorithms like PoW or PoS.

We often mentioned above that data in a block has to valid and that only the genesis block has the luxury of being automatically correct. But what does this actually mean?

For those who have done some discrete math, i.e. have done Cornell’s CS2800, this process is similar to induction where the genesis block is the base case and adding blocks is the inductive case.

For those who haven’t done any discrete math we can look at it as follows:

  • The data cannot be proven incorrect by the data in preceding blocks.

  • The data must have the same structure as data in the preceding blocks.

The above is only a generalization as the exact definition of what it means to be correct depends on blockchain.

Example

Lets look at a theoretical blockchain that tracks how much legos you own.

  • The genesis block says you have 1 lego

  • The block after genesis says you lost 1 lego

  • If any block after that says you lost more legos the data in that block is invalid as you can’t have less than 0 blocks.

We now have a way to organize data, a block chain, but how do we organize data within blocks? This is an important question as we need to guarantee the following:

  • The data must be clearly ordered so different nodes can have the same version

  • The data has to have a ‘signature’ that we can use to quickly check if it’s the same as the data of other nodes

  • The data has to be immutable

  • We need to guarantee that data is unaltered when it is sent from one node to another, this is to protect against unreliable or malicious communication channels.

Using purely a database would introduce a smaller version of our original problem and a linked list would be too complicated. Fortunately we have a data structure that addresses all of our demands, a Merkle Tree.

On Trees

Merkle Trees are a type of tree. Trees are an incredibly useful and widespread data structure and their use-cases go far beyond blockchain. Although we will not cover trees in depth we need to understand their basic architecture and some key terminology. Trees work much like linked lists in that they are composed of nodes pointing to other nodes. But in the case of trees multiple nodes, the children, can point to the same node, the parent. Nodes that have no children are called leaves. All but one node point have a parent node, the one that doesn’t is called the root. You can get from any node in the tree to the root. If we plot this relationship it looks like a tree, hence the name. Binary trees are trees where every parent can have at most 2 children, this means a node can be pointed to by a maximum of two other nodes.

../../_images/Merkle_T.png

Merkle Tree is a binary tree where every node has a hash, digital signature, which is the hash of the hashes of its two children. The actual data is stored in leaves whose hash makes up the first layer of nodes that eventually connect to the Merkle root which is a unique digital signature of the whole set of data. This data signature has a given size, but it can represent a data set of infinite size. We cannot understate the importance of this data structure in blockchain. It appears everywhere in nearly every implementation and add-on, after BADS Merkle Trees are the data structure of the blockchain.

Merkle Trees largely rely on hashes which, in a simplified sense, are a unique fixed-size signature of the data provided. The idea that every piece of data, be it an image or just a combination of other hashes, has a unique fixed-sized representation is what gives Merkle Trees their power. Hashing is a fascinating and integral part of blockchain and cryptography in general. We will cover it in depth in the next chapter.

Merkle Trees have multiple benefits and features that address our desires for storing a concrete data:

  • Thanks to the merkle root we can quickly determine whether two sets of data, perhaps stored by two different nodes with unreliable connections, are exactly the same, in both structure and content.

  • The merkle root also makes it very easy to see if anyone edited the data as the root would change, and it is very difficult, to impossible, to keep the changed data but revert the root back to its old version. This makes merkle trees immutable.

  • Changing a single piece of data means the whole merkle tree becomes invalid and to make it work again you need to recalculate all the hashes to get a new valid root.

  • The merkle root acts as a signature of the data provided so, in some cases, rather than send all the data we just send the much smaller merkle root.

Data within blocks is therefore stored in merkle trees, and we can represent this data through a small fixed-sized merkle root that is, almost, guaranteed to be unique.

We talked alot about the data stored by the block but are still missing its ‘brain’, the part that references the preceding block and contains information about the block , like the merkle root. The blocks ‘brain’ is the block header.

Block Header defines the block’s properties. Every blockchain implementation has a unique header structure. Regardless a very generalized block header contains the following:

  • Version - The version of the blockchain implementation when the block was added. Like all software, blockchain implementations have updates and so different versions that can affect how we determine whether a block is valid.

  • Preceding Block Hash - The hash of the latest block before this one. This creates the linked list.

  • Merkle Root of Data - The root of the merkle tree storing all the block’s data. This is very useful as a node can quickly determine whether its local version of the data is valid.

  • Timestamp - The timestamp of when the block was added to the blockchain.

All the fields of the block header are filled out by the creator of the block. This concludes our discussion of the core BADS. A linked list composed of blocks as nodes which store concrete data into merkle trees and are coordinated by their headers.

We now know how to build a blockchain as a data structure, but we don’t know to run it. The discussion above has left us with questions like who adds and validates blocks and how is ‘strong’ immutability implemented into BADS.

1.1.2.4. Maintaining BADS

Let’s start with adding and validating blocks and tackle immutability later. Adding and validating blocks forces us to introduce another aspect of Blockchain, networks.

Networks are in the most simple sense just a set of connected computers that share data. Systems work on top of networks. Blockchain in particular is heavily dependent on networks as a distributed system, and we will be devoting a whole chapter to them.

For now blockchain can be considered as a network where BADS is the data shared. To determine who adds and validates blocks is to decide which nodes add and which nodes validate the data they all share. This is still a distributed system, so we somehow need every node to independently, but identically, agree on which block will be added. This is called achieving consensus and is the responsibility of consensus algorithms, which we will cover in detail in the following chapters.

There is broad range of consensus algorithms and there is still active research invested in them. Consensus algorithms can have very diverse implementations so the following generalization does not necessarily apply to all of them.

A generalized, but heavily inspired by PoW and original implementations of PoS, consensus algorithm requires that all nodes store BADS and actively participate in trying to add a block or validate a candidate. The general process for a node goes as follows:

  1. A node collects, and verifies, data to put into its block, fills out its block header and then tries to validate it. Block validation goes beyond the validity of the data and is the main point where consensus algorithms converge. Block validation is always meant to take some noticeable amount of time so the network isn’t flooded with candidate blocks.

    1. The node successfully validates its block at which point it broadcasts its candidate block to the rest of the network.

    2. Before the node can validate its own block another candidate block comes in. At this point the node stops working on its own block and checks if the candidate is valid. In the case it is valid the node abandons its attempt, adds the candidate to its own blockchain and tries to create the block that follows it.

Achieving consensus therefore happens at an individual node level, but it happens simultaneously. What makes a block valid is known to the whole network, via the consensus algorithm, so every block can individually determine whether a block is valid, and it expresses its consent by adding it to its own local blockchain. When this happens at a majority of nodes we consider the block to be successfully added to the blockchain.

But what about immutability? Can’t someone just produce a new set of valid blocks that replace existing blocks and then publish them to the network that accepts them? In theory yes but the likelihood of something like this happening is very low. We first must understand that at anyone time there may be multiple competing chains and the valid chain is the one that is the biggest/longest. This is because creating, and broadcasting, a valid block is a time race and so as long as a majority of nodes are honest, and accept that blockchain is add-only, the attackers won’t be able to catch up with the longer, valid, chain that the honest nodes add onto.

Maintaining BADS is difficult to comprehend. There is no one copy of BADS as every node stores its own. The ‘current’ version of BADS can be understood as the version a majority of nodes currently store. This brings in the concept of ‘identicalness as a requirement’ where consensus algorithms guarantee every node has the same version of BADS. This requirement also essentially makes BADS impossible to destroy as to do it one would have to destroy the local version of every node.

An honest majority

We based alot of BADS maintenance on the idea that a majority of nodes are honest. This might seem like a design flaw, but it is one of the central assumptions of a valid blockchain. As we will discover later in the book this assumption mostly holds but there have been cases when it hasn’t with tragic results.

1.1.3. Implications of Blockchain being a solution

Coming back to BGP, we now have a solution on the format of the messages that the generals send to each other. The messages are put in a blockchain, the genesis block can contain none or some very basic information they all agree on like how many generals there are, and each general has a local version. Generals don’t need to trust the messengers, as a malicious message would not be supported by previous data in the blockchain, or other generals. They only need to trust the system and that a majority of the generals are honest.

This solution has an important implication, the concept that we can trust the system rather than any particular participant in it. We trust the blockchain, we don’t need to trust any individual node/general/website. This is partially achieved by every node verifying everything on its own and trusting no particular block/data sent to it.

From our discussion above we can determine several benefits of BADS:

  • Censorship-Resistant - Anyone can join a distributed system.

  • Very stable - For BADS to disappear every node has to delete its local version of it, a very unlikely event.

  • Works well over an unstable connection - With Merkle Trees and immediate verification upon receiving, data that is kept is guaranteed to be what the sender intended it to be.

  • Verification is conducted individually so there is no need to trust any individual participant

There are however also drawbacks to BADS:

  • Inefficient by design - Every participant has to store the exact same version of the blockchain which from a storage standpoint, is as inefficient as it gets.

  • Limited Data - Although we haven’t gotten into this just yet, the verification of data is difficult and so a blockchain is often limited to only one type of data it can store.

  • Dishonest Majority - There is always the small possibility that a majority of the nodes are dishonest in which case our concept of a valid blockchain collapses.

  • Blockchain Bugs - Because of how blockchains are implemented, if there is a bug it is, like the data, immutable. We will cover this in detail in a later chapter.

This doesn’t mean BADS is inherently flawed, it just goes to show that BADS isn’t a universal solution to everything. Like databases and centralized systems, blockchain has use-cases it’s useful for and use-cases it’s not.

Note

One of the reasons why we introduced you to blockchain through Byzantine Generals is because we want to address a common misconception: blockchain isn’t only finance. Although Bitcoin and DeFi are successful financial applications of the blockchain they aren’t its only applications.

1.1.3.1. Blockchain Jargon

Like any field, Blockchain has alot of jargon that is important for us to understand as we progress through the book.

  • Genesis Block - The first block of the Blockchain and the only one that points to nothing.

  • Block Height - The number of blocks added since the genesis block, 0-indexed.

  • Block Header - Where all the information about the Block is stored.

  • Merkle Tree - A cryptographically ordered binary tree data structure.

  • Block Hash - The data signature of the Block header.

  • Chain - A linked list of blocks ending in the genesis block. At any one time there may be multiple chains competing to become the valid chain.

  • Valid Chain - The chain that is the correct chain of blocks. Only data in the valid chain is considered valid by the blockchain.