What is a BFT consensus protocol?
Any blockchain must organize the transactions in a linear order of the time of their execution, which may be different from their time of submission. Instead of trying to achieve consensus on individual transactions, it is more efficient to group them in blocks which are a finite ordered list of transactions. If a block is accepted as valid and a consensus has been achieved, that means the same thing as achieving a consensus on all of the included transactions in the mentioned order. Blocks are ordered in a sequence which is aptly named as a block-chain or simply blockchain. The chain has a beginning called the genesis block. Everyone starts off with the belief that the genesis block is valid.
The distance of any block from the genesis block in term of the number of blocks in-between is called the height of the block. The height of the genesis block can be set as zero. The height of the first block after that is one, and so on. The job of the blockchain protocol is to make sure that every honest party agrees on the block that is acceptable given any particular height.
In a BFT protocol, each participating party sends messages to other parties in a prescribed manner until they come to a conclusion about a certain candidate block. There is the requirement that once an honest party is able to decide a certain transaction to be valid and agreed upon, no other honest party would deny that. Each honest party confirms a consensus by making sure that a large majority of other parties (honest or dishonest) agree to the same block for that particular height. The consensus algorithm is run fresh for each height, one after another.
Each party needs to know the public key of the other parties to validate their signatures. Otherwise, one single dishonest party can send the P-messages on behalf of everyone to create a consensus of his choice.
Why use a BFT protocol?
Blockchain BFT protocols provide a better way of achieving block consensus than the proof of work used in Bitcoin.
A theorem from Fisher, Lynch and Paterson rules out any possibility of achieving a consensus with even one faulty process in a completely asynchronous setting. So, in order to have a consensus, we assume some form of weak synchrony. Weak synchrony implies that the processes have almost synchronized time and the network is connected for some time during which the messages are guaranteed to be delivered within a specified amount of time.
BFT protocols have the following advantages over proof of work algorithm used in Bitcoin –
- There are two properties that we need for consensus, safety and liveness. Bitcoin guarantees liveness under any condition, but safety is guaranteed only when the network is connected. On the other hand, BFT provides safety under any condition and liveness when the network is connected.
- Proof of Work consumes a huge amount of computation resources. In contrast, BFT does not require such expensive computation.
The only drawback is that the chain must know all of the miner’s public keys. This can be resolved by doing a Proof of Stake algorithm along with the consensus. Currently out of the scope of this blog.
GoSig BFT protocol
GoSig is a BFT consensus protocol that provides security even in the public network.
In this blog, I will try to develop the protocol in steps to explain the concepts behind it.
As discussed in the previous section, the GoSig protocol must satisfy the 2 properties of consensus i.e Safety and Liveness.
- Safety: Whenever any honest user commits a block for a given height, no other honest user can commit a different block at the same height even when the attacker has full control of the network. This ensures that the commits are final and unchangeable. So once a transaction is included in a block committed by any honest user, he knows for sure that that transaction cannot be undone. Note that only the honest party himself can know that since any other party cannot know whether he is honest or not and a dishonest party can do anything including committing something for no good reason. The protocol needs to ensure this property even when the attacker has full control of the network and can create an arbitrary number of network partitions.
- Liveness: It should always be possible to commit a new block by all honest parties at the next height as long the network connection works, i.e. the attacker loses control of the network. Basically, this condition is about not getting into a deadlock. This is important for the chain to proceed. Of course, the parties cannot commit if the network does not forward their P messages, so this has to only work when the network is connected.
Step 1 – A first cut BFT algorithm:
We assume that 2 third users are honest. Every user is able to make a choice somehow or the other. There is a possibility of not all choices are visible to every honest user when the network is fragmented or due to delay in the network.
Based on the above assumptions, every honest user should automatically agree on a choice as long as the network is connected.
Let’s look at what happens when the network is broken. For every height, any honest party does the following.
- The honest party selects one from the possible choices for the next block and chooses one of them by some deterministic means.
- The honest party then broadcasts a signed P message for his chosen block and waits for the others to broadcast theirs.
When the honest user receives 2f+1 votes from other parties for any particular choice for the block, he commits this block as the chosen one.
Let us see if the above algorithm satisfies the ‘safety’ property. If any block receives 2f+1 P messages, at least f+1 of them must be from honest parties, because the maximum number of dishonest parties are the only f. These honest parties will never send P message for a different block. So the maximum number of P messages that can be produced for any other block must be (3f+1) – (f+1) = 2f which is not sufficient for a commit. So, if some block is committed for a certain height by an honest party, no other block can be committed by any honest party. So the first condition is satisfied. This is great!
However, it might happen that the honest parties accidentally chose more than one different candidates for the next block. For example f honest users may P message candidate_1 and f+1 users may end up P messaging candidate_2. The attacker simply does not do anything and enjoys what’s going on the system. In this case, no choice of the block has sufficient P messages for commit, so the system is stuck. The second property of liveness has been violated even when the network is connected.
Step 2 – Network Deadlock
We need some way to get out of the above deadlock situation. The obvious idea is of course to restart the whole protocol over whenever there is such kind of a deadlock. The problem with this is its impossible to know when a deadlock has happened from just incoming P-messages.
Suppose we conclude that a deadlock has happened after receiving some s P-messages from different parties and the party still fails to commit. Now, let’s say in some case f honest users may P message candidate_1 and f+1 users may end up P messaging candidate_2. This is obviously a deadlock and hence these 2f+1 messages must be able to detect it and hence s ≤ 2f+1.
Suppose, now in a different situation 2f (group A) honest users have P-messaged candidate_1 and 1 (group B) honest user did not P-message anything yet (because he is yet to receive the possible candidates). Now, the attacker can send a P-message candidate_2 to group A which would convince the group A to abort (it has 2f+1 P-messages and still cannot commit, and s ≤ 2f+1). After group A aborts this and restarts the protocol, suppose they all decide candidate_2 (maybe because they have come to know of candidate_2 now). So group A now has 2f P-messages for candidate_2. The attacker now provides 1 P-message for candidate_2 so that group A can now commit candidate_2.
Notice that all this time, group B has not been able to either abort or commit any block. So, now the attacker can replay the recording of the first round for group A (showing evidence of 2f P-messages from group A for candidate_1 from his saved messages collected from the round) and adds his own 1 P-message for candidate_1. This provides 2f+1 messages for candidate_1 to group B and hence group B commits candidate_1. In effect, group A committed candiate_2 and group B committed candidate_1 for the same height. This breaks our condition 1.
The easiest solution to this problem is to timeout around after some reasonable time if a commit cannot be achieved before that. As long as the clocks of each processor are more or less same, this should make sure everyone is in the same retry round. This assumption is a synchrony assumption called partial synchrony. We will go into more details about it later. For the time being, let us just assume that all nodes maintain the exact same time and message propagation time is zero for any message.
The next proposed solution is very simple.
- The honest party selects one from the possible choices for the next block and chooses one of them by some deterministic means.
- The honest party then broadcasts a signed P message for his chosen block and waits for the others to broadcast theirs.
- When the honest party receives 2f+1 P-votes from other parties for any particular choice for the block, he commits this block as the chosen one.
- When an honest party has passed a fixed amount of round time (say R), he forgets everything and restarts the process from step 1.
It solves our liveness problem, but the safety property is broken now. Suppose 2f+1 parties send P-messages for candidate_1. The messages reach only 1 honest party who commits candiate_1. The rest of the user waits until R time has passed and restarts the process where they can easily commit candidate_2 with one P-message from the attacker.
Step 3 – Adding tentative commit of TC messages:
To avoid both of the problems, we need to put one extra step of messages that we would call the tentative commit or TC messages. The idea is to separate P-messaging for a different block in the new round and actually committing a block in the previous round. Sometimes a user needs to observe the new round while he still did not decide anything for his current choice for the candidate block. Every honest user does the following in every round.
- The honest party selects one from the possible choices for the next block and chooses one of them by some deterministic means. If he has TC-ed a block in any previous round, he must P the same block again if that is one of the candidates, otherwise, he does not P anything.
- The honest party then broadcasts a signed P message for his chosen block and waits for the others to broadcast theirs.
- When the honest party receives 2f+1 P-votes from other parties for any particular choice for the block, he sends a TC message that contains a proof of all the P-messages. Basically, he copies all the signatures contained in each of the P-messages for the same candidate block. This is to ensure that a dishonest user cannot create a TC message without first receiving 2f+1 P messages for the same candidate. After this, he would send no more P messages for the higher round for a different block (but will do it if he sees the same block suggested by the leader).
- Whenever a user sees any TC message in a round, he immediately TC-messages the same block. This is easy, he simply copies the proof of P-messages and signs it.
- Whenever a user finds 2f+1 TC messages for the same block for the current round and the current height, he commits it.
- When an honest party has passed a fixed amount of round time (say R), he forgets everything and restarts the process from step 1. If the user has TC-ed a block in this round, he must keep proposing and P-voting the same block in later rounds until he TC-ed for a different block in a later round.
This was much more complicated than the previous one. This is mostly because it has a new step and lots of nuances. Let’s see how this satisfies the safety and the liveness properties.
Safety property: We need to prove that if any honest party commits a block, there would be no more commits at the same height. This ensures that there is only one possible commit for any height.
When an honest party commits a block, he must have observed 2f+1 TC messages for the same block. Since the attacker can only send f of them, f+1 TC messages must have come from honest users. These honest users will never P-vote for the same height for a different block. So, any further rounds can only have a maximum of 2f P-votes for any other block which would not even be sufficient for a TC message. Without any TC message for any other block, no honest user will commit any other block at any further round, proving our case. Two different commits cannot, of course, happen at the same block because of the fact that it is not possible to make TC message for two different candidates at the same height. If basically proves that even though honest parties can commit a block for the same height in different rounds, they would always commit the same block.
Liveness property: We need to make sure that no matter what happens in the network and message order and whatever the messages sent by the attacker, when the network recovers, all honest users will commit some block. We break it into the following cases.
- If no TC-messages are sent in any round before it is over, there is no change in the system. The whole process just restarts. This won’t go on forever as in every round there is a probability of this not happening.
- If there are some TC-messages created, 2f+1 P-messages must be present in this round supporting the same candidate, because the proof of 2f+1 P-messages is contained in the TC message. f+1 of them must be sent by honest parties. Hence the maximum number of P-votes for a different candidate in the same round is 2f which is not enough to create a TC-message. So, if a round has one TC message, it does not have any other supporting a different candidate. This means the TC messages in the same round always support the same candidate. This makes it okay to forward a TC-message whenever an honest party sees it.
- If the TC-ed users are honest, he would eventually be promoted to the leader when the network is connected, at which point he would P-message the same block and everyone will agree.
- Since an honest user always signs a TC-message that he sees in around, when the network is connected, every honest user will see such message and will send TC-message supporting the same block. Since there are at least 2f+1 honest parties and there is only one candidate that can get TC-messages, this candidate will get 2f+1 TC-messages when the network is connected.
Now that we have a way to achieve consensus, I would delve into how block proposals are created by any party.
Creating the block proposals:
Given that blocks can be proposed by anyone, we need to handle the following situations
- If the attacker is allowed to create lots of block at any point in time, the attacker will keep creating too many blocks to simply burden the system with too much processing. Such an attack is called a denial of service attack.
- If the attacker can create too many blocks in a round to consider, it would become improbable for all honest users to agree on which one to P-message, thus making the system go on forever without committing anything.
To tackle the first problem, we allow one party to suggest only one candidate per round, and that too only after a commit happened for a previous height. So, to suggest a candidate, any user has to first prove that there was a commit that happened in the previous round. Since a candidate will eventually but definitely get committed after 2f+1 TC-messages for the same has been broadcasted, we require that those TC-messages be presented while suggesting a new block for the next height. So, any valid new candidate suggestion for the height h must contain 2f+1 TC-messages.
To tackle the second problem, a small set of leaders are selected at random. This ensures that less number of blocks are proposed, thereby increasing the probability of commit. The random selection must be publicly verifiable to be random, otherwise, the attacker will simply choose random numbers that favour him and simply claim that he got those numbers randomly.
One way to get a publicly verifiable random number for any party would be to simply concatenate current height, current round and the party’s public key and take the hash of it. Since cryptographic hashes have uniform random distribution, this makes a good source of the random number. We say a party has been selected as a leader if his random number for the round is less than a certain value which is chosen based on the fraction of parties to be chosen as leaders per round on average. If the maximum value of a hash is N and the fraction of parties chosen as a leader be x, then the condition for selection would be H<x/N, where H is the random number for the given party (H=Hash(height||round||publickey)).
Although the above idea works, Jing Chen et. al., the creators of the Algorand algorithm had a better idea. They used a technique called a verifiable random number that has all the properties of what we had suggested, and it cannot be known to any other party before the owner of the private key discloses it. Instead of taking the hash of the concatenation of the height, round and public key, they start with the hash of the concatenation of height and round. They then use a unique signature algorithm to sign this hash using the party’s private key. A unique signature has no random element and there is only one valid signature for a single message. RSA is one such signature algorithm. Another is quite a new one called the BLS signature. They then sign the hash using the party’s private key. The random number H=Hash(S) where S=Signpry(Hash(height||round)) . The signature S is included with the random number as a proof so that it is publicly verifiable that the hash is correctly generated and the generator has no way to control it. Since the signature can only be generated by the owner of the private key, other parties cannot know about it until he discloses it.
The leader must include the signature as part of his candidate suggestion. This makes sure that the leader cannot be identified before he already made the suggestion. That way, he cannot be corrupted by the attacker beforehand.
To increase the likelihood of every honest user P-messaging the same candidate, we need to find a deterministic way to choose between multiple candidates. A simple rule can be that we choose the one with the minimum value of the verifiable random number in the first suggestion.
Now that we have a working BFT algorithm, I would like to dive into the details of network delay and clock delay.
Handling delays and timing tolerance:
Let T be the maximum difference between any two parties, S be the maximum network delay for receiving a message when the network is connected, R is the time for each round, M be the maximum processing time for each round. We consider the following under full network connectivity.
- When a round starts, the honest leaders wait for T time so that everyone is in the same round.
- After this, they broadcast their candidates by P-messaging them.
- It takes S time for everyone to get this message.
- Everyone then P-messages the correct choice.
- It takes S time for everyone to receive all P-messages.
- Everyone then TC-messages the block.
- It takes S time for everyone to receive the TC-messages and commit it.
- Let the maximum amount of time taken for processing be M
So in total, R must be greater than T+3S+M. This will always ensure a commit under full network connectivity while allowing the system to work fast.
Summary:
The correct protocol I have described here is the Gosig protocol proposed in the paper https://arxiv.org/pdf/1802.01315.pdf by Li, Wang and Chen (I changed a few little details, but they don’t matter). We went step by step into developing a working BFT protocol that can be used in a blockchain. There are quite a few others with somewhat different approaches, we do not cover them here.