Zookeeper- What is it?

Rohit Bankar
Mar 10, 2023

Zookeeper Overview-

In this era, no large-scale system can survive without a distributed architecture. Most high-scale applications need to be scaled horizontally, since scaling vertically has its limitations. But scaling a system horizontally brings its own set of unique and challenging problems: consensus, data inconsistency, process synchronization, resource management, fault tolerance, atomicity, and so on.

So in this distributed world, where does Apache Zookeeper fit in? As the name suggests, it is a service that lets us coordinate distributed systems. Here are some of the distributed-system problems that Zookeeper solves-

  1. Leader Election- In a multi-node cluster, an election can be held through Zookeeper using various algorithms. You can develop your own algorithm (discussed in detail later) or use a standard consensus algorithm for leader election such as RAFT, ZAB, or PAXOS.
  2. Cluster Management- It works as a registry for the cluster, tracking whether a server is joining or leaving the cluster, along with other miscellaneous info.
  3. Configuration Management- It can keep general application configuration that all nodes in a cluster can access.
  4. Locks in systems- Zookeeper can help maintain locks on shared resources, providing mutual exclusion and preventing data corruption and inconsistency.
  5. Synchronization service- Zookeeper can be used to implement any custom-built synchronization algorithm using its znode structure (discussed below).
(Figure: a Zookeeper ensemble coordinating a cluster of servers. The clients here are the distributed servers themselves, not external clients querying the application.)

Zookeeper can itself be distributed, running as an ensemble of multiple Zookeeper servers, to provide high availability, as shown in the figure above.

Zookeeper architecture-

Before diving deep into the complex leader election algorithms, it's important that we understand a simple leader election algorithm and how we can implement it ourselves using Zookeeper. To do that, we need a basic understanding of Zookeeper's architecture.

Zookeeper maintains a tree-like data structure for storing values. The nodes in the tree are called znodes. These znodes are similar to folders/files in the Unix file system, with some additional abilities (these abilities help us solve many distributed-system problems).

Some important properties of znodes-

  1. Stores data and can have child znodes (like a typical n-ary tree)
  2. Supports operations like create, delete, getData, exists, setData, etc. (see the sketch below)
  3. Supports username/password-based authentication and has a watch feature on znodes (to notify clients about changes to a znode's children or data)
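To make these operations concrete, here is a minimal sketch using Zookeeper's official Java client. The connection string localhost:2181, the path /app-config, and the data values are placeholder assumptions, and a production client would wait for the session to connect before issuing calls:

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZnodeBasics {
    public static void main(String[] args) throws Exception {
        // Connect to a Zookeeper server (host/port are placeholders).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000,
                event -> System.out.println("Event: " + event));

        // create: make a persistent znode holding a few bytes of data.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // getData with watch=true: the client is notified if the data changes.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app-config", true, stat);
        System.out.println("config = " + new String(data) + ", version = " + stat.getVersion());

        // setData: update the znode; the version check gives optimistic concurrency control.
        zk.setData("/app-config", "v2".getBytes(), stat.getVersion());

        zk.close();
    }
}
```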

Different types of znodes-

  1. Persistent Znode- These are permanent znodes (as the name suggests); once created, they stay forever unless you manually delete them.
  2. Ephemeral Znode- These are temporary znodes; they come and go with the client that created them. When a client goes down or its session ends, all the ephemeral znodes it created are deleted by Zookeeper.

  3. Persistent Sequential Znode- Same as a persistent znode, except that Zookeeper attaches an increasing sequence number as a suffix. For ex- pznode_0001, pznode_0002.
  4. Ephemeral Sequential Znode- Same as an ephemeral znode, except that Zookeeper attaches a sequential suffix. For ex- eznode_0001, eznode_0002.

Sequential znodes are very useful in leader elections, as discussed in depth below; a short example of creating the different types follows.
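As a quick illustration of the four modes, here is a small continuation of the earlier sketch (it reuses the zk handle from above; all paths are placeholders, and the sequential calls assume the parent znodes /tasks and /workers already exist):

```java
// Regular persistent and ephemeral znodes.
zk.create("/config",  "x".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
zk.create("/session", "x".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

// Sequential modes return the actual path, with a counter appended by the server.
String p = zk.create("/tasks/task-", "x".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
String e = zk.create("/workers/worker-", "x".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
System.out.println(p); // e.g. /tasks/task-0000000001
System.out.println(e); // e.g. /workers/worker-0000000000
```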

(Figure: Zookeeper's internal tree structure; every node present in the tree is a znode.)

Leader election-

Many distributed systems follow a leader-follower design, in which one server is elected as the leader through a consensus algorithm. Having a leader in a distributed system solves many problems and brings significant advantages.

Some of the benefits of having a leader-

  1. A single place to look for logs and metrics, which makes partial failures easier to detect and reason about.
  2. It can simply tell the other servers (followers) about important changes rather than having to build consensus for every change.
  3. It offers clients consistency, because all changes to the system state are seen and controlled by a single node.
  4. It keeps all servers in the same state by using consensus algorithms, hence providing atomicity to the client.

Example of a simple leader election using ephemeral sequential znodes-

  1. Every available server tries to create a persistent znode /election. Only the first attempt succeeds; the rest fail because the znode already exists, so exactly one server ends up creating it.
  2. We will maintain the list of servers as ephemeral znodes (each storing its hostname, e.g. server.domain.com, as data) under another persistent znode, /servers, which is used purely for storing server info.

3. We will use the watch functionality of znodes and add a watch on the /election znode, so that every server is notified whenever a child znode of /election is added or deleted.

4. Now each server that joins the cluster will create an ephemeral sequential znode under the /election node. The data of this ephemeral sequential znode will be the server's own hostname (server.domain.com).

For example, if 4 servers join the cluster, they will create znodes like /election/leader-0001, /election/leader-0002, /election/leader-0003, and /election/leader-0004.

5. After all the servers have created their znodes under /election, each of them calls getChildren("/election"), sorts the returned children, and treats the znode with the smallest sequence number, i.e. /leader-0001, as the leader. Reading that znode's data gives us the hostname of our leader.

6. Now, if our leader goes down in the future, its ephemeral znode also gets deleted, and since we have set a watch on the children of /election, every server is notified about it. Each server then calls getChildren("/election") again, and this time the smallest child is /leader-0002, because /leader-0001 was an ephemeral znode that was deleted when its server went down.

In this example, the new leader will be server-2, the owner of /leader-0002.

This is a naive algorithm that we can use for leader election in simple distributed systems; a minimal code sketch of it follows.
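Below is a minimal sketch of this recipe in Java. The connection string and hostnames are placeholders, error handling and session management are omitted, and the path names follow the example above:

```java
import org.apache.zookeeper.*;
import java.util.Collections;
import java.util.List;

public class SimpleLeaderElection implements Watcher {
    private final ZooKeeper zk;
    private final String hostname;   // e.g. "server1.domain.com" (placeholder)
    private String myZnode;

    public SimpleLeaderElection(String connectString, String hostname) throws Exception {
        this.hostname = hostname;
        this.zk = new ZooKeeper(connectString, 3000, this);
    }

    public void volunteer() throws KeeperException, InterruptedException {
        // Step 1: any server may create /election; only the first attempt succeeds.
        try {
            zk.create("/election", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignored) { /* already created */ }

        // Step 4: each server adds an ephemeral sequential child holding its hostname.
        myZnode = zk.create("/election/leader-", hostname.getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        electLeader();
    }

    private void electLeader() throws KeeperException, InterruptedException {
        // Steps 3 and 5: fetch the children with a watch and pick the smallest sequence number.
        List<String> children = zk.getChildren("/election", true);
        Collections.sort(children);
        String smallest = "/election/" + children.get(0);
        byte[] leaderHost = zk.getData(smallest, false, null);
        System.out.println("Leader is " + new String(leaderHost)
                + (smallest.equals(myZnode) ? " (that's me)" : ""));
    }

    @Override
    public void process(WatchedEvent event) {
        // Step 6: when the leader's ephemeral znode disappears, re-run the election.
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                electLeader();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```

Note that in this sketch every server re-reads all children whenever anything changes (the herd effect); a production-grade recipe, such as the LeaderLatch recipe in Apache Curator, has each server watch only the znode immediately preceding its own.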

Famous Consensus algorithms for leader election and log replication-

I am going to discuss two famous consensus-based leader election algorithms- RAFT (Reliable, Replicated, Redundant, and Fault Tolerant) and ZAB (Zookeeper Atomic Broadcast).

Let's first understand two terms in brief- Linearizability and Total Order Broadcast.

Linearizability- It guarantees recency. The basic idea is to make the system appear as if there were only a single copy of the data and every operation on it were atomic. It is the strongest form of consistency.

Total Order Broadcast- It guarantees reliable delivery and totally ordered delivery: every message is guaranteed to be delivered to all nodes, and all nodes see the messages in the same order. Delivery is asynchronous, so there is no guarantee about when a message will be delivered.

So, seen from a client's perspective, linearizability is a stronger guarantee than total order broadcast.

RAFT algorithm-

Unlike the simple algorithm earlier, where a node was either a leader or a follower, in RAFT a node can be in one of these 3 states-

  1. Leader
  2. Candidate
  3. Follower

Raft separates its leader election process from log replication.

Raft's leader election algorithm-

  1. Every node starts as a follower.
  2. The current leader sends regular messages to all the followers at fixed intervals; these messages are known as heartbeats. Every follower has a randomized election timeout, typically between 150ms and 300ms. If a follower does not receive a heartbeat from the leader before its election timeout expires, it becomes a candidate.
  3. Every election has a term number. When a node becomes a candidate, it increments the current term number, starts an election for that term, votes for itself, and sends vote-request messages to all the other nodes (followers).
  4. Unlike our simple algorithm, the vote request contains the election's term number and information about the candidate's log. If the candidate requesting the vote has a less up-to-date log than the follower, the follower does not vote for it.
  5. When the candidate receives a majority of the votes (not necessarily all of them), it sends heartbeat messages to establish its authority. After receiving the heartbeats, all the followers accept the candidate as the leader.

There may be cases where more than one candidate exists at the same time, which could result in multiple leaders, a scenario we must always avoid. To avoid this, the term number comes into play. If a candidate gets a majority and becomes the leader, it sends heartbeat messages, which are of course also received by the other candidates. Each of those candidates compares its term number with the leader's: if its term is lower or the same, it accepts defeat and becomes a follower; if its term number is greater than the leader's, it continues its own election process (it becomes a rebel).

So for any given term there can be at most one leader; if a candidate wins an election for a greater term number, the previous term's leader becomes a follower.
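Here is a compressed sketch of these election rules for a single node. This is illustrative Java, not a real RAFT implementation: networking, persistence, and the actual log are omitted, and all names are made up for the example.

```java
import java.util.concurrent.ThreadLocalRandom;

// Highly simplified sketch of RAFT's election rules on one node.
public class RaftNode {
    enum State { FOLLOWER, CANDIDATE, LEADER }

    private State state = State.FOLLOWER;       // every node starts as a follower
    private int currentTerm = 0;
    private Integer votedFor = null;             // who we voted for in currentTerm, if anyone
    private long lastHeartbeat = System.currentTimeMillis();
    private final long electionTimeoutMs =
            ThreadLocalRandom.current().nextLong(150, 301);  // randomized 150-300 ms

    // Called periodically: if no heartbeat arrived within the timeout, stand for election.
    void checkElectionTimeout() {
        if (state != State.LEADER
                && System.currentTimeMillis() - lastHeartbeat > electionTimeoutMs) {
            state = State.CANDIDATE;
            currentTerm++;               // start a new election with an incremented term
            votedFor = selfId();         // vote for ourselves
            // ...then send vote-request messages (with our term and log info) to all peers.
        }
    }

    // A follower grants its vote only if the candidate's term and log are at least as up to date.
    boolean onRequestVote(int candidateTerm, int candidateId,
                          int candidateLastLogTerm, int candidateLastLogIndex) {
        if (candidateTerm > currentTerm) {       // newer term: forget any earlier vote
            currentTerm = candidateTerm;
            votedFor = null;
            state = State.FOLLOWER;
        }
        boolean logOk = candidateLastLogTerm > lastLogTerm()
                || (candidateLastLogTerm == lastLogTerm()
                    && candidateLastLogIndex >= lastLogIndex());
        if (candidateTerm == currentTerm && logOk
                && (votedFor == null || votedFor == candidateId)) {
            votedFor = candidateId;
            return true;
        }
        return false;
    }

    // On receiving a heartbeat: a candidate whose term is lower or equal steps down;
    // a candidate with a greater term ignores it and keeps campaigning.
    void onHeartbeat(int leaderTerm) {
        if (leaderTerm >= currentTerm) {
            currentTerm = leaderTerm;
            state = State.FOLLOWER;
            lastHeartbeat = System.currentTimeMillis();
        }
    }

    // Placeholders for state this sketch does not model.
    private int selfId() { return 1; }
    private int lastLogTerm() { return 0; }
    private int lastLogIndex() { return 0; }
}
```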

Raft’s Log replication algorithm-

RAFT follows the principle of linearizability for log replication.

  1. Each write or change is redirected to the leader.
  2. The leader follows a two-phase commit: in the first phase, it writes the change to its own log and leaves it uncommitted.
  3. It then replicates the entry to all the followers and waits for a majority of the followers to write the entry to their logs.
  4. After receiving responses from a majority of the followers, the leader commits the entry in the second phase and then asks all the followers to commit it as well.
  5. The leader in RAFT ensures that every follower's log is the same as its own; if a follower has fewer or extra entries, the leader forcefully synchronizes that follower's log with its own. The follower then commits the change.
  6. RAFT thus implements linearizability, and the cluster comes to a consensus about the system state (a sketch of the leader's side of this flow follows).
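Here is a small sketch of the leader's side of this two-phase flow. The Follower interface and every name here are hypothetical placeholders; a real implementation would retry slow followers, carry log indices and terms in its messages, and force divergent follower logs to match, as described in point 5.

```java
import java.util.ArrayList;
import java.util.List;

// Highly simplified sketch of RAFT-style two-phase log replication on the leader.
public class RaftLeaderLog {
    interface Follower {
        boolean appendEntry(int term, String entry); // true if the follower wrote it to its log
        void commit(int index);                      // ask the follower to commit up to this index
    }

    private final List<String> log = new ArrayList<>();
    private int commitIndex = -1;
    private final int currentTerm;
    private final List<Follower> followers;

    public RaftLeaderLog(int currentTerm, List<Follower> followers) {
        this.currentTerm = currentTerm;
        this.followers = followers;
    }

    // Phase 1: append the change locally (uncommitted) and replicate it to the followers.
    // Phase 2: once a majority has written it, commit locally and tell the followers to commit.
    public boolean write(String entry) {
        log.add(entry);                        // uncommitted entry in the leader's own log
        int index = log.size() - 1;

        int acks = 1;                          // the leader's own copy counts towards the majority
        for (Follower f : followers) {
            if (f.appendEntry(currentTerm, entry)) {
                acks++;
            }
        }

        int clusterSize = followers.size() + 1;
        if (acks > clusterSize / 2) {          // majority reached
            commitIndex = index;               // the leader commits the entry
            for (Follower f : followers) {
                f.commit(index);               // and asks the followers to commit it as well
            }
            return true;
        }
        return false;                          // no majority; the entry stays uncommitted
    }
}
```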

ZAB Algorithm-

ZAB is very similar to RAFT, hence I will just compare it with RAFT and point out some minor differences.

Similar to RAFT, ZAB also separates its leader election process from the log replication process.

ZAB's leader election algorithm is very similar to RAFT's; the differences are mostly terminological. For example, the role of RAFT's term number is played by the epoch, which is encoded in the high bits of ZAB's transaction id (zxid), and the protocol nomenclature is different, but the core logic is the same.

ZAB's log replication also uses two-phase commits like RAFT, but unlike RAFT, which implements linearizability, ZAB implements total order broadcast. So its logs only require append-only capability.

There is a special edge case scenario- If the leader fails in the commit phase and at least one of the followers has committed the change, then this follower becomes the leader. Then after being selected as the leader, it lets all other followers commit the change.

Uses-
RAFT- CoreOS (etcd), Kubernetes (etcd), etc. (etcd uses RAFT as its consensus algorithm)
ZAB- Apache Kafka, Apache BookKeeper, Apache HDFS, etc. (they all use Zookeeper, where ZAB is implemented)

PS-
RAFT is generally considered simpler and easier to implement than ZAB, yet prominent projects like Zookeeper use ZAB, because RAFT had not yet been published when Zookeeper was being developed. Zookeeper therefore built its own consensus protocol, ZAB, and that's how ZAB originated.

I hope this blog gave you a general overview of Zookeeper, its internals, and leader election and log replication algorithms.
