jepsen.nemesis.membership

EXPERIMENTAL: provides standardized support for nemeses which add and remove nodes from a cluster.

This is a tricky problem. Even the concept of cluster state is complicated: there is Jepsen’s knowledge of the state, and each individual node’s understanding of the current state. Depending on which node you ask, you may get more or less recent (or, frequently, divergent) views of cluster state. Cluster state representation is highly variable across databases, which means our standardized state machine must allow for that variability.

We are guided by some principles that crop up repeatedly in writing these sorts of nemeses:

We should avoid leaving the cluster in a useless state (e.g. one that can’t fulfill any requests) for very long.
There are both safe and unsafe transitions. In general, commands like join/remove should always be safe. Removing data, however, is unsafe unless we can prove the node has been properly removed.
We want to leave nodes running, with data files intact, after removing them. This is when interesting things happen.
We must be safe in the presence of concurrent node kill/restart operations.
Nodes tend to go down or fail to reach the rest of the cluster, but we want to continue making decisions during this time.
Requested changes to the cluster may time out, or simply take a while to perform. We need to remember these ongoing operations, use them to constrain our choices of further changes (e.g. if four node removals are underway, don’t initiate a fifth), and find ways to resolve those ongoing changes, e.g. by confirming they took place.
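
As a rough illustration of using ongoing operations to constrain further changes, the sketch below treats nodes with a pending removal as already gone and refuses to start another removal once only a bare majority would remain. The function name, its arguments, and the majority rule are all assumptions made for this example; they are not part of this namespace.

```clojure
;; Hypothetical helper: the name, arguments, and majority rule are assumptions.
(defn removable-nodes
  "Returns the nodes we could start removing right now. Nodes whose removal
  is already pending count as gone, so we never stack up more removals than
  the cluster can survive."
  [members pending-removals original-size]
  (let [remaining (remove (set pending-removals) members)
        majority  (inc (quot original-size 2))]
    ;; Only offer candidates if removing one more still leaves a majority.
    (if (> (count remaining) majority)
      (vec remaining)
      [])))
```

With five members and two removals already underway, this returns an empty vector, so no third removal is proposed until one of the pending ones resolves.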

Our general approach is to define a sort of state machine whose state comprises our representation of the cluster state, each node’s view of the cluster, and the set of ongoing operations, plus any auxiliary material (e.g. the knowledge that a node removal has completed, after which we can delete its data files). This state is updated periodically by querying individual nodes, and also whenever we perform an operation, e.g. initiating a node removal.
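
To make the shape of such a state concrete, here is a minimal sketch using a plain map and pure functions. The keys (:view, :node-views, :pending) and the helper names are assumptions chosen for illustration; the real state is a record satisfying the State protocol, and its details will differ per database.

```clojure
;; Illustrative only: these keys and helpers are assumptions for this sketch,
;; not the library's actual State record or protocol.
(def example-state
  {:view       #{"n1" "n2" "n3"}          ; our merged picture of membership
   :node-views {"n1" #{"n1" "n2" "n3"}    ; what each node last reported
                "n2" #{"n1" "n2"}}
   :pending    #{}})                      ; operations not yet resolved

(defn record-node-view
  "Stores the membership set a node just reported."
  [state node view]
  (assoc-in state [:node-views node] view))

(defn begin-op
  "Remembers an operation we have initiated but not yet confirmed."
  [state op]
  (update state :pending conj op))

(defn resolve-op
  "Drops a pending removal once every known node view excludes its target.
  Any other operation, or an unconfirmed removal, leaves the state unchanged."
  [state {:keys [f value] :as op}]
  (if (and (= f :remove)
           (every? #(not (contains? % value)) (vals (:node-views state))))
    (-> state
        (update :pending disj op)
        (update :view disj value))
    state))
```

Keeping these as pure functions over a value makes it straightforward to hold the state in an atom and apply updates from both node polling and operation handling.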

The generator constructs those operations by asking the nemesis what sorts of operations would be legal to perform at this time, and picking one of those. It then passes that operation back to the nemesis (via nemesis/invoke!), and the nemesis updates its local state and performs the operation.
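
The round trip described above can be sketched with plain functions and an atom. Everything here (legal-ops, apply-op, step!, and the :members/:spares keys) is a hypothetical stand-in; in the real namespace the generator and nemesis play these roles, and an operation would normally be tracked as pending until confirmed rather than applied immediately.

```clojure
;; Stand-ins for illustration; real code goes through the generator and nemesis.
(defn legal-ops
  "Asks which operations make sense for the current state."
  [{:keys [members spares]}]
  (concat (for [n spares] {:f :join, :value n})
          ;; Never propose removing the last remaining member.
          (when (< 1 (count members))
            (for [n members] {:f :remove, :value n}))))

(defn apply-op
  "Updates local state to reflect an operation we have just initiated."
  [state {:keys [f value]}]
  (case f
    :join   (-> state (update :members conj value) (update :spares disj value))
    :remove (-> state (update :members disj value) (update :spares conj value))))

(defn step!
  "One round trip: pick a legal op at random, apply it, and return it.
  Returns nil when no operation is currently legal."
  [state-atom]
  (when-let [ops (seq (legal-ops @state-atom))]
    (let [op (rand-nth (vec ops))]
      (swap! state-atom apply-op op)
      op)))
```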

initial-state

(initial-state test)

Constructs an initial cluster state map for the given test.

node-view-future

(node-view-future test state running? opts node)

Spawns a future which keeps the given state atom updated with our view of this node.
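
The general pattern is a polling loop running in a future. The sketch below is not this function's implementation: poll-node-view, its fetch-view argument, the :node-views key, and the assumption that running? is an atom holding a boolean are all invented for illustration.

```clojure
(defn poll-node-view
  "Hypothetical stand-in: starts a future that calls (fetch-view node) every
  interval-ms milliseconds while @running? is true, storing each result in
  the state atom under [:node-views node]."
  [state running? interval-ms fetch-view node]
  (future
    (while @running?
      (swap! state assoc-in [:node-views node] (fetch-view node))
      (Thread/sleep (long interval-ms)))))
```

Resetting the running? atom to false lets the loop exit after its current sleep, so the future completes on its own without being cancelled.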

node-view-interval

The number of seconds to wait between node view updates.

package

(package opts)

Constructs a nemesis and generator for membership operations. Options are a map like

{:faults #{:membership …} :membership membership-opts}.

Membership opts are:

{:state           A record satisfying the State protocol
 :log-resolve-op? Whether to log the resolution of operations
 :log-resolve?    Whether to log each resolve step
 :log-node-views? Whether to log changing node views
 :log-view?       Whether to log the entire cluster view}
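
For concreteness, an options map of this shape might look like the following. The boolean values are arbitrary, and ::your-state-record is only a placeholder: a real test must supply a record satisfying the State protocol there.

```clojure
;; Example values only. ::your-state-record is a placeholder, not a usable
;; state; replace it with a record that satisfies the State protocol.
(def membership-opts
  {:state           ::your-state-record
   :log-resolve-op? true
   :log-resolve?    false
   :log-node-views? true
   :log-view?       false})

;; The map that would be handed to (package opts).
(def opts
  {:faults     #{:membership}
   :membership membership-opts})
```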

The package includes a :state field, which is an atom of the current cluster state. You can use this (for example) to have generators which inspect the current cluster state and use it to target faults.
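
As one sketch of what inspecting the state might look like, the function below finds nodes whose own view of the cluster disagrees with the merged view, which could then be chosen as fault targets. The :view and :node-views keys are assumptions about the state's shape, and a plain atom stands in for the package's :state field.

```clojure
(defn divergent-nodes
  "Nodes whose own view of the cluster differs from the merged view.
  The :view and :node-views keys are assumptions about the state's shape."
  [state]
  (let [{:keys [view node-views]} state]
    (sort (for [[node v] node-views
                :when (not= v view)]
            node))))

;; A plain atom standing in for the package's :state field.
(def state-atom
  (atom {:view       #{"n1" "n2" "n3"}
         :node-views {"n1" #{"n1" "n2" "n3"}
                      "n2" #{"n1" "n2"}
                      "n3" #{"n1" "n2" "n3"}}}))

;; Only n2 has a view that differs from the merged one.
(divergent-nodes @state-atom)
```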
