jepsen.core

Entry point for all Jepsen tests. Coordinates the setup of servers, running tests, creating and resolving failures, and interpreting results.

Jepsen tests a system by running a set of single-threaded processes, each representing a single client in the system, and a special nemesis process, which induces failures across the cluster. Processes choose operations to perform based on a generator. Each process uses a client to apply the operation to the distributed system, and records the invocation and completion of that operation in the history for the test. When the test is complete, a checker analyzes the history to see if it made sense.
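As a rough illustration of the invocation/completion pairs described above, the following toy sketch builds a small history by hand and scans it for invocations that never completed. The op keys (:process, :type, :f, :value) follow the usual Jepsen convention; the `unmatched-invocations` function is a hypothetical helper written for this example, not part of jepsen.core or any Jepsen checker.

```clojure
;; Toy history: each operation appears once as an :invoke and once as a
;; completion (:ok here), tagged with the process that performed it.
(def history
  [{:process 0, :type :invoke, :f :write, :value 1}
   {:process 1, :type :invoke, :f :read,  :value nil}
   {:process 0, :type :ok,     :f :write, :value 1}
   {:process 1, :type :ok,     :f :read,  :value 1}])

(defn unmatched-invocations
  "Hypothetical helper: returns the set of processes whose most recent
  invocation has no completion in the history."
  [history]
  (reduce (fn [pending {:keys [process type]}]
            (if (= type :invoke)
              (conj pending process)    ; process now has an open operation
              (disj pending process)))  ; any completion closes it
          #{}
          history))

(unmatched-invocations history) ; => #{}
```

A real checker does far more than this (for example, verifying that reads are consistent with writes), but the shape of the input is the same: a flat sequence of op maps.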

Jepsen automates the setup and teardown of the environment and distributed system by using an OS and a DB respectively. See run! for details.

analyze!

(analyze! test)

After running the test and obtaining a history, we perform some post-processing on the history, run the checker, and write the test to disk again. Takes a test map. Returns a new test with results.

conj-op!

(conj-op! test op)

Adds an operation to a test’s history, and returns the operation.

log-results

(log-results test)

Logs info about the results of a test to stdout, and returns test.

log-test-start!

(log-test-start! test)

Logs some basic information at the start of a test: the Git version of the working directory, the lein arguments to re-run the test, etc.

maybe-snarf-logs!

(maybe-snarf-logs! test)

Snarfs logs, swallowing and logging all throwables. Why? Because we do this when we encounter an error and abort, and we don’t want an error here to supersede the root cause that made us abort.
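The swallow-and-log pattern can be sketched in plain Clojure as follows. Both function names are hypothetical and exist only for this illustration; the sketch shows why a failing cleanup step must not replace the exception that triggered the abort.

```clojure
(defn swallow-and-log
  "Hypothetical helper: calls (f) and returns its value. Any Throwable is
  printed to stderr and swallowed, and nil is returned instead."
  [f]
  (try
    (f)
    (catch Throwable t
      (binding [*out* *err*]
        (println "Error during cleanup, ignoring:" (.getMessage t)))
      nil)))

(defn run-with-cleanup
  "Calls (body). If it throws, attempts (cleanup) without letting a cleanup
  failure replace the original exception, then rethrows the original."
  [body cleanup]
  (try
    (body)
    (catch Throwable root-cause
      (swallow-and-log cleanup)
      (throw root-cause))))
```

With this structure, a caller who catches the exception from `run-with-cleanup` always sees the root cause, even when the cleanup function itself fails.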

prepare-test

(prepare-test test)

Takes a test and prepares it for running. Ensures it has a :start-time, :concurrency, and :barrier field. Wraps its generator in a forgettable reference, to prevent us from inadvertently retaining the head.

This operation always succeeds, and is necessary for accessing a test’s store directory, which depends on :start-time. You may call this yourself before calling run!, if you need access to the store directory outside the run! context.

primary

(primary test)

Given a test, returns the primary node.

run!

(run! test)

Runs a test. Tests are maps containing:

  • :nodes A sequence of string node names involved in the test
  • :concurrency (optional) How many processes to run concurrently
  • :ssh SSH credential information: a map containing…
    • :username The username to connect with (root)
    • :password The password to use
    • :sudo-password The password to use for sudo, if needed
    • :port SSH listening port (22)
    • :private-key-path A path to an SSH identity file (~/.ssh/id_rsa)
    • :strict-host-key-checking Whether or not to verify host keys
  • :logging Logging options; see jepsen.store/start-logging!
  • :os The operating system; given by the OS protocol
  • :db The database to configure: given by the DB protocol
  • :remote The remote to use for control actions. Try, for example, (jepsen.control.sshj/remote).
  • :client A client for the database
  • :nemesis A client for failures
  • :generator A generator of operations to apply to the DB
  • :checker Verifies that the history is valid
  • :log-files A list of paths to logfiles/dirs which should be captured at the end of the test.
  • :nonserializable-keys A collection of top-level keys in the test which shouldn’t be serialized to disk.
  • :leave-db-running? Whether to leave the DB running at the end of the test.

Jepsen automatically adds some additional keys during the run:

  • :start-time When the test began
  • :history The operations the clients and nemesis performed
  • :results The results from the checker, once the test is completed

In addition, tests have some fields added by Jepsen which are present during their execution, but not persisted:

  • :barrier A CyclicBarrier, mainly used for synchronizing DB setup
  • :store State used for reading and writing data to and from disk
  • :sessions Connected sessions used by jepsen.control to talk to nodes
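For orientation, here is a sketch of the plain-data portion of a test map, using only keys documented above. The node names, credentials, and log path are made-up placeholders. The protocol-backed keys (:os, :db, :client, :nemesis, :generator, :checker) are deliberately left out, since they require implementations from other Jepsen namespaces, so this fragment alone is not a runnable test.

```clojure
;; Plain-data options for a test. All values below are hypothetical.
(def test-opts
  {:nodes             ["n1" "n2" "n3"]      ; string node names
   :concurrency       6                     ; optional: number of processes
   :ssh               {:username "root"
                       :password "root"
                       :port 22
                       :private-key-path "~/.ssh/id_rsa"
                       :strict-host-key-checking false}
   :log-files         ["/var/log/mydb.log"] ; captured at the end of the test
   :leave-db-running? false})

;; Keys such as :start-time, :history, and :results are not supplied here:
;; per the description above, Jepsen adds them during the run.
(:nodes test-opts) ; => ["n1" "n2" "n3"]
```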

Tests proceed like so:

  1. Set up the operating system
  2. Try to tear down, then set up the database
    • If the DB supports the Primary protocol, also perform the Primary setup on the first node.
  3. Create the nemesis
  4. Fork the client into one client for each node
  5. Fork a thread for each client, each of which requests operations from the generator until the generator returns nil
    • Each operation is appended to the operation history
    • The client executes the operation and returns a vector of history elements, which are appended to the operation history
  6. Capture log files
  7. Tear down the database
  8. Tear down the operating system
  9. When the generator is finished, invoke the checker with the history
    • This generates the final report
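The nesting of setup and teardown in the steps above can be pictured with a small plain-Clojure sketch. It does not call Jepsen; `with-phase` and `run-sketch` are hypothetical names, and the sketch only records the order in which phases would start and stop, assuming teardown runs in a `finally` block.

```clojure
(def events (atom []))

(defn with-phase
  "Hypothetical helper: records [phase :setup], calls (body), and always
  records [phase :teardown], even if body throws."
  [phase body]
  (swap! events conj [phase :setup])
  (try
    (body)
    (finally
      (swap! events conj [phase :teardown]))))

(defn run-sketch
  "OS wraps DB, which wraps the client/nemesis run. Returns the body's value."
  []
  (with-phase :os
    (fn []
      (with-phase :db
        (fn []
          (swap! events conj [:clients :run])
          :history)))))

(run-sketch) ; => :history
@events
;; => [[:os :setup] [:db :setup] [:clients :run] [:db :teardown] [:os :teardown]]
```

Teardown happens in the reverse order of setup, which matches the ordering of the database and operating system steps in the list.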

run-case!

(run-case! test)

Takes a test with a store handle. Spawns nemesis and clients and runs the generator. Returns test with no :generator and a completed :history.

snarf-logs!

(snarf-logs! test)

Downloads logs for a test. Updates symlinks.

synchronize

(synchronize test)
(synchronize test timeout-s)

A synchronization primitive for tests. When invoked, blocks until all nodes have arrived at the same point.

This is often used in IO-heavy DB setup code to ensure all nodes have completed some phase of execution before moving on to the next. However, if an exception is thrown by one of those threads, the call to synchronize will deadlock! To avoid this, we include a default timeout of 60 seconds, which can be overridden by passing an alternate timeout in seconds.
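To make the timeout behavior concrete, here is a minimal sketch built directly on java.util.concurrent.CyclicBarrier, the same class the :barrier field holds. The `await-barrier` function is a hypothetical stand-in written for this example, not the implementation of synchronize, and it returns keywords rather than throwing so the outcome is easy to inspect.

```clojure
(import '(java.util.concurrent CyclicBarrier TimeUnit TimeoutException))

(defn await-barrier
  "Hypothetical helper: blocks until every party has reached the barrier,
  or until timeout-s seconds elapse. Returns :ok or :timeout."
  [^CyclicBarrier barrier timeout-s]
  (try
    (.await barrier (long timeout-s) TimeUnit/SECONDS)
    :ok
    (catch TimeoutException _
      :timeout)))

;; Three parties, three arrivals: everyone is released together.
(let [barrier (CyclicBarrier. 3)
      workers (mapv (fn [_] (future (await-barrier barrier 5))) (range 3))]
  (mapv deref workers)) ; => [:ok :ok :ok]

;; Two parties expected, but only one arrives (as if the other thread had
;; thrown during setup). Without a timeout this call would block forever.
(await-barrier (CyclicBarrier. 2) 1) ; => :timeout
```

Note that once one waiter times out, the barrier is broken and any other threads still waiting on it receive a BrokenBarrierException, which this sketch does not handle.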

with-client+nemesis-setup-teardown

macro

(with-client+nemesis-setup-teardown [test-sym test] & body)

Takes a binding vector of a test symbol and a test map. Sets up clients and nemesis, and rebinds (:nemesis test) to the set-up nemesis. Evaluates body. Afterwards, ensures clients and nemesis are torn down.

with-db

macro

(with-db test & body)

Wraps body in DB setup and teardown.

with-log-snarfing

macro

(with-log-snarfing test & body)

Evaluates body and ensures logs are snarfed afterwards. Will also download logs in the event of JVM shutdown, so you can ctrl-c a test and get something useful.

with-logging

macro

(with-logging test & body)

Sets up logging for this test run, logs the start of the test, evaluates body, and stops logging at the end. Also logs test crashes, so they appear in the log files for this test run.

with-os

macro

(with-os test & body)

Wraps body in OS setup and teardown.

with-resources

macro

(with-resources [sym start stop resources] & body)

Takes a four-part binding vector: a symbol to bind resources to, a function to start a resource, a function to stop a resource, and a sequence of resources. Then takes a body. Starts resources in parallel, evaluates body, and ensures all resources are correctly closed in the event of an error.
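The behavior described above can be approximated with an ordinary function instead of a macro. This is a sketch under stated assumptions, not the macro's actual expansion: `call-with-resources` is a hypothetical name, resources are started on futures, and every resource that did start is stopped in a `finally` block whether or not the body (or another start) failed.

```clojure
(defn call-with-resources
  "Hypothetical function version of the pattern: starts each resource in
  parallel, calls (body started-resources), and always stops whatever
  was successfully started."
  [start stop resources body]
  (let [started (atom [])
        ;; One future per resource; each records its result once started.
        futures (mapv (fn [r]
                        (future
                          (let [s (start r)]
                            (swap! started conj s)
                            s)))
                      resources)]
    (try
      ;; deref rethrows (wrapped) if any start failed.
      (body (mapv deref futures))
      (finally
        ;; Wait for every start attempt to finish before cleaning up.
        (doseq [f futures]
          (try (deref f) (catch Throwable _ nil)))
        ;; Stop each started resource; one failing stop must not block the rest.
        (doseq [s @started]
          (try (stop s) (catch Throwable _ nil)))))))

;; Usage with stand-in resources (plain strings):
(call-with-resources
  (fn [r] (str "open-" r))          ; start
  (fn [s] (println "closing" s))    ; stop
  [1 2 3]
  (fn [opened] (count opened)))     ; => 3
```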

with-sessions

macro

(with-sessions [test' test] & body)

Takes a [test' test] binding form and a body. Starts with test as the test map, and sets up the jepsen.control state required to run this test: the remote, SSH options, etc. Opens SSH sessions to each node. Saves those sessions in the :sessions map of the test, binds that to the test' symbol in the binding expression, and evaluates body.