jepsen.db.watchdog

Databases often like to crash, and they may not restart themselves automatically, which means we have to do it for them. This creates the possibility of all kinds of race conditions. This namespace provides a watchdog which runs in a thread for the duration of the test, and a DB wrapper for unreliable DBs.

Watchdogs know whether the server they supervise is running. They can be enabled or disabled, which determines whether they restart the server or not. They can be killed once, which destroys their thread. They can also be locked, during which the watchdog takes no actions.
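The states described above can be sketched as a small decision function. This is a toy model written for illustration, not the namespace's implementation: the `toy-step` name, the `:enabled?`, `:locked?`, and `:killed?` keys, and the returned keywords are all hypothetical.

```clojure
;; Toy model only: NOT how jepsen.db.watchdog is implemented.
;; The state keys and the returned keywords are hypothetical.
(defn toy-step
  "Given a watchdog state map and whether the server is currently
  running, returns the action a watchdog would take on this step."
  [{:keys [enabled? locked? killed?]} running?]
  (cond
    killed?        :exit      ; the watchdog thread is gone
    locked?        :wait      ; locked: take no action
    (not enabled?) :idle      ; disabled: never restart the server
    running?       :idle      ; server is up: nothing to do
    :else          :restart)) ; enabled, unlocked, server down

(toy-step {:enabled? true, :locked? false, :killed? false} false) ; => :restart
```

The ordering of the clauses reflects the description: a lock suppresses every action, and a disabled watchdog observes the server but never restarts it.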

db

(db opts db)

Wraps an existing database in one with watchdogs for each node. Takes a map of partial options to watchdog (just :running? and :interval), and a DB to wrap. Uses db/start! to start the database. Ensures that db/kill! and db/start! disable and enable the watchdog.

The DB you wrap must implement the full suite of DB protocols (LogFiles, Primary, etc.). This is a little awkward, but feels preferable to having a zillion variants of the DB class here. You can generally return nil from functions such as log-files and primaries, and Jepsen will do sensible things.

disable!

(disable! w)

Informs the watchdog that it should not restart the daemon. Returns once we can guarantee the watchdog won’t restart it.

enable!

(enable! w)

Informs the watchdog that it should restart the daemon. Returns immediately.

kill!

(kill! w)

Kills the watchdog, terminating its thread. Blocks until complete.

locking

macro

(locking watchdog & body)

Locks a Watchdog for the duration of body. Use this around any of your code that starts/stops the server, to avoid race conditions where you e.g. kill the server and the watchdog immediately restarts it.
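The race this macro guards against can be illustrated with a self-contained toy. The sketch below does not use this namespace at all: it relies on Clojure's built-in `clojure.core/locking` on a plain object, and `lock`, `server-up?`, `restarts`, and `toy-watchdog-step!` are hypothetical stand-ins for a watchdog and its server.

```clojure
;; Toy illustration using clojure.core/locking, NOT jepsen.db.watchdog/locking.
(def lock (Object.))          ; stands in for the watchdog's lock
(def server-up? (atom true))  ; stands in for the real server's state
(def restarts (atom 0))       ; counts restarts performed by the toy watchdog

(defn toy-watchdog-step!
  "Restarts the 'server' if it is down, but only while holding the lock."
  []
  (locking lock
    (when-not @server-up?
      (swap! restarts inc)
      (reset! server-up? true))))

(def observed-while-locked
  (locking lock
    (reset! server-up? false)               ; "kill" the server
    ;; A concurrent watchdog step blocks until we release the lock.
    (let [t (doto (Thread. ^Runnable toy-watchdog-step!) (.start))]
      (Thread/sleep 100)                    ; give the step time to try
      {:thread t, :server-up? @server-up?, :restarts @restarts})))

;; Once the lock is released, the pending step runs and restarts the server.
(.join ^Thread (:thread observed-while-locked))
```

While the lock is held, the toy watchdog cannot act, so the server stays down and no restart is recorded in `observed-while-locked`; the restart happens only after the body finishes.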

run!

(run! w test node)

Starts the main loop, calling step! repeatedly. Returns the watchdog immediately.

step!

(step! w test node)

The main step of the watchdog. Periodically invoked by the runner.

watchdog

(watchdog {:keys [running? start!], :as opts})

Creates a new Watchdog for a single node. Takes an options map with:

{:running?  A function (running? test node) which returns true iff the
            server is running.
 :start!    A function (start! test node) which starts the server.
 :interval  Time, in ms, between checking to restart the server.
            Default 1000.}

Both running? and start! are evaluated with a jepsen.control connection bound to the given node.
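As a rough sketch of the shape of this options map, the example below builds one against an in-memory stand-in for a server. The `up-nodes` atom and the bodies of both functions are hypothetical; real functions would typically inspect and start a process on the node over the bound jepsen.control connection.

```clojure
;; Hypothetical stand-in for real servers: the set of nodes that are "up".
(def up-nodes (atom #{}))

(def watchdog-opts
  {;; (running? test node): true iff the server on node is running
   :running? (fn [_test node] (contains? @up-nodes node))
   ;; (start! test node): starts the server on node
   :start!   (fn [_test node] (swap! up-nodes conj node) :started)
   ;; check every 500 ms instead of the default 1000
   :interval 500})

;; Exercising the functions directly, without a watchdog:
((:running? watchdog-opts) {} "n1") ; => false
((:start! watchdog-opts) {} "n1")
((:running? watchdog-opts) {} "n1") ; => true
```

With Jepsen on the classpath, a map of this shape is what `(watchdog watchdog-opts)` expects, per the signature above.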