jepsen.nemesis.file

Fault injection involving files on disk. This nemesis can copy chunks randomly within a file, induce bitflips, or snapshot and restore chunks.

clear-snapshots!

(clear-snapshots!)

Asks corrupt-file! to clear its snapshots directory on the currently bound remote node.

corrupt-file!

(corrupt-file! {:keys [mode file start end chunk-size mod i probability]})

Calls corrupt-file on the currently bound remote node, taking a map of options.

corrupt-file-nemesis

(corrupt-file-nemesis)
(corrupt-file-nemesis default-opts)

This nemesis takes operations like

{:f :copy-file-chunks :value [{:node "n2" :file "/foo/bar" :start 128 :end 256 :chunk-size 16 :mod 5 :i 2} …]}

This corrupts the file /foo/bar on n2 in the byte range [128, 256), dividing that region into 16-byte chunks and corrupting every fifth chunk, starting with (zero-indexed) chunk 2: chunks 2, 7, 12, 17, …, for as many as fit in the region. Data is copied from other chunks in the region which this command leaves untouched, unless mod is 1, in which case every chunk is a target and source chunks are selected at random. The idea is to produce valid-looking structures which later pointers might dereference.
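To make the chunk arithmetic concrete, here is a minimal sketch of which chunks such an operation would target. It assumes that chunk-size is in bytes and that chunks are numbered from :start; the authoritative logic lives in resources/corrupt-file.c, and affected-chunks is a hypothetical helper, not part of jepsen.nemesis.file.

```clojure
(defn affected-chunks
  "Returns [chunk-index first-byte end-byte] triples (end exclusive) for the
  chunks an operation targets. Assumes chunks are numbered from :start."
  [{:keys [start end chunk-size i] m :mod}]
  (let [region-size (- end start)
        ;; Round up so a trailing partial chunk still counts.
        chunk-count (quot (+ region-size (dec chunk-size)) chunk-size)]
    (for [chunk (range chunk-count)
          :when (= i (rem chunk m))]
      [chunk
       (+ start (* chunk chunk-size))
       (min end (+ start (* (inc chunk) chunk-size)))])))

(affected-chunks {:start 128 :end 256 :chunk-size 16 :mod 5 :i 2})
;; => ([2 160 176] [7 240 256])
```

Under these assumptions the 128-byte region holds eight chunks, so only chunks 2 and 7 are hit; chunks 12, 17, and so on would only appear in a larger region.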

{:f :bitflip-file-chunks :value [{:node "n3" :file "/foo/bar" :start 512 :end 1024 :probability 1e-3}]}

This flips roughly one in a thousand bits in the region of /foo/bar between bytes 512 and 1024. The mod, i, and chunk-size settings work the same way as for :copy-file-chunks.
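As a rough sanity check on how much damage a given probability does, the expected number of flipped bits can be estimated as below. This is a back-of-the-envelope sketch which assumes every byte in the region is eligible (no chunks are skipped by mod and i); expected-bitflips is a hypothetical helper.

```clojure
(defn expected-bitflips
  "Expected number of flipped bits: 8 bits per byte, times the region size
  in bytes, times the per-bit flip probability."
  [{:keys [start end probability]}]
  (* 8 (- end start) probability))

;; 512 bytes = 4096 bits, so a probability of 1e-3 flips about 4 bits.
(expected-bitflips {:start 512 :end 1024 :probability 1e-3})
```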

{:f :snapshot-file-chunks :value [{:node "n2" :file "/foo/bar" :start 128 :end 256 :chunk-size 16 :mod 5 :i 2} …]}

:snapshot-file-chunks uses the same start/end/chunk logic as :copy-file-chunks. However, instead of corrupting chunks, it copies them to files in /tmp. These snapshots can be restored by a corresponding :restore-file-chunks operation:

{:f :restore-file-chunks :value [{:node "n2" :file "/foo/bar" :start 128 :end 256 :chunk-size 16 :mod 5 :i 2} …]}

This uses the same start/end/chunk logic, but copies data from the most recently snapshotted chunks back into the file itself. You can use this to force what looks like a rollback of parts of a file’s state, for instance to simulate a misdirected or lost write.
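A snapshot and its restore need to describe the same region for the rollback to line up. One way to keep them in sync is to build both operations from a single corruption map, as sketched here. rollback-ops is a hypothetical helper, and the sketch assumes :value holds a vector of per-node corruption maps.

```clojure
(def region
  {:node "n2" :file "/foo/bar" :start 128 :end 256 :chunk-size 16 :mod 5 :i 2})

(defn rollback-ops
  "Hypothetical helper: returns a snapshot op and a restore op covering the
  same chunks. Assumes :value is a vector of per-node corruption maps."
  [corruption]
  [{:type :info, :f :snapshot-file-chunks, :value [corruption]}
   {:type :info, :f :restore-file-chunks,  :value [corruption]}])

(map :f (rollback-ops region))
;; => (:snapshot-file-chunks :restore-file-chunks)
```

Any writes the database makes to those chunks between the two operations would then appear to be lost once the restore runs.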

All options are optional, except for :node and :file. See resources/corrupt-file.c for defaults.

This function can take an optional map with defaults for each file-corruption operation.
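The effect of a defaults map can be pictured as a merge, sketched below. This assumes that values given in an individual operation take precedence over the defaults; with-defaults is a hypothetical stand-in for whatever the nemesis does internally.

```clojure
(def default-opts {:file "/foo/bar" :chunk-size 16})

(defn with-defaults
  "Fills in missing corruption options from defaults. Assumes options given
  in the operation itself win over the defaults."
  [defaults corruption]
  (merge defaults corruption))

(with-defaults default-opts {:node "n2" :start 128 :end 256})
;; => {:file "/foo/bar", :chunk-size 16, :node "n2", :start 128, :end 256}
```

With defaults like these, generators only need to supply the parts that vary between operations, such as :node or the byte range.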

f->mode

(f->mode f)

Turns an op :f into a mode for the corrupt-file binary.

helix-gen

(helix-gen f-gen)
(helix-gen default-corruption-opts f-gen)

Takes default options for a single file corruption map (see corrupt-file-nemesis), and a generator which produces operations like {:f :bitflip-file-chunks}. Returns a generator which fills in values for those operations, such that in each operation, every node corrupts the file, but they all corrupt different, stable regions.

(helix-gen {:chunk-size 2} (gen/repeat {:type :info, :f :copy-file-chunks}))

When first invoked, selects a random permutation of the nodes in the test, assigning each a distinct index from 0 to n-1. Each value emitted by the generator uses that index as i, and n as the modulus. If the permutation of nodes is n1, n2, n3, and the chunk size is 2 bytes, the affected chunks look like:

node  file bytes
      0123456789abcde …
n1    ╳╳    ╳╳    ╳╳
n2      ╳╳    ╳╳    ╳
n3        ╳╳    ╳╳

This seems exceedingly likely to destroy a cluster, but some systems may survive it. In particular, systems which keep their on-disk representation very close across different nodes may be able to recover from the intact copies on other nodes.
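The helix layout can be reproduced with a little modular arithmetic. The following is a sketch only: it assumes the corrupted region starts at byte 0 and that node k in the permutation takes every chunk whose index is congruent to k modulo the node count. helix-owner and helix-rows are hypothetical names, not part of this namespace.

```clojure
(defn helix-owner
  "Index (within the permutation) of the node which corrupts the byte at
  offset, assuming the region starts at byte 0."
  [node-count chunk-size offset]
  (rem (quot offset chunk-size) node-count))

(defn helix-rows
  "One string per node: X marks bytes that node corrupts, . marks bytes it
  leaves alone."
  [nodes chunk-size byte-count]
  (for [[idx node] (map-indexed vector nodes)]
    (str node " "
         (apply str
                (for [offset (range byte-count)]
                  (if (= idx (helix-owner (count nodes) chunk-size offset))
                    "X"
                    ".")))))))

(run! println (helix-rows ["n1" "n2" "n3"] 2 15))
;; n1 XX....XX....XX.
;; n2 ..XX....XX....X
;; n3 ....XX....XX...
```

Every byte is corrupted on exactly one node, which is why a system that can repair from intact replicas has a chance of surviving.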

nodes-gen

(nodes-gen n f-gen)
(nodes-gen n default-opts f-gen)

Takes a number of nodes n, a map of default file corruption options (see corrupt-file-nemesis), and a generator of operations like {:type :info, :f :snapshot-file-chunks} with nil :values. Returns a generator which fills in values with file corruptions, restricted to at most n nodes over the course of the test. For example, the following corrupts bytes [128, 256) on at most two nodes over the course of the test:

(nodes-gen 2 {:start 128, :end 256} (gen/repeat {:type :info, :f :bitflip-file-chunks}))

n can also be a function (n test), which can be used to select (e.g.) a minority of nodes. For example, try (comp jepsen.util/minority count :nodes).

This generator is intended to stress systems which can tolerate disk faults on up to n nodes, but no more.
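To illustrate passing a function as n, here is a self-contained sketch of a function from a test map to a node limit. minority-count is a local stand-in written for this example (it returns the largest count strictly less than half); it is not claimed to match jepsen.util/minority exactly, and the test map shown contains only the :nodes key.

```clojure
(defn minority-count
  "Local stand-in: the largest number of nodes strictly less than half of
  node-count."
  [node-count]
  (quot (dec node-count) 2))

;; A function of the test map, suitable in shape for the n argument.
(def node-limit (comp minority-count count :nodes))

(node-limit {:nodes ["n1" "n2" "n3" "n4" "n5"]})
;; => 2
```

With five nodes, a limit of two leaves a majority of replicas with intact files, which is the situation a fault-tolerant system is expected to survive.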