jepsen.store.format

Jepsen tests are logically a map. To save this map to disk, we originally wrote it as a single Fressian file. This approach works reasonably well, but has a few problems:

  • We write test files multiple times: once at the end of a test, and once once the analysis is complete–in case the analysis fails. Rewriting the entire file is inefficient. It would be nice to incrementally append new state.

  • Histories are enormous relative to tests, but we force readers to deserialize them before being able to get to any other high-level keys in the test–for instance, the result map.

  • It might be nice, someday, to have histories bigger than fit into memory.

  • We have no way to incrementally write the history, which means if a test crashes during the run we lose everything.

  • Deserializing histories is a linear process, but it would be nice for analyses to be able to parallelize.

  • The web view needs a little metadata quickly: the name, the date, the valid field of the result map. Forcing it to deserialize the entire world to get this information is bad.

  • Likewise, loading tests at the REPL is cumbersome–if all one wants is the results, you should be able to skip the history. Working with the history should ideally be lazy.

I held off on designing a custom serialization format for Jepsen for many years, but at this point the design constraints feel pretty well set, and I think the time is right to design a custom format.

File Format Structure

Jepsen files begin with the magic UTF8 string JEPSEN, followed by a 32-byte big-endian unsigned integer version field, which we use to read old formats when necessary. Then there is a 64-bit offset into the file where the block index–the metadata structure–lives. There follows a series of blocks:

   6            32             64

| “JEPSEN” | version | block-index-offset | block 1 | block 2 | …

In general, files are written by appending blocks sequentially to the end of the file—this allows Jepsen to write files in (mostly) a single pass, without moving large chunks of bytes around. When one is ready to save the file, one writes a new index block to the end of the file which provides the offsets of all the (active) blocks in the file, and finally updates the block-index-offset at the start of the file to point to that most-recent index block.

All integers are signed and big-endian, unless otherwise noted. This is the JVM, after all.

Blocks may be sparse–their lengths may be shorter than the distance to the start of the next block. This is helpful if one needs to rewrite blocks later: you can leave padding for their sizes to change.

The top-level value of the file (e.g. the test map) is given in the block index.

Block Structure

All blocks begin with an 8-byte length prefix which indicates the length of the block in bytes, including the length prefix itself. Then follows a CRC32 checksum. Third, we have a 16-bit block type field, which identifies how to interpret the block. Finally, we have the block’s data, which is type-dependent.

  64        32       16

| length | checksum | type | … data …

Checksums are computed by taking the CRC32 of the data region, THEN the block header: the length, the checksum (all zeroes, for purposes of computing the checksum itself), and the type. We compute checksums this way so that writers can write large blocks of data with an unknown size in a single pass.

Index Blocks (Type 1)

An index block lays out the overall arrangement of the file: it stores a map of logical block numbers to file offsets, and also stores a root id, which identifies the block containing the top-level test map. The root id comes first, and is followed by the block map: a series of pairs, each a 32-bit logical block ID and an offset into the file.

 32      32       64       32      64

root id | id 1 | offset 1 | id 2 | offset 2 | …

There is no block with ID 0: 0 is used as a nil sentinel when one wishes to indicate the absence of a block.

Fressian Blocks (Type 2)

A Fressian block encodes data (often a key-value map) using the Fressian serialization format. This is already the workhorse for Jepsen serialization, but we introduce a twist: large values, like the history and results, can be stored in other blocks. That way you don’t have to deserialize the entire thing in order to read the top-level structure.

We create a special datatype, BlockRef, which we encode as a ‘block-ref’ tag in Fressian. This ref simply contains the ID of the block which encodes that tag’s value.

| fressian data … |

PartialMap (Type 3)

Results are a bit weird. We want to efficiently fetch the :valid? field from them, but the rest of the result map could be enormous. To speed this up, we want to be able to write part of a map (for instance, just the results :valid? field), and store the rest in a different block.

A PartialMap is essentially a cons cell: it comprises a Fressian-encoded map and a pointer to the ID of a rest block (also a PartialMap) which encodes the remainder of the map. This makes access to those parts of the map encoded in the head cell fast.

   32

| rest-ptr | fressian data …

When rest-ptr is 0, that indicates there is no more data remaining.

FressianStream (Type 4)

A FressianStream block allows us to write multiple Fressian-encoded values into a single block. We represent it as:

| fressian data 1 … | fressian data 2 … | …

Writers can write any number of Fressian-encoded values to the stream one after the next. Readers start at the beginning and read values until the block is exhausted. There is no count associated with this block type; it must be inferred by reading all elements. We generally deserialize streams as vectors to enable O(1) access and faster reductions over elements.

BigVector (Type 5)

Histories are chonky boys. 100K operations (each a map) are common, and it’s conceivable we might want to work with histories of tens of millions of operations. We also want to write them incrementally, so that we can recover from crashes. It’s also nice to be able to deserialize small bits of the history, or to reduce over it in parallel. To do this, we need a streaming format for large vectors.

We write each chunk of the vector as a separate block. Then we refer to those chunks with a BigVector, which stores some basic metadata about the vector as a whole, and then pointers to each block. Its format is:

 64        64        32          64         32

| count | index 1 | pointer 1 | index 2 | pointer 2 | …

Count is the number of elements in the vector overall. Index 1 is always 0–the offset of the first element in the first chunk. Pointer 1 is the block ID of the Fressian block which contains the first chunk’s data. Index 2 is the index of the first element in the second chunk, and pointer 2 is the block ID of the second chunk’s data, and so on.

Chunk data can be stored in a Fressian block, a FressianStream block, or another BigVector.

Access to BigVectors looks very much like a regular Clojure vector. We deserialize chunks on-demand, caching results as they’re accessed. We can offer O(1) count through the count field. We implement nth by finding the chunk a given index belongs to and then looking up the index in that chunk. Assoc works by assoc’ing into that particular chunk, leaving other chunks unchanged.

That’s It

There’s a lot of obvious stuff I’ve left out here–metadata, top-level integrity checks, garbage collection, etc etc… but I think we can actually skip almost all of it and get a ton of benefit for the limited use case Jepsen needs.

  1. Write the header.

  2. Write an empty vector as block 1, for the history.

  3. Write the initial test map as a PartialMap block to block 2, pointing to block 1 as the history. Write an index block pointing to 2 as the root.

  4. Write the history incrementally as the test proceeds. Write operations as they occur to a new FressianStream block. Periodically, and at the end of the history:

a. Seal that FressianStream block, writing the headers. Call that block id B. b. Write a new version of the history block with a new chunk appended: B. c. Write a new index block with the new history block version.

This ensures that if we crash during the run, we can recover at least some of the history up to the most recent checkpoint.

  1. Write the results as a PartialMap to blocks 4 and 5: 4 containing the :valid? field, and 5 containing the rest of the results.

  2. The test may contain state which changed by the end of the test, and we might want to save that state. Write the entire test map again as block 6, again using block 1 as the history, and now block 5 as the results map. Write a new index block with block 6 as the root.

To read this file, we:

  1. Check the magic and version.

  2. Read the index block offset.

  3. Read the index block into memory.

  4. Look up the root block ID, use the index to work out its offset, read that block, and decode it into a lazy map structure.

When it comes time to reference the results or history in that lazy map, we look up the right block in the block index, seek to that offset, and decode whatever’s there.

Decoding a block is straightforward. We grab the length header, run a CRC over that region of the file, check the block type, then decode the remaining data based on the block structure.

append-to-big-vector-block!

(append-to-big-vector-block! w element)

Appends an element to a BigVector block writer. This function is asynchronous and returns as soon as the writer’s queue has accepted the element. Close the writer to complete the process. Returns writer.

append-to-fressian-stream-block!

(append-to-fressian-stream-block! writer data)

Takes a FressianStreamBlockWriter and a Clojure value. Appends that value as Fressian to the stream. Returns writer.

assoc-block!

(assoc-block! handle id offset)

Takes a handle, a block ID, and its corresponding offset. Updates the handle’s block index (in-memory) to add this mapping. Returns handle.

big-vector-block-writer!

(big-vector-block-writer! handle elements-per-chunk)(big-vector-block-writer! handle block-id elements-per-chunk)

Takes a handle, a optional block ID, and the maximum number of elements per chunk. Returns a BigVectorBlockWriter which can have elements appended to it via append-to-big-vector-block!. Those elements, in turn, are appended to a series of newly created FressianStream blocks, each of which is stitched together into a BigVector block with the given ID. As each chunk of writes is finished, the writer automatically writes a new block index, ensuring we can recover at least part of the history from crashes.

The writer is asynchronous: it internally spawns a thread for serialization and IO. Appends to the writer are transferred to the IO thread via a queue; the IO thread then writes them to disk. Closing the writer blocks until the transfer queue is exhausted.

big-vector-block-writer-worker!

(big-vector-block-writer-worker! handle block-id elements-per-chunk queue)

Loop which writes values from a BigVectorBlockWriter’s queue to disk.

big-vector-chunk-size

How many elements should we write to a chunk of a BigVector before starting a new one?

big-vector-count-size

How many bytes do we use to store a bigvector’s count?

big-vector-index-size

How many bytes do we use to store a bigvector element’s index?

block-checksum

(block-checksum header data)

Compute the checksum of a block, given two bytebuffers: one for the header, and one for the data.

block-checksum-given-data-checksum

(block-checksum-given-data-checksum header data-crc)

Computes the checksum of a block, given a ByteBuffer header, and an already-computed CRC32 checksum of the data. Useful for streaming writers which compute their own checksums while writing. Mutates data-crc in place; I can’t figure out how to safely copy it.

block-checksum-offset

Where do we write a checksum in the block header?

block-checksum-size

How long is the checksum for a block?

block-header

(block-header)

Returns a blank ByteBuffer for a block header. All fields zero.

block-header-checksum

(block-header-checksum header)

Fetches the checksum of a block header.

block-header-for-data

(block-header-for-data block-type data)

Takes a block type and a ByteBuffer of data, and constructs a block header whose type is the given type, and which has the appropriate length and checksum for the given data.

block-header-for-length-and-checksum!

(block-header-for-length-and-checksum! block-type data-length data-checksum)

An optimized way to construct a block header, a block type, the length of a data region (not including headers) and the CRC checksum of that data. Mutates the checksum in place.

block-header-length

(block-header-length header)

Fetches the length of a block header.

block-header-size

How long is a block header?

block-header-type

(block-header-type header)

Returns the type of a block header, as a keyword.

block-id-size

How many bytes per block ID?

block-index-data-size

(block-index-data-size index)

Takes a block index and returns the number of bytes required for that block to be written, NOT including headers.

block-index-offset-offset

Where in the file do we write the offset of the index block?

block-len-offset

Where do we write a block length in a block header?

block-len-size

How long is the length prefix for a block?

block-offset-size

How many bytes per block offset address?

block-ref

(block-ref id)

Constructs a new BlockRef object pointing to the given block ID.

block-references

(block-references handle)(block-references handle block-id)

Takes a handle and a block ID, and returns the set of all block IDs which that block references. Right now we do this by parsing the block data; later we might want to move references into block headers. With no block ID, returns references from the root.

block-type->short

A map of block types to integer codes.

block-type-offset

Where do we store the block type in a block header?

block-type-size

How long is the type for a block?

check-block-checksum

(check-block-checksum header data)

Verifies the checksum of a block, given two ByteBuffers: one for the header, and one for the data.

check-magic

(check-magic handle)

Takes a Handle and reads the magic bytes, ensuring they match.

check-version!

(check-version! handle)

Takes a Handle and reads the version. Ensures it’s a version we can decode, and updates the Handle’s version if it hasn’t already been set.

close!

(close! handle)

Closes a Handle

copy!

(copy! r w)

Takes two handles: a reader and a writer. Copies the root and any other referenced blocks from reader to writer.

current-version

The current file version.

Version 0 was the first version of the file format.

Version 1 added support for FressianStream and BigVector blocks.

find-references

(find-references x)

A little helper function for finding BlockRefs in a nested data structure. Returns the IDs of all BlockRefs.

first-block-offset

Where in the file the first block begins.

flush!

(flush! handle)

Flushes writes to a Handle to disk.

fressian-buffer-size

How many bytes should we buffer before writing Fressian data to disk?

fressian-read-handlers

How do we read Fressian data?

fressian-stream-block-writer!

(fressian-stream-block-writer! handle)

Takes a handle. Creates a new block ID, and prepares to write a new FressianStream block at the end of the file. Returns a FressianStreamBlockWriter which can be used to write elements to the FressianStream. When closed, the writer writes the block header and updates the handle’s block index to refer to the new block.

fressian-write-handlers

How do we write Fressian data?

gc!

(gc! file)

Garbage-collects a file (anything that works with io/file) in-place.

IPartialMap

protocol

members

partial-map-rest-id

(partial-map-rest-id this)

large-region-size

How big does a file region have to be before we just mmap it instead of doing file reads?

load-block-index!

(load-block-index! handle)

Takes a handle, reloads its block index from disk, and returns handle.

magic

The magic string at the start of Jepsen files.

magic-offset

Where the magic is written

magic-size

Bytes it takes to store the magic string.

new-block-id!

(new-block-id! handle)

Takes a handle and returns a fresh block ID for that handle, mutating the handle so that this ID will not be allocated again.

next-block-offset

(next-block-offset handle)

Takes a handle and returns the offset of the next block. Right now this is just the end of the file.

open

(open path)

Constructs a new handle for a Jepsen file of the given path (anything which works with io/file).

prep-read!

(prep-read! handle)

Called when we read anything from a handle. Ensures that we’ve checked the magic, version, and loaded the block index.

prep-write!

(prep-write! handle)

Called when we write anything to a handle. Ensures that we’ve written the header before doing anything else. Returns handle.

read-big-vector-block

(read-big-vector-block handle buf)

Takes a handle and a ByteBuffer for a big-vector block. Returns a lazy vector (specifically, a soft chunked vector) representing its data.

read-block-by-id

(read-block-by-id handle id)

Takes a handle and a logical block id. Looks up the offset for the given block and reads it using read-block-by-offset (which includes verifying the checksum).

read-block-by-offset

(read-block-by-offset handle offset)

Takes a Handle and the offset of a block. Reads the block header, validates the checksum, and interprets the block data depending on the block type. Returns a map of:

{:type The block type, as a keyword :offset The offset of this block :length How many bytes are in this block, total :data The interpreted data stored in this block—depends on block type}

read-block-by-offset*

(read-block-by-offset* handle offset)

Takes a Handle and the offset of a block. Reads the block header and data, validates the checksum, and returns a map of:

{:header header, as bytebuffer :data data, as bytebuffer}

read-block-data

(read-block-data handle offset header)

Fetches the ByteBuffer for a block’s data, given a block header stored at the given offset.

read-block-header

(read-block-header handle offset)

Fetches the ByteBuffer for a block header at the given offset.

read-block-index-block

(read-block-index-block handle data)

Takes a ByteBuffer and reads a block index from it: a map of

{:root root-id :blocks {id offset, id2 offset2, …}}

read-block-index-offset

(read-block-index-offset handle)

Takes a handle and returns the current root block index offset from its file. Throws :type ::no-block-index if the block index is 0 or the file is too short.

read-file

(read-file file offset size)

Returns a ByteBuffer corresponding to a given file region. Uses mmap for large regions, or regular read calls for small ones.

read-fressian-block

(read-fressian-block handle data)

Takes a handle and a ByteBuffer of data from a Fressian block. Returns its parsed contents.

read-fressian-stream-block

(read-fressian-stream-block handle data)

Takes a handle and a ByteBuffer of data from a FressianStream block. Returns its contents as a vector.

read-partial-map-block

(read-partial-map-block handle data)

Takes a handle and a ByteBuffer for a partial-map block. Returns a lazy map representing its data.

read-root

(read-root handle)

Takes a handle. Looks up the root block from the current block index and reads it. Returns nil if there is no root.

read-test

(read-test handle)

Reads a test from a handle’s root. Constructs a lazy test map where history and results are loaded as-needed from the file. Leave the handle open so this map can use it; it’ll be automatically closed when this map is GCed. Includes metadata so that this test can be rewritten using write-results!

set-block-header-checksum!

(set-block-header-checksum! buf checksum)

Sets the checksum in a block header. Returns the block header.

set-block-header-length!

(set-block-header-length! buf length)

Sets the length in a block header. Returns the block header.

set-block-header-type!

(set-block-header-type! buf block-type)

Sets the type (a keyword) in a block header. Returns the header.

set-root!

(set-root! handle root-id)

Takes a handle and a block ID. Updates the handle’s block index (in-memory) to point to this block ID as the root. Returns handle.

short->block-type

A map of integers to block types.

test-history-writer!

(test-history-writer! handle test)(test-history-writer! handle test chunk-size)

Takes a handle and a test created with write-initial-test!, and returns a BigVectorBlockWriter for writing operations to the history. Append elements using append-to-big-vector-block!, and .close the writer when done.

version

(version handle)

Returns the version of a Handle.

version-offset

Where in the file the version begins

version-size

Bytes it takes to store a version.

write-big-vector-block!

(write-big-vector-block! handle id element-count chunks)

Takes a handle, a block ID, a count, and a vector of initial-index block-id chunks. Writes a BigVector block with the given count and chunks to the end of the file. Records the freshly written block in the handle’s block index, and returns ID.

write-block!

(write-block! handle offset block-type data)

Writes a block to a handle at the given offset, given a block type as a keyword and a ByteBuffer for the block’s data. Returns handle.

write-block-data!

(write-block-data! handle offset data)

Writes block data to the given block offset (e.g. the address of the header, not the data itself) in the file, backed by the given handle. Returns handle.

write-block-header!

(write-block-header! handle offset block-header)

Writes a block header to the given offset in the file backed by the given handle. Returns handle.

write-block-index!

(write-block-index! handle)(write-block-index! handle offset)

Writes a block index for a Handle, based on whatever that Handle’s current block index is. Automatically generates a new block ID for this index and adds it to the handle as well. Then writes a new block index offset pointing to this block index. Returns handle.

write-block-index-offset!

(write-block-index-offset! handle root)

Takes a handle and the offset of a block index block to use as the new root. Updates the file’s block pointer. Returns handle.

write-file!

(write-file! file offset buffer)

Takes a FileChannel, an offset, and a ByteBuffer. Writes the ByteBuffer to the FileChannel at the given offset completely. Returns number of bytes written.

write-fressian-block!

(write-fressian-block! handle data)(write-fressian-block! handle id data)

Takes a handle, an optional block ID, and some Clojure data. Writes that data to a Fressian block at the end of the file, records the new block in the handle’s block index, and returns the ID of the newly written block.

write-fressian-block!*

(write-fressian-block!* handle offset data)

Takes a handle, a byte offset, and some Clojure data. Writes that data to a Fressian block at the given offset. Returns handle.

write-fressian-to-file!

(write-fressian-to-file! file offset checksum data)

Takes a FileChannel, an offset, a checksum, and a data structure as Fressian. Writes the data structure as Fressian to the file at the given offset. Returns the size of the data that was just written, in bytes. Mutates checksum with written bytes.

write-header!

(write-header! handle)

Takes a Handle and writes the initial magic bytes and version number. Initializes the handle’s version to current-version if it hasn’t already been set. Returns handle.

write-initial-test!

(write-initial-test! handle test)

Writes an initial test to a handle, making the test the root. Creates an (initially nil) block for the history. Called when we first begin a test. Returns test with additional metadata, so we can write the history and results later.

write-partial-map-block!

(write-partial-map-block! handle m rest-id)(write-partial-map-block! handle id m rest-id)

Takes a handle, a Clojure map, and the ID of the block which stores the rest of the map (use nil if there is no more data to the PartialMap). Writes the map to a new PartialMap block, records it in the handle’s block index, and returns the ID of this block itself. Optionally takes an explicit ID for this block.

write-partial-map-block!*

(write-partial-map-block!* handle offset m rest-id)

Takes a handle, a byte offset, a Clojure map, and the ID of the block which stores the rest of the map (use nil if there is no more to the PartialMap). Writes the map and rest pointer to a PartialMap block at the given offset. Returns handle.

write-test!

(write-test! handle test)

Writes an entire test map to a handle, making the test the root. Useful for re-writing a completed test that’s already in memory, and migrating existing Fressian tests to the new format. Returns handle.

write-test-with-history!

(write-test-with-history! handle test)

Takes a handle and a test created with write-initial-test!, and writes it again as the root. Used for rewriting a test after running it, but before analysis, in case there’s state that changed. Returns test.

write-test-with-results!

(write-test-with-results! handle test)

Takes a handle and a test created with write-initial-test!, and appends its :results as a partial map block: :valid? in the top tier, and other results below. Writes test using those results and history blocks. Returns test, with ::results-id metadata pointing to the block ID of these results.