jepsen.store.format

Jepsen tests are logically a map. To save this map to disk, we originally wrote it as a single Fressian file. This approach works reasonably well, but has a few problems:

We write test files multiple times: once at the end of a test, and again once the analysis is complete, so that the test is saved even if the analysis fails. Rewriting the entire file is inefficient. It would be nice to incrementally append new state.
Histories are enormous relative to tests, but we force readers to deserialize them before being able to get to any other high-level keys in the test–for instance, the result map.
It might be nice, someday, to have histories bigger than fit into memory.
We have no way to incrementally write the history, which means if a test crashes during the run we lose everything.
Deserializing histories is a linear process, but it would be nice for analyses to be able to parallelize.
The web view needs a little metadata quickly: the name, the date, the valid field of the result map. Forcing it to deserialize the entire world to get this information is bad.
Likewise, loading tests at the REPL is cumbersome: if all you want is the results, you should be able to skip the history. Working with the history should ideally be lazy.

I held off on designing a custom serialization format for Jepsen for many years, but at this point the design constraints feel pretty well set, and I think the time is right to design a custom format.

File Format Structure

Jepsen files begin with the magic UTF-8 string JEPSEN, followed by a 32-bit big-endian unsigned integer version field, which we use to read old formats when necessary. Then there is a 64-bit offset into the file where the block index (the metadata structure) lives. There follows a series of blocks:

    6         32              64

"JEPSEN" | version | block-index-offset | block 1 | block 2 | …

In general, files are written by appending blocks sequentially to the end of the file—this allows Jepsen to write files in (mostly) a single pass, without moving large chunks of bytes around. When one is ready to save the file, one writes a new index block to the end of the file which provides the offsets of all the (active) blocks in the file, and finally updates the block-index-offset at the start of the file to point to that most-recent index block.

All integers are signed and big-endian, unless otherwise noted. This is the JVM, after all.
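The header layout above can be sketched in a few lines. This is a Python illustration rather than Jepsen's own Clojure code; the names `HEADER`, `pack_header`, and `unpack_header` are hypothetical, and the version and offset values in the usage line are made up.

```python
import struct

MAGIC = b"JEPSEN"  # 6-byte magic string

# ">" means big-endian with no padding:
# 6-byte magic, unsigned 32-bit version, signed 64-bit block-index offset.
HEADER = struct.Struct(">6sIq")  # 6 + 4 + 8 = 18 bytes


def pack_header(version, index_offset):
    """Encode the file header."""
    return HEADER.pack(MAGIC, version, index_offset)


def unpack_header(buf):
    """Decode the header, returning (version, block-index offset)."""
    magic, version, index_offset = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError(f"not a Jepsen file: magic was {magic!r}")
    return version, index_offset


# Illustrative values only: version 1, index block at byte 1234.
header = pack_header(version=1, index_offset=1234)
print(len(header), unpack_header(header))
```

Because the block-index offset sits at a fixed position, a writer can overwrite just those eight bytes once a new index block has been appended.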

Blocks may be sparse–their lengths may be shorter than the distance to the start of the next block. This is helpful if one needs to rewrite blocks later: you can leave padding for their sizes to change.

The top-level value of the file (e.g. the test map) is given in the block index.

Block Structure

All blocks begin with an 8-byte length prefix which indicates the length of the block in bytes, including the length prefix itself. Then follows a CRC32 checksum. Third, we have a 16-bit block type field, which identifies how to interpret the block. Finally, we have the block’s data, which is type-dependent.

  64        32       16

| length | checksum | type | … data …

Checksums are computed by taking the CRC32 of the data region, THEN the block header: the length, the checksum (all zeroes, for purposes of computing the checksum itself), and the type. We compute checksums this way so that writers can write large blocks of data with an unknown size in a single pass.
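Here is a minimal Python sketch of that checksum order, using `zlib.crc32` with a running value. It assumes the length field counts the 14-byte header plus the data, and it stores the CRC as an unsigned 32-bit value (the same bits as a signed int). `write_block` and `read_block` are hypothetical helper names, not Jepsen's API.

```python
import struct
import zlib

# Big-endian: signed 64-bit length, 32-bit checksum, signed 16-bit type.
BLOCK_HEADER = struct.Struct(">qIh")  # 8 + 4 + 2 = 14 bytes


def block_checksum(length, block_type, data):
    """CRC32 of the data region first, THEN the header with a zeroed checksum."""
    crc = zlib.crc32(data)
    return zlib.crc32(BLOCK_HEADER.pack(length, 0, block_type), crc)


def write_block(block_type, data):
    """Return the bytes of a complete block: header followed by data."""
    length = BLOCK_HEADER.size + len(data)  # assumption: length includes header
    checksum = block_checksum(length, block_type, data)
    return BLOCK_HEADER.pack(length, checksum, block_type) + data


def read_block(buf, offset=0):
    """Parse and verify a block, returning (type, data)."""
    length, checksum, block_type = BLOCK_HEADER.unpack_from(buf, offset)
    data = buf[offset + BLOCK_HEADER.size:offset + length]
    if block_checksum(length, block_type, data) != checksum:
        raise ValueError("block checksum mismatch")
    return block_type, data


block = write_block(2, b"hello")
print(len(block), read_block(block))
```

Feeding the data to the CRC first means a writer can stream data of unknown size, keep a running CRC as it goes, and only fold in the header once the final length is known.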

Index Blocks (Type 1)

An index block lays out the overall arrangement of the file: it stores a map of logical block numbers to file offsets, and also stores a root id, which identifies the block containing the top-level test map. The root id comes first, and is followed by the block map: a series of pairs, each a 32-bit logical block ID and a 64-bit offset into the file.

 32      32       64       32      64

root id | id 1 | offset 1 | id 2 | offset 2 | …

There is no block with ID 0: 0 is used as a nil sentinel when one wishes to indicate the absence of a block.
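To make the index layout concrete, here is a small Python sketch that encodes and decodes an index block's data region. The function names are hypothetical, and sorting entries by ID is my own choice for determinism, not something the format requires.

```python
import struct

ROOT = struct.Struct(">i")    # 32-bit root block ID
ENTRY = struct.Struct(">iq")  # 32-bit block ID, 64-bit file offset


def encode_index(root_id, offsets):
    """Encode a root ID and a {block_id: offset} map as index block data."""
    out = bytearray(ROOT.pack(root_id))
    for block_id, offset in sorted(offsets.items()):
        if block_id == 0:
            raise ValueError("block ID 0 is reserved as the nil sentinel")
        out += ENTRY.pack(block_id, offset)
    return bytes(out)


def decode_index(data):
    """Decode index block data into (root_id or None, {block_id: offset})."""
    (root_id,) = ROOT.unpack_from(data, 0)
    offsets = dict(ENTRY.iter_unpack(data[ROOT.size:]))
    return (root_id or None), offsets  # 0 means "no root block"


data = encode_index(2, {1: 18, 2: 40})
print(len(data), decode_index(data))
```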

Fressian Blocks (Type 2)

A Fressian block encodes data (often a key-value map) using the Fressian serialization format. This is already the workhorse for Jepsen serialization, but we introduce a twist: large values, like the history and results, can be stored in other blocks. That way you don’t have to deserialize the entire thing in order to read the top-level structure.

We create a special datatype, BlockRef, which we encode as a ‘block-ref’ tag in Fressian. This ref simply contains the ID of the block which encodes that tag’s value.

| fressian data … |

PartialMap (Type 3)

Results are a bit weird. We want to efficiently fetch the :valid? field from them, but the rest of the result map could be enormous. To speed this up, we want to be able to write part of a map (for instance, just the results' :valid? field), and store the rest in a different block.

A PartialMap is essentially a cons cell: it comprises a Fressian-encoded map and a pointer to the ID of a rest block (also a PartialMap) which encodes the remainder of the map. This makes access to those parts of the map encoded in the head cell fast.

| rest-ptr | fressian data …

When rest-ptr is 0, that indicates there is no more data remaining.
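The cons-cell lookup can be sketched as a loop over cells. This Python version models blocks as an in-memory dict of `block_id -> (rest_ptr, head_map)` and uses string keys in place of Clojure keywords; both are simplifications for illustration, and `partial_map_get` is a hypothetical name.

```python
def partial_map_get(blocks, block_id, key, default=None):
    """Look up key in a chain of PartialMap cells.

    Returns (value, cells_read) so the cost of the lookup is visible.
    """
    cells_read = 0
    while block_id != 0:  # rest-ptr 0 means there is no more data
        rest_ptr, head = blocks[block_id]
        cells_read += 1
        if key in head:
            return head[key], cells_read
        block_id = rest_ptr
    return default, cells_read


# Block 4 holds only the validity flag; block 5 holds everything else.
blocks = {
    4: (5, {"valid?": True}),
    5: (0, {"anomalies": ["G1a"], "stats": {"count": 100000}}),
}

print(partial_map_get(blocks, 4, "valid?"))     # found in the head cell
print(partial_map_get(blocks, 4, "anomalies"))  # requires reading the rest
```

Keys stored in the head cell cost one block read; anything else forces a walk down the chain, which is exactly the trade-off described above.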

FressianStream (Type 4)

A FressianStream block allows us to write multiple Fressian-encoded values into a single block. We represent it as:

| fressian data 1 … | fressian data 2 … | …

Writers can write any number of Fressian-encoded values to the stream one after the next. Readers start at the beginning and read values until the block is exhausted. There is no count associated with this block type; it must be inferred by reading all elements. We generally deserialize streams as vectors to enable O(1) access and faster reductions over elements.
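The read-until-exhausted loop looks roughly like the following. Python has no Fressian codec in its standard library, so this sketch swaps in `pickle` as a stand-in: like Fressian values, pickled values are self-delimiting, so they can be concatenated with no count or separators. The function names are hypothetical.

```python
import io
import pickle


def write_stream(values):
    """Concatenate encoded values back to back, with no count or framing."""
    buf = io.BytesIO()
    for value in values:
        pickle.dump(value, buf)  # stand-in for writing one Fressian value
    return buf.getvalue()


def read_stream(data):
    """Read values from the start until the block's data is exhausted."""
    buf = io.BytesIO(data)
    values = []
    while buf.tell() < len(data):
        values.append(pickle.load(buf))  # stand-in for reading one value
    return values  # a list, giving O(1) access by position


ops = [
    {"type": "invoke", "f": "read", "value": None},
    {"type": "ok", "f": "read", "value": 3},
]
print(read_stream(write_stream(ops)) == ops)
```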

BigVector (Type 5)

Histories are chonky boys. 100K operations (each a map) are common, and it’s conceivable we might want to work with histories of tens of millions of operations. We also want to write them incrementally, so that we can recover from crashes. It’s also nice to be able to deserialize small bits of the history, or to reduce over it in parallel. To do this, we need a streaming format for large vectors.

We write each chunk of the vector as a separate block. Then we refer to those chunks with a BigVector, which stores some basic metadata about the vector as a whole, and then pointers to each block. Its format is:

 64        64        32          64         32

count | index 1 | pointer 1 | index 2 | pointer 2 | …

Count is the number of elements in the vector overall. Index 1 is always 0–the offset of the first element in the first chunk. Pointer 1 is the block ID of the Fressian block which contains the first chunk’s data. Index 2 is the index of the first element in the second chunk, and pointer 2 is the block ID of the second chunk’s data, and so on.

Chunk data can be stored in a Fressian block, a FressianStream block, or another BigVector.

Access to BigVectors looks very much like a regular Clojure vector. We deserialize chunks on-demand, caching results as they’re accessed. We can offer O(1) count through the count field. We implement nth by finding the chunk a given index belongs to and then looking up the index in that chunk. Assoc works by assoc’ing into that particular chunk, leaving other chunks unchanged.
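Here is a minimal Python sketch of the lookup described above: decode the count and the (index, pointer) pairs, binary-search for the chunk that owns an index, and load that chunk on demand. `load_block` is a hypothetical callback standing in for whatever reads and decodes a chunk block by ID; assoc is left out.

```python
import bisect
import struct


def decode_big_vector(data):
    """Split BigVector data into (count, [(first_index, block_id), ...])."""
    (count,) = struct.unpack_from(">q", data, 0)
    return count, list(struct.iter_unpack(">qi", data[8:]))


class BigVector:
    def __init__(self, count, chunks, load_block):
        self.count = count
        self.starts = [start for start, _ in chunks]         # first index per chunk
        self.block_ids = [block_id for _, block_id in chunks]
        self.load_block = load_block
        self.cache = {}                                      # chunk number -> elements

    def __len__(self):
        return self.count  # O(1), straight from the count field

    def __getitem__(self, i):
        if not 0 <= i < self.count:
            raise IndexError(i)
        c = bisect.bisect_right(self.starts, i) - 1  # chunk containing index i
        if c not in self.cache:
            self.cache[c] = self.load_block(self.block_ids[c])
        return self.cache[c][i - self.starts[c]]


# Two chunks stored in (hypothetical) blocks 3 and 7.
chunk_blocks = {3: ["a", "b", "c"], 7: ["d", "e"]}
loaded = []


def load_block(block_id):
    loaded.append(block_id)
    return chunk_blocks[block_id]


data = struct.pack(">q", 5) + struct.pack(">qi", 0, 3) + struct.pack(">qi", 3, 7)
count, chunks = decode_big_vector(data)
v = BigVector(count, chunks, load_block)
print(len(v), v[4], loaded)  # only the second chunk has been loaded so far
```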

That’s It

There’s a lot of obvious stuff I’ve left out here–metadata, top-level integrity checks, garbage collection, etc etc… but I think we can actually skip almost all of it and get a ton of benefit for the limited use case Jepsen needs.

To write a file, we:

1. Write the header.
2. Write an empty vector as block 1, for the history.
3. Write the initial test map as a PartialMap block to block 2, pointing to block 1 as the history. Write an index block pointing to 2 as the root.
4. Write the history incrementally as the test proceeds. Write operations as they occur to a new FressianStream block. Periodically, and at the end of the history:

   a. Seal that FressianStream block, writing the headers. Call that block id B.
   b. Write a new version of the history block with a new chunk appended: B.
   c. Write a new index block with the new history block version.

   This ensures that if we crash during the run, we can recover at least some of the history up to the most recent checkpoint.

5. Write the results as a PartialMap to blocks 4 and 5: 4 containing the :valid? field, and 5 containing the rest of the results.
6. The test may contain state which changed by the end of the test, and we might want to save that state. Write the entire test map again as block 6, again using block 1 as the history, and now block 4 as the results map. Write a new index block with block 6 as the root.
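The checkpointing in steps a to c is easier to see in a toy model. The Python sketch below is not Jepsen's implementation: it models the file as an append-only log of block versions plus the most recently saved index, and every name in it (`AppendOnlyFile`, `checkpoint`, and so on) is hypothetical.

```python
class AppendOnlyFile:
    """Toy model: a log of block versions plus the last saved index."""

    def __init__(self):
        self.log = []      # every block version ever written, in file order
        self.latest = {}   # block ID -> position of its newest version
        self.index = {}    # block ID -> position, as of the last saved index
        self.root = 0      # root block ID in the last saved index (0 = none)
        self.next_id = 1   # IDs start at 1; 0 is the nil sentinel

    def new_block_id(self):
        block_id, self.next_id = self.next_id, self.next_id + 1
        return block_id

    def write_block(self, block_id, value):
        """Append a block version. Readers cannot see it until save_index."""
        self.log.append(value)
        self.latest[block_id] = len(self.log) - 1

    def save_index(self, root):
        """Model appending an index block and updating the header offset."""
        self.index, self.root = dict(self.latest), root

    def read(self, block_id):
        """Read a block the way a reader would: through the saved index."""
        return self.log[self.index[block_id]]


def checkpoint(f, history_id, chunk_ids, ops):
    chunk_id = f.new_block_id()
    f.write_block(chunk_id, list(ops))          # a. seal the stream block B
    chunk_ids.append(chunk_id)
    f.write_block(history_id, list(chunk_ids))  # b. new history block version
    f.save_index(f.root)                        # c. new index block


f = AppendOnlyFile()
history_id = f.new_block_id()  # block 1: empty history
f.write_block(history_id, [])
test_id = f.new_block_id()     # block 2: initial test map
f.write_block(test_id, {"name": "demo", "history": history_id})
f.save_index(root=test_id)

chunk_ids = []
checkpoint(f, history_id, chunk_ids, ["op0", "op1"])

# Simulate a crash: this chunk is written but never checkpointed.
lost_id = f.new_block_id()
f.write_block(lost_id, ["op2"])

recovered = [op for chunk_id in f.read(history_id) for op in f.read(chunk_id)]
print(recovered)
```

Old versions of the history block stay in the log as garbage; only the saved index decides which version a reader sees.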

To read this file, we:

1. Check the magic and version.
2. Read the index block offset.
3. Read the index block into memory.
4. Look up the root block ID, use the index to work out its offset, read that block, and decode it into a lazy map structure.
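Putting the read steps together, here is a small end-to-end Python sketch that builds a tiny file in memory and reads its root block back. It makes several assumptions: block length counts header plus data, the version number is made up, JSON stands in for Fressian, and the root is decoded eagerly rather than into a lazy map. All helper names are hypothetical.

```python
import io
import json
import struct
import zlib

HEADER = struct.Struct(">6sIq")  # magic, version, block-index offset
BLOCK = struct.Struct(">qIh")    # length, checksum, type
INDEX, FRESSIAN = 1, 2           # block types from this document


def checksum(length, block_type, data):
    return zlib.crc32(BLOCK.pack(length, 0, block_type), zlib.crc32(data))


def make_block(block_type, data):
    length = BLOCK.size + len(data)
    return BLOCK.pack(length, checksum(length, block_type, data), block_type) + data


def read_block(f, offset):
    f.seek(offset)
    length, crc, block_type = BLOCK.unpack(f.read(BLOCK.size))
    data = f.read(length - BLOCK.size)
    if checksum(length, block_type, data) != crc:
        raise ValueError("block checksum mismatch")
    return block_type, data


def read_root(f):
    f.seek(0)
    magic, version, index_offset = HEADER.unpack(f.read(HEADER.size))
    if magic != b"JEPSEN":                          # check the magic
        raise ValueError("not a Jepsen file")
    block_type, data = read_block(f, index_offset)  # read the index block
    if block_type != INDEX:
        raise ValueError("expected an index block")
    (root_id,) = struct.unpack_from(">i", data, 0)
    offsets = dict(struct.iter_unpack(">iq", data[4:]))
    _, root_data = read_block(f, offsets[root_id])  # read the root block
    return json.loads(root_data)                    # JSON stands in for Fressian


# Build a file with one root block (ID 1) followed by its index block.
root = make_block(FRESSIAN, json.dumps({"name": "demo"}).encode())
root_offset = HEADER.size
index_offset = root_offset + len(root)
index = make_block(INDEX, struct.pack(">i", 1) + struct.pack(">iq", 1, root_offset))
f = io.BytesIO(HEADER.pack(b"JEPSEN", 1, index_offset) + root + index)

print(read_root(f))
```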

When it comes time to reference the results or history in that lazy map, we look up the right block in the block index, seek to that offset, and decode whatever’s there.
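One way to picture that lazy map is a wrapper that resolves block references only when a key is actually read. This Python sketch is an illustration under assumptions, not Jepsen's data structure: `BlockRef` here is a plain dataclass, and `load_block` is a hypothetical callback that looks up a block ID in the index, seeks, and decodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockRef:
    """A pointer to the block which holds a value."""
    id: int


class LazyMap:
    def __init__(self, head, load_block):
        self._head = head              # decoded top-level map, may hold BlockRefs
        self._load_block = load_block  # block ID -> decoded value
        self._cache = {}

    def __getitem__(self, key):
        value = self._head[key]
        if not isinstance(value, BlockRef):
            return value               # small values live in the map itself
        if key not in self._cache:     # fetch referenced blocks at most once
            self._cache[key] = self._load_block(value.id)
        return self._cache[key]


stored = {1: ["op0", "op1"], 4: {"valid?": True}}
reads = []


def load_block(block_id):
    reads.append(block_id)
    return stored[block_id]


test = LazyMap(
    {"name": "demo", "history": BlockRef(1), "results": BlockRef(4)},
    load_block,
)
print(test["name"], reads)     # no blocks read yet
print(test["results"], reads)  # only the results block has been read
```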

Decoding a block is straightforward. We grab the length header, run a CRC over that region of the file, check the block type, then decode the remaining data based on the block structure.

append-to-big-vector-block!

(append-to-big-vector-block! w element)

Appends an element to a BigVector block writer. This function is asynchronous and returns as soon as the writer’s queue has accepted the element. Close the writer to complete the process. Returns writer.

Generated by Codox

Jepsen 0.3.8
