jepsen.tests.kafka

This workload is intended for systems which behave like the popular Kafka queue. This includes Kafka itself, as well as compatible systems like Redpanda.

At the abstract level of this workload, these systems provide a set of totally-ordered append-only logs called partitions, each of which stores a single arbitrary (and, for our purposes, unique) message at a particular offset into the log. Partitions are grouped together into topics: each topic is therefore partially ordered.

Each client has a producer and a consumer aspect; in Kafka these are separate clients, but for Jepsen’s purposes we combine them. A producer can send a message to a topic-partition, which assigns it a unique, theoretically monotonically-increasing offset and saves it durably at that offset. A consumer can subscribe to a topic, in which case the system automatically assigns it any number of partitions in that topic; this assignment can change at any time. Consumers can also assign themselves specific partitions manually. When a consumer polls, it receives messages and their offsets from whatever topic-partitions it is currently assigned to, and advances its internal state so that the next poll (barring a change in assignment) receives the immediately following messages.

Operations

To subscribe to a new set of topics, or to assign a new set of partitions, we issue an operation like:

{:f :subscribe, :value [k1, k2, …]}

or

{:f :assign, :value [k1, k2, …]}

… where k1, k2, etc denote specific partitions. For subscribe, we convert those partitions to the topics which contain them, and subscribe to those topics; the database then controls which specific partitions we get. Just like the Kafka client API, both subscribe and assign replace the current topics for the consumer.

Assign ops can also have a special key :seek-to-beginning? true which indicates that the client should seek to the beginning of all its partitions.

Reads and writes (and mixes thereof) are encoded as a vector of micro-operations:

{:f :poll, :value [op1, op2, …]}

{:f :send, :value [op1, op2, …]}

{:f :txn, :value [op1, op2, …]}

Where :poll and :send denote transactions comprising only reads or writes, respectively, and :txn indicates a general-purpose transaction. Micro-operations are of two forms:

[:send key value]

… instructs a client to append value to the integer key, which maps uniquely to a single topic and partition. These operations are returned as:

[:send key [offset value]]

where offset is the returned offset of the write, if available, or nil if it is unknown (e.g. if the write times out).

Reads are invoked as:

[:poll]

… which directs the client to perform a single poll operation on its consumer. The results of that poll are expanded to:

[:poll {key1 [[offset1 value1] [offset2 value2] …], key2 […]}]

Where key1, key2, etc are integer keys obtained from the topic-partitions returned by the call to poll, and the value for that key is a vector of [offset value] pairs, corresponding to the offset of that message in that particular topic-partition, and the value of the message: presumably, whatever was written by [:send key value] earlier.
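As a minimal sketch of the shapes described above, here is one hypothetical completed :txn operation and two helpers which pull the sent and polled messages out of it. The helper names (polled-pairs, sent-pairs) and the example data are assumptions made for illustration, not part of this namespace; the sketch also assumes that completed sends and polls carry nested [offset value] pairs.

```clojure
;; Hypothetical completed transaction: one send to key 3, then one poll.
(def example-op
  {:type  :ok
   :f     :txn
   :value [[:send 3 [5 10]]            ; value 10 landed at offset 5 of key 3
           [:poll {3 [[3 8] [4 9]]     ; key 3: offsets 3 and 4
                   7 [[0 20]]}]]})     ; key 7: offset 0

(defn polled-pairs
  "Merges every :poll micro-op of a completed op into a map of
  key -> vector of [offset value] pairs, in the order they were polled."
  [op]
  (->> (:value op)
       (filter #(= :poll (first %)))
       (map second)
       (apply merge-with into {})))

(defn sent-pairs
  "Map of key -> vector of [offset value] pairs for the :send micro-ops
  of a completed op. The offset is nil when the write's fate is unknown."
  [op]
  (reduce (fn [m [f k v]]
            (if (= :send f)
              (update m k (fnil conj []) v)
              m))
          {}
          (:value op)))
```

Calling (polled-pairs example-op) groups the polled messages by key, which is the form most of the analyses below work from.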

When polling without using assign, clients should call .commitSync before returning a completion operation.

Before a transaction completes, we commit its offsets.

All transactions may return an optional key :rebalance-log, which is a vector of rebalancing events (changes in assigned partitions) that occurred during the execution of that transaction. Each rebalance event is a map like:

{:keys [k1 k2 …]}

There may be more keys in this map; I can’t remember right now.

Topic-partition Mapping

We identify topics and partitions using abstract integer keys, rather than explicit topics and partitions. The client is responsible for mapping these keys bijectively to topics and partitions.
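One way a client might implement such a bijection is sketched below. Everything here is an assumption made for illustration: the fixed partition-count, the "t<n>" topic naming scheme, and the function names are hypothetical, and a real client is free to choose any invertible mapping. Keys are assumed to be non-negative integers.

```clojure
(def partition-count
  "Hypothetical number of partitions in every topic."
  4)

(defn k->topic-partition
  "Maps a non-negative integer key to a topic name and partition number."
  [k]
  {:topic     (str "t" (quot k partition-count))
   :partition (mod k partition-count)})

(defn topic-partition->k
  "Inverse of k->topic-partition."
  [{:keys [topic partition]}]
  (+ (* partition-count (Long/parseLong (subs topic 1)))
     partition))

(defn ks->topics
  "The set of topics containing the given keys, as a :subscribe
  operation would need."
  [ks]
  (into (sorted-set) (map (comp :topic k->topic-partition)) ks))
```

With this scheme, key 6 maps to partition 2 of topic "t1", and subscribing to keys 0, 1, and 6 means subscribing to topics "t0" and "t1".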

Analysis

From this history we can perform a number of analyses:

For any observed value of a key, we check to make sure that its writer was either :ok or :info; if the writer :failed, we know this constitutes an aborted read.
We verify that all sends and polls agree on the value for a given key and offset. We do not require contiguity in offsets, because transactions add invisible messages which take up an offset slot but are not visible to the API. If we find divergence, we know that Kafka disagreed about the value at some offset.

Having verified that each [key offset] pair uniquely identifies a single value, we eliminate the offsets altogether and perform the remainder of the analysis purely in terms of keys and values. We construct a graph where vertices are values, and an edge v1 -> v2 means that v1 immediately precedes v2 in the offset order (ignoring gaps in the offsets, which we assume are due to transaction metadata messages).
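The offset agreement check and the construction of the value graph can be sketched as follows. This is a simplified illustration rather than this namespace's implementation: it assumes the history has already been flattened into hypothetical [key offset value] observations drawn from both sends and polls, and the function names are invented for the example.

```clojure
(defn offset-values
  "Map of [key offset] -> set of values observed there. Observations
  with an unknown (nil) offset are skipped."
  [observations]
  (reduce (fn [m [k offset v]]
            (if (nil? offset)
              m
              (update m [k offset] (fnil conj #{}) v)))
          {}
          observations))

(defn inconsistent-offsets
  "Those [key offset] pairs where sends and polls saw more than one value."
  [observations]
  (into {}
        (filter (fn [[_ vs]] (< 1 (count vs))))
        (offset-values observations)))

(defn version-order
  "Values of key k sorted by offset, with the offsets (and any gaps
  between them) discarded. Assumes offsets for k are consistent."
  [observations k]
  (->> observations
       (filter (fn [[k' offset _]] (and (= k k') (some? offset))))
       (sort-by second)
       (map #(nth % 2))
       dedupe
       vec))

(defn precedence-edges
  "Edges [v1 v2] meaning v1 immediately precedes v2 in a version order."
  [order]
  (mapv vec (partition 2 1 order)))
```

For observations of key 1 at offsets 0, 2, and 5, the version order has three values and two edges; the missing offsets 1, 3, and 4 leave no trace.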

For each key, we take the highest observed offset, and then check that every :ok :send operation with an equal or lower offset was also read by at least one consumer. If we find a send which no consumer read, we know a write was lost!
We build a dependency graph between pairs of transactions T1 and T2, where T1 != T2, like so:

ww. T1 sent value v1 to key k, and T2 sent v2 to k, and o1 < o2 in the version order for k, where o1 and o2 are the positions of v1 and v2.

wr. T1 sent v1 to k, and T2’s highest read of k was v1.

rw. T1’s highest read of key k was offset o1, and T2 sent offset o2 to k, and o1 < o2 in the version order for k.

Our use of “highest offset” is intended to capture the fact that each poll operation observes a range of offsets, but in general those offsets could have been generated by many transactions. If we drew wr edges for every offset polled, we’d generate superfluous edges: all writers are already related via ww dependencies, so the final wr edge, plus those ww edges, captures those earlier read values.

We draw rw edges only for the final versions of each key observed by a transaction. If we drew rw edges for an earlier version, we would incorrectly be asserting that later transactions were not observed!

We perform cycle detection and categorization of anomalies from this graph using Elle.
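The three edge rules can be sketched directly from their definitions. The transaction representation here is an assumption made for illustration: each transaction is a hypothetical map with an :id, the values it sent per key under :sends, and its highest read per key under :highest-read. The sketch only derives edges; it does not perform the cycle search, which is Elle's job.

```clojure
(defn dependency-edges
  "Takes a map of key -> version order (a vector of values) and a
  collection of transactions. Returns a set of [t1-id type t2-id] edges,
  where type is :ww, :wr, or :rw."
  [version-orders txns]
  (set
    (for [t1 txns
          t2 txns
          :when (not= (:id t1) (:id t2))
          [k order] version-orders
          :let [pos (zipmap order (range))            ; value -> position
                w1s (keep pos (get-in t1 [:sends k])) ; positions T1 wrote
                w2s (keep pos (get-in t2 [:sends k])) ; positions T2 wrote
                r1  (pos (get-in t1 [:highest-read k]))
                r2  (pos (get-in t2 [:highest-read k]))]
          edge (concat
                 ;; ww: some write of T1 precedes some write of T2
                 (when (some (fn [w1] (some #(< w1 %) w2s)) w1s) [:ww])
                 ;; wr: T2's highest read of k was written by T1
                 (when (and r2 (some #{r2} w1s)) [:wr])
                 ;; rw: T1's highest read of k precedes a write of T2
                 (when (and r1 (some #(< r1 %) w2s)) [:rw]))]
      [(:id t1) edge (:id t2)])))

;; Hypothetical example: three transactions appending to key 1 in turn.
(def example-orders {1 [10 11 12]})

(def example-txns
  [{:id 0, :sends {1 [10]}}
   {:id 1, :sends {1 [11]}, :highest-read {1 10}}
   {:id 2, :sends {1 [12]}, :highest-read {1 11}}])
```

In this example every edge points from an earlier transaction to a later one, so the graph is acyclic.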

Internal Read Contiguity: Within a transaction, each pair of reads on the same key should be directly related in the version order. If we observe a gap (e.g. v1 < … < v2) that indicates this transaction skipped over some values. If we observe an inversion (e.g. v2 < v1, or v2 < … < v1) then we know that the transaction observed an order which disagreed with the “true” order of the log.
Internal Write Contiguity: Gaps between sequential pairs of writes to the same key are detected via Elle as write cycles. Inversions are not, so we check for them explicitly: a transaction sends v1, then v2, but v2 < v1 or v2 < … < v1 in the version order.
Intermediate reads? I assume these happen constantly, but are they supposed to? It’s not totally clear what this MEANS, but I think it might look like a transaction T1 which writes v1 v2 v3 to k, and another T2 which polls k and observes any of v1, v2, or v3, but not all of them. This miiight be captured as a wr-rw cycle in some cases, but perhaps not all, since we’re only generating rw edges for final reads.
Precommitted reads. These occur when a transaction observes a value that it wrote. This is fine in most transaction systems, but illegal in Kafka, which assumes that consumers (running at read committed) never observe uncommitted records.
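The internal read contiguity check above can be sketched as a classification of successive pairs of values which one transaction polled from a single key. The function names and the :skip and :nonmonotonic labels are assumptions made for illustration, and the sketch assumes every polled value appears in the version order.

```clojure
(defn classify-pair
  "Given a map of value -> position in the version order, classifies two
  successively polled values as :contiguous, :skip, or :nonmonotonic."
  [pos v1 v2]
  (let [delta (- (pos v2) (pos v1))]
    (cond (= 1 delta) :contiguous     ; v1 immediately precedes v2
          (< 1 delta) :skip           ; v1 < ... < v2
          :else       :nonmonotonic))) ; v2 is at or before v1

(defn read-contiguity-errors
  "Takes a version order for one key and the values a single transaction
  polled from that key, in order. Returns the pairs which are not
  directly related in the version order."
  [order polled]
  (let [pos (zipmap order (range))]
    (for [[v1 v2] (partition 2 1 polled)
          :let [type (classify-pair pos v1 v2)]
          :when (not= :contiguous type)]
      {:type type, :values [v1 v2]})))
```

For a version order of 10, 11, 12, 13, a transaction that polls 10, then 12, then 11 yields one skip (10 to 12) and one inversion (12 to 11). The same comparison applies to sequential writes when checking for inversions in internal write contiguity.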

allowed-error-types

(allowed-error-types test)

Redpanda does a lot of things that are interesting to know about, but not necessarily bad or against-spec. For instance, g0 cycles are normal in the Kafka transactional model, and g1c is normal with wr-only edges at read-uncommitted but not with read-committed. This is a very ad-hoc attempt to encode that so that Jepsen’s valid/invalid results are somewhat meaningful.

Takes a test, and returns a set of keyword error types (e.g. :poll-skip) which this test considers allowable.
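A rough sketch of how such a set might feed into a valid/invalid decision follows. It does not call allowed-error-types itself; the shape of the errors map (error type -> cases), the function names, and the :lost-write keyword are assumptions made for illustration.

```clojure
(require '[clojure.set :as set])

(defn unexpected-error-types
  "Error types present in errors (a map of error type -> cases) which
  are not in the allowed set."
  [allowed errors]
  (set/difference (set (keys errors)) allowed))

(defn valid-given-allowed?
  "True if every error type found is one the test considers allowable."
  [allowed errors]
  (empty? (unexpected-error-types allowed errors)))
```

With an allowed set of #{:poll-skip}, a result containing only :poll-skip cases would count as valid, while one that also contains a hypothetical :lost-write entry would not.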

Generated by Codox

Jepsen 0.3.8
