Perseus is a set of scripts to investigate a distributed database's responsiveness when one of its three nodes is isolated from its peers.
|Database|Downtime on isolation|Partial downtime on isolation|Disturbance time|Downtime on recovery|Partial downtime on recovery|Recovery time|Version|
|---|---|---|---|---|---|---|---|
|TiDB (1)|15s|1s|16s|82s|8s|114s|PD: 1.1.0|
- Downtime on isolation: complete unavailability (none of the three nodes accepts reads or writes) during the transition from a steady three-node state to a steady two-node state caused by isolating the third node
- Partial downtime on isolation: only one node is available during the 3-to-2 transition
- Disturbance time: time between the isolation and the moment the two remaining nodes become steady
- Downtime on recovery: complete unavailability during the transition from a steady two-node state back to a steady three-node state caused by the missing node rejoining
- Partial downtime on recovery: only one node is available during the 2-to-3 transition
- Recovery time: time between connectivity being restored and all three nodes being available for reads and writes
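The availability metrics above can be derived mechanically from the per-second statistics the client dumps: a second counts as complete downtime when no node makes any increments, and as partial downtime when exactly one node makes progress. A minimal sketch (the tuple-per-second input format is my own simplification, not the suite's actual log format):

```python
# Classify each one-second sample of per-node increment counts.
# A second is "downtime" when no node made progress and
# "partial downtime" when exactly one node did.

def classify_seconds(samples):
    """samples: list of (node1, node2, node3) increments per second."""
    downtime = sum(1 for s in samples if all(v == 0 for v in s))
    partial = sum(1 for s in samples if sum(1 for v in s if v > 0) == 1)
    return downtime, partial

# Example: two seconds of full downtime, one second with a single
# live node, then all three nodes serving again.
result = classify_seconds([(0, 0, 0), (0, 0, 0), (120, 0, 0), (130, 140, 150)])
print(result)  # (2, 1)
```

The disturbance and recovery windows are measured the same way, by looking at when the counters settle back to a steady rate.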
What was tested?
All the tested systems have something in common: they are distributed consistent databases (key-value storages) which tolerate the failure of a minority of nodes (one node in a three-node cluster).
This testing suite uses a three-node configuration with a fourth node acting as a client. The client spawns three threads (coroutines). Each thread opens a connection to one of the three DB nodes and, in a loop, reads, increments, and writes back a value.
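The read-increment-write loop can be sketched as follows. This is a simplified illustration, not the suite's actual code: the `Connection` class here is a hypothetical stand-in backed by an in-memory dict so the sketch runs without a real database, and the iteration cap replaces the suite's endless loop.

```python
import threading

class Connection:
    """Hypothetical stand-in for a DB client connection;
    backed by a shared in-memory dict for illustration."""
    def __init__(self, store, lock):
        self.store, self.lock = store, lock
    def read(self, key):
        with self.lock:
            return self.store.get(key, 0)
    def write(self, key, value):
        with self.lock:
            self.store[key] = value

def client_loop(conn, key, iterations, stats):
    # In the real suite this loop runs until the test ends; the
    # iteration cap only keeps the sketch finite.
    for _ in range(iterations):
        value = conn.read(key)
        conn.write(key, value + 1)
        stats[key] += 1  # successful increments, aggregated per second

store, lock = {}, threading.Lock()
stats = {"node1": 0, "node2": 0, "node3": 0}
threads = [threading.Thread(target=client_loop,
                            args=(Connection(store, lock), k, 100, stats))
           for k in stats]
for t in threads: t.start()
for t in threads: t.join()
print(stats)  # each thread performed 100 increments on its own key
```

In the real suite each thread also counts errors separately, which is what feeds the `:err` columns in the statistics below.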
Once a second it dumps aggregated statistics in the following form:

```
#legend: time|gryadka1|gryadka2|gryadka3|gryadka1:err|gryadka2:err|gryadka3:err
1   128  175  166  0  0  0  2018/01/16 09:02:41
2   288  337  386  0  0  0  2018/01/16 09:02:42
...
18  419  490  439  0  0  0  2018/01/16 09:02:58
19  447  465  511  0  0  0  2018/01/16 09:02:59
```
The first column is the number of seconds since the beginning of the experiment; the next three columns are the number of increments per second for each node of the cluster; the next triplet is the number of errors per second; and the last columns are the wall-clock time.
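A line in that format splits cleanly on whitespace, so parsing it back into the fields described above is straightforward. A small sketch (the function name and returned dict layout are my own, not part of the suite):

```python
# Parse one line of the aggregated statistics, assuming the layout
# described above: seconds counter, three increment counts,
# three error counts, then a wall-clock timestamp (date + time).
def parse_stat_line(line):
    parts = line.split()
    return {
        "second": int(parts[0]),
        "increments": tuple(int(p) for p in parts[1:4]),
        "errors": tuple(int(p) for p in parts[4:7]),
        "time": " ".join(parts[7:9]),
    }

row = parse_stat_line("2 288 337 386 0 0 0 2018/01/16 09:02:42")
print(row["increments"])  # (288, 337, 386)
print(row["time"])        # 2018/01/16 09:02:42
```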
In the case of MongoDB and Gryadka you can't control which node a client connects to and have to specify the addresses of all nodes in the connection string, so each DB replica had a colocated client (running the read-modify-write loop), while the fourth node was only responsible for aggregating the statistics and dumping them every second.
During a test, I isolated one of the DB's nodes from its peers and observed how it affected the cluster. Thanks to Docker Compose, the tests are reproducible with just a couple of commands; navigate to a DB's subfolder to see the instructions.
So, is this just a bunch of numbers that depend on default parameters, such as the leader election timeouts of the various systems?

Kind of, but there are still a lot of interesting patterns behind the numbers.
Some systems have downtime on recovery and some don't; TiDB has partial downtime on recovery while others don't; Gryadka doesn't even have downtime on isolation. So testing the "defaults" helps to understand the fundamental properties of the replication algorithms each system uses.
Why are there three records for MongoDB?

I didn't manage to achieve stable results with MongoDB. Sometimes it behaved like any other leader-based system: isolation of the leader led to a temporary cluster-wide downtime and then to two steadily working nodes, and once connectivity was restored all three nodes worked as usual. But sometimes it had issues:
- The downtime lasted almost two minutes, and once it ended the RPS dropped from 282 before the isolation started to less than one.
- A client on an isolated node couldn't connect to the cluster even after connectivity was restored.
Why are there two records for TiDB?

When TiDB starts, it works in a "warm-up" mode; from a client's perspective this mode is indistinguishable from the "normal" mode, so there are two records.
If a node becomes isolated during the "warm-up" mode, the whole cluster becomes unavailable. Moreover, it doesn't recover even after connectivity is restored. I filed a corresponding bug: https://github.com/pingcap/tidb/issues/2676.
N.B. While TiDB is in the "warm-up" mode, the replication factor is one, so it's possible to lose acknowledged data if a single node dies.
What is Gryadka?
Gryadka is an experimental key-value storage. I created it as a demonstration that Paxos isn't as hard as its reputation suggests. Be careful! It isn't production-ready; the best way to use it is to read the sources (less than 500 lines of code) and write your own implementation.