Tillered Docs | Clustering

How Arctic peers discover each other, establish trust, and converge on shared state

Arctic peers form a cluster without a central coordinator. Each peer learns about the others through gossip, proves who it is through signed handshakes, and merges shared state on every heartbeat.

Cluster identity

A cluster is anchored to a CustomerID that comes from the license. Two peers merge state only when their CustomerIDs match, so the license boundary is also the cluster boundary. Renewing or upgrading a license does not change the CustomerID, which means the cluster keeps its identity across license changes.

A separate ClusterID (prefix clu_) is minted the first time a cluster bootstraps. If two halves of a cluster that were running independently later meet, the convergence rule keeps the older ClusterID (the smaller ULID wins) so the merged cluster ends up with a single identity.
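The convergence rule can be sketched in a few lines. This is a minimal illustration, not Arctic's implementation: the function name converge_cluster_id is hypothetical, and it assumes the part after clu_ is a standard 26-character ULID, whose string order matches creation order.

```python
PREFIX = "clu_"


def converge_cluster_id(local_id: str, remote_id: str) -> str:
    """Return the ClusterID both halves keep: the older (smaller) ULID.

    Hypothetical helper for illustration only.
    """
    for cluster_id in (local_id, remote_id):
        if not cluster_id.startswith(PREFIX):
            raise ValueError(f"not a ClusterID: {cluster_id!r}")
    # A ULID starts with its timestamp, so comparing the strings
    # (normalized to one case) compares creation times.
    return min(local_id, remote_id, key=lambda cid: cid[len(PREFIX):].upper())
```

Because both halves apply the same deterministic rule, they reach the same answer without having to negotiate.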

Every peer carries its own identity as well: a ULID (prefix peer_) and an Ed25519 keypair. The private half stays on the host, and the peer signs its announcements and its requests to other peers with it. Signatures, not the version numbers attached to records, are what give a peer authority over its own data.

Discovery is not trust

Peers find each other through gossip, but learning that a peer exists is a separate step from trusting it. A newly discovered peer is a name and an address until a handshake proves it holds the right key.

Handshake

When a peer discovers another, it runs a challenge-response handshake to establish mutual trust:

1. The initiator sends a 32-byte random challenge.
2. The initiator signs a message built from the challenge, its own peer ID, and the license ID. The responder signs a message built from the challenge, both peer IDs, and the license ID.
3. Each side verifies the other's signature against the public key it holds for that peer.

If verification fails, the handshake is retried up to three times. After that the peer is marked unreachable and left alone until something changes. The CustomerID equality check happens slightly later, when gossip merges the peer record.
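The sketch below shows one way the signed messages and the retry bookkeeping could fit together. It is an illustration under stated assumptions: the function names, the "|" field separator, and the "trusted" and "unreachable" labels are hypothetical, and "retried up to three times" is read as one initial attempt plus three retries. Ed25519 signing and verification are left to whatever library the caller uses and are represented here by the attempt callable.

```python
import secrets
from typing import Callable

MAX_RETRIES = 3  # assumption: retries allowed after the first attempt


def new_challenge() -> bytes:
    """Generate the 32-byte random challenge sent by the initiator."""
    return secrets.token_bytes(32)


def initiator_message(challenge: bytes, initiator_id: str, license_id: str) -> bytes:
    """Bytes the initiator signs: challenge, its own peer ID, license ID."""
    # Hypothetical encoding; the real wire format is not documented here.
    return b"|".join([challenge.hex().encode(), initiator_id.encode(), license_id.encode()])


def responder_message(challenge: bytes, initiator_id: str,
                      responder_id: str, license_id: str) -> bytes:
    """Bytes the responder signs: challenge, both peer IDs, license ID."""
    return b"|".join([challenge.hex().encode(), initiator_id.encode(),
                      responder_id.encode(), license_id.encode()])


def handshake_with_retry(attempt: Callable[[], bool]) -> str:
    """Run attempt() until it verifies or the retries are used up."""
    for _ in range(1 + MAX_RETRIES):
        if attempt():
            return "trusted"
    return "unreachable"
```

Binding both peer IDs and the license ID into the responder's message means a signature captured in one handshake cannot be replayed in another pairing.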

Vouchers

A handshake assumes each side already knows the other's public key. A brand-new peer joining an established cluster does not have that key yet. Vouchers bridge the gap with transitive trust: if A trusts B and B vouches for C, then A accepts C.

A voucher is a short-lived signed statement, with a 24-hour TTL by default. It names the vouched peer, carries that peer's public key, and is signed by the vouching peer. Because the public key is inside the signed record, a verifier that already trusts the signer can extract and trust the key without a direct handshake first.

A peer attaches X-Voucher and X-Voucher-From headers to its requests until it completes a direct handshake. After that, it authenticates with its own signature and stops sending the voucher. A voucher is refreshed automatically once it is within 6 hours of expiring, so trust does not lapse on a long-lived peer.
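The voucher lifetime and refresh window can be pictured with a small data structure. This is a sketch, not the actual record layout: the Voucher class and its field names are hypothetical, and only the 24-hour TTL and 6-hour refresh window come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

VOUCHER_TTL = timedelta(hours=24)    # default TTL
REFRESH_WINDOW = timedelta(hours=6)  # refresh when this close to expiry


@dataclass(frozen=True)
class Voucher:
    """Hypothetical shape of a voucher; field names are assumptions."""
    vouched_peer_id: str
    vouched_public_key: bytes  # carried inside the signed record
    signer_peer_id: str
    issued_at: datetime
    signature: bytes = b""     # Ed25519 signature by the vouching peer

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + VOUCHER_TTL


def is_expired(voucher: Voucher, now: datetime) -> bool:
    """True once the TTL has fully elapsed."""
    return now >= voucher.expires_at


def needs_refresh(voucher: Voucher, now: datetime) -> bool:
    """True when the voucher is inside the refresh window or already expired."""
    return voucher.expires_at - now <= REFRESH_WINDOW
```

With the defaults, a voucher issued at midnight becomes eligible for refresh at 18:00 and expires at midnight the following day.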

Gossip and convergence

Cluster state spreads by push. There is no endpoint to pull the full registry; instead every heartbeat carries the sender's state and the receiver merges it. Heartbeats fire on a default 60-second interval, which you can override with CLUSTER_HEARTBEAT_INTERVAL_SECONDS.

Each heartbeat carries:

signed peer announcements
signed service announcements
peer-connectivity records
a SHA-256 hash of the peer registry and of the service registry, used to detect drift

When the hashes disagree, peer and service differences resolve through the push merge that is already happening. Only a license difference triggers a pull, where the peer fetches the newer license from whoever holds it.
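Drift detection by hash might look like the following. It is a minimal sketch: the canonical JSON serialization, the function names, and the "peers" and "services" keys are assumptions made for illustration; only the use of SHA-256 over each registry comes from the text.

```python
import hashlib
import json
from typing import Dict, List


def registry_hash(registry: dict) -> str:
    """SHA-256 over a canonical serialization of a registry.

    Sorting keys makes the hash independent of insertion order
    (the serialization format itself is an assumption).
    """
    canonical = json.dumps(registry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_registries(local_peers: dict, local_services: dict,
                       remote_hashes: Dict[str, str]) -> List[str]:
    """Return the names of registries whose hash differs from the sender's."""
    local_hashes = {
        "peers": registry_hash(local_peers),
        "services": registry_hash(local_services),
    }
    return [name for name in ("peers", "services")
            if local_hashes[name] != remote_hashes.get(name)]
```

A non-empty result only signals that the two sides differ. The records themselves arrive through the announcements carried in the same heartbeat.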

Single-writer authority

Each replicated record has exactly one legitimate writer. A service is owned by its source peer; a peer record is owned by the peer itself. Only the owner may change a record, and it signs every announcement.

The merge rule is a plain version comparison: a higher version replaces a lower one. Authority comes from the signature, not the version, so a receiver verifies that an announcement is signed by the record's rightful owner before applying it. An unsigned or wrongly signed announcement is rejected no matter how high its version claims to be, which stops a non-owner from rewriting another peer's data.

This is a source-peer-authoritative model. It is not a CRDT, vector-clock, or delta-state scheme, and you do not need to reason about concurrent writers fighting over a record: only the owner ever writes.
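The merge rule described in this section reduces to two checks applied in order: signature first, version second. The sketch below is illustrative only. The Announcement fields and the merge function are hypothetical, and signature verification is passed in as a callable that answers whether the announcement is signed by the record's rightful owner.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Announcement:
    """Hypothetical shape of a signed announcement."""
    record_id: str
    owner_peer_id: str
    version: int
    payload: bytes
    signature: bytes


def merge(current: Optional[Announcement],
          incoming: Announcement,
          is_signed_by_owner: Callable[[Announcement], bool]) -> Optional[Announcement]:
    """Return the record to keep after seeing an incoming announcement."""
    # Authority check first: a bad signature loses regardless of version.
    if not is_signed_by_owner(incoming):
        return current
    # Plain version comparison: a strictly higher version replaces a lower one.
    if current is None or incoming.version > current.version:
        return incoming
    return current
```

Checking the signature before the version is what keeps a forged announcement with an inflated version number from displacing the owner's record.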

Deletions and tombstones

A record is deleted only by its owner, which produces a signed deletion proof. The deleted record becomes a tombstone rather than disappearing, so the deletion propagates the same way an update does. The signature on the tombstone stops a stale copy elsewhere from resurrecting the record. Tombstones are pruned 30 days after the deletion.
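Pruning by age is simple to picture. The sketch below assumes tombstones are tracked as a mapping from record ID to deletion time, which is a hypothetical representation; only the 30-day retention comes from the text.

```python
from datetime import datetime, timedelta
from typing import Dict

TOMBSTONE_RETENTION = timedelta(days=30)


def prune_tombstones(tombstones: Dict[str, datetime],
                     now: datetime) -> Dict[str, datetime]:
    """Keep only tombstones deleted less than 30 days before now."""
    return {
        record_id: deleted_at
        for record_id, deleted_at in tombstones.items()
        if now - deleted_at < TOMBSTONE_RETENTION
    }
```

The retention period needs to outlast the longest time a peer might be offline. If a peer returns after the tombstone is gone, nothing remains to contradict its stale copy.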

Eventual consistency

Writes apply locally right away and propagate to other peers over the next few heartbeats. All peers converge on the same state without a central coordinator. A read on one peer can briefly return state older than a write that just landed on another peer, which is the normal tradeoff for a coordinator-free design.

Deployment modes

A cluster can run fully decentralized or with a designated server peer.

Decentralized

With no server block in your compose configuration, every peer is equal. Each one accepts operator requests, participates in gossip, and can host services. This is the default and suits clusters where any peer is a valid entry point.

Designated server peer

Setting server.peer names one peer as the entry point for centralized operations. An optional server.fallback_peer provides a recovery path if the primary server peer is unavailable. The fallback is a recovery mechanism, not a load balancer: traffic does not spread across the two.

The server.features flags (webui, stun) are reserved in v1 and have no effect yet.

Internal-only peers

Each peer has an api_access setting. The default, full, exposes the user-facing operator endpoints. Setting it to internal makes the peer reject user-facing requests; such a peer participates in the cluster but is reachable only through a recovery token. Use this for peers that should carry traffic without being an operator entry point.

Cluster credential

Peers authenticate operator-style requests to each other using a shared OAuth credential that is replicated like any other cluster record. Any peer can rotate it, and the higher version wins cluster-wide.

Rotation keeps the previous secret valid for a 24-hour propagation window so gossip has time to reach every peer before the old secret stops working. Redistribute the new secret to your operators within that window or some will be locked out. See Credential management for the full procedure.
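The propagation window can be expressed as a check on the presented secret. This is a sketch under assumptions: the ClusterCredential fields and the secret_is_accepted function are hypothetical, and only the 24-hour window comes from the text.

```python
import hmac
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

PROPAGATION_WINDOW = timedelta(hours=24)


@dataclass(frozen=True)
class ClusterCredential:
    """Hypothetical shape of the replicated credential record."""
    version: int
    secret: str
    rotated_at: datetime
    previous_secret: Optional[str] = None


def secret_is_accepted(cred: ClusterCredential, presented: str, now: datetime) -> bool:
    """Accept the current secret, or the previous one inside the window."""
    # compare_digest avoids leaking match position through timing.
    if hmac.compare_digest(presented.encode(), cred.secret.encode()):
        return True
    if cred.previous_secret is None:
        return False
    within_window = now - cred.rotated_at < PROPAGATION_WINDOW
    return within_window and hmac.compare_digest(
        presented.encode(), cred.previous_secret.encode()
    )
```

Once the window closes, only the current secret passes, which is why operators need the new secret before then.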

Troubleshooting

Peers not syncing

# Check this peer's health
arctic health

# Force a gossip round
arctic cluster sync

# Inspect gossip activity
journalctl -u arctic | grep gossip

State divergence

Verify network connectivity between the peers.
Check for clock skew; running NTP on every host is recommended.
Trigger a manual sync with arctic cluster sync.
Review the agent logs for handshake or merge errors.
