Troubleshooting

Diagnose and resolve common Arctic connectivity, handshake, and configuration problems

This page groups the problems operators hit most often into three areas: connectivity between agents, peer handshake failures, and configuration changes that do not take effect. Each entry lists the symptoms, the commands that narrow down the cause, and the fix.

Arctic v1.4.0 runs its TProxy and IP-tunnel data planes in-process and commits firewall rules straight to the kernel over netlink. There is no separate proxy daemon to restart, no .nft file on disk to edit, and no kernel WireGuard interface to inspect. The commands below reflect that: you read kernel state to diagnose, but the agent is the only thing that writes it.

Connectivity

Agent not responding

Symptoms: curl http://AGENT_IP:8080/livez times out or is refused, and CLI commands fail with a connection error.

Check the service first:

systemctl status arctic

If it is not running, start it and read the recent log:

systemctl start arctic
journalctl -u arctic -n 50

If the service is up, confirm the agent is listening on the API port:

ss -tlnp | grep 8080

A healthy agent shows a listener owned by the arctic process:

LISTEN  0  4096  *:8080  *:*  users:(("arctic",...))

If the process is listening but the host still refuses the connection, a host firewall is the usual cause. These commands inspect the host's own firewall, which is separate from the tables Arctic manages:

# nftables
nft list ruleset | grep 8080

# iptables
iptables -L INPUT -n | grep 8080

# firewalld
firewall-cmd --list-ports

Open TCP 8080, or stop the conflicting firewall (see Prerequisites for the recommended host setup). If another process already holds port 8080, free it or set API_PORT to a different value before starting the agent.
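The checks above can be run in order from a single helper. The following is a minimal sketch, assuming it runs on the agent host with the default API port; diagnose_agent is a name made up for this example and is not an Arctic command.

```shell
# Hypothetical helper: run the three checks from this section in order and
# stop at the first one that fails. Run it on the agent host.
diagnose_agent() (
  host="${1:-127.0.0.1}"
  port="${2:-8080}"

  # 1. Is the systemd service running?
  if ! systemctl is-active --quiet arctic; then
    echo "service: arctic is not active"
    return 1
  fi

  # 2. Is anything listening on the API port?
  if ! ss -tln | grep -qE ":${port}[[:space:]]"; then
    echo "listener: nothing is listening on TCP ${port}"
    return 2
  fi

  # 3. Does the liveness endpoint answer? A failure here with a live
  #    listener usually points at a host firewall.
  if ! curl -fsS --max-time 5 "http://${host}:${port}/livez" >/dev/null 2>&1; then
    echo "api: /livez did not answer on ${host}:${port}"
    return 3
  fi

  echo "ok: agent answers on ${host}:${port}"
)
```

Call it as diagnose_agent AGENT_IP 8080. The exit status (1, 2, or 3) identifies the step that failed, which maps directly onto the fixes described in this section.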

Peers cannot communicate

Symptoms: handshakes fail, heartbeats do not arrive, or peers show as unhealthy.

Test the API path from one host to the other:

curl http://PEER_IP:8080/livez

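The IP tunnel carries non-TCP traffic over UDP 51840. Confirm that path is open as well. UDP has no connection setup, so nc reports a problem only when the remote host actively rejects the packet; silence means either that the port is open or that a firewall is dropping the traffic without a reply: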

nc -u PEER_IP 51840

If either fails, look at the route between the hosts:

traceroute PEER_IP
mtr PEER_IP

Both TCP 8080 and UDP 51840 must be reachable in both directions. When agents sit on different networks, check for NAT in the path and verify routing between the subnets.

Traffic is not being routed

Symptoms: a service exists but traffic does not flow, or packets are not being picked up by the proxy.

Confirm the service and its routes:

arctic services list
arctic services get SERVICE_ID

Inspect the firewall tables the agent commits to the kernel. TCP classification lives in the arctic table; tunnel marking lives in arctic_iptun:

nft list table inet arctic
nft list table inet arctic_iptun

You should see rules matching the source and destination CIDRs of your routes. If they are missing, check the firewall reconciler's log:

journalctl -u arctic | grep 'reconciler=firewall'

Resolution: verify the route CIDRs match the traffic you expect, force a cluster sync with arctic cluster sync, and confirm the source peer of the service is the agent you are testing from.
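When checking route CIDRs against the traffic you expect, it can help to test a sample address against each CIDR. The following is a minimal IPv4-only sketch in plain shell arithmetic. The function names in_cidr and ip_to_int are illustrative and not part of Arctic.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed when the IPv4 address $1 falls inside the CIDR $2.
in_cidr() (
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits="${2#*/}"
  # Build the netmask: the top $bits bits set, the rest clear.
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
)
```

For example, in_cidr 10.1.2.3 10.1.0.0/16 succeeds and in_cidr 10.2.0.1 10.1.0.0/16 fails. Run it for both the source and the destination address of a test flow against the CIDRs shown by arctic services get SERVICE_ID.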

MACVLAN interface not created

Symptoms: a service sets requires_interface but no interface appears, or the interface exists without an address.

List interfaces and addresses:

ip link show
ip addr show

A service interface is named from the service ID, truncated to the kernel's 15-character limit, so look for a device matching the start of your service ID.
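To avoid scanning the interface list by eye, you could filter it for the truncated name. This sketch assumes the device name is exactly the first 15 characters of the service ID, with no prefix or suffix. If your agent names devices differently, adjust the comparison. match_service_iface is a hypothetical helper name.

```shell
# Read `ip -o link show` output on stdin and print the interface whose name
# equals the first 15 characters of the given service ID.
# Assumption: the agent uses a plain truncation with no prefix or suffix.
match_service_iface() (
  want=$(printf '%s' "$1" | cut -c1-15)
  # Field 2 is the device name; MACVLAN devices show up as name@parent.
  awk -F': ' '{ print $2 }' | sed 's/@.*//' | grep -Fx -- "$want"
)
```

Usage: ip -o link show | match_service_iface SERVICE_ID. No output and a non-zero exit status mean no device with that name exists.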

Check the network reconciler's log for the reason it was skipped or failed:

journalctl -u arctic | grep 'reconciler=network'

Resolution: confirm the host has a suitable parent interface, that the agent has CAP_NET_ADMIN (the systemd unit grants it), and that the interface name does not collide with an existing device.

DNS resolution

Symptoms: agents are unreachable by hostname, or lookups fail inside tunneled traffic.

nslookup HOSTNAME
dig HOSTNAME

Resolution: verify the host's resolvers, decide whether DNS should travel through Arctic at all, and add a route for the DNS server's IP if it should.

High latency

Symptoms: traffic through Arctic is slow, or round-trip times are high.

Compare a direct path against the tunneled path, and look for loss:

ping PEER_IP
mtr DESTINATION

Check whether a bandwidth limit is shaping the service:

arctic services get SERVICE_ID

Resolution: raise or remove bandwidth_limit_mbps if it is set too low, consider KCP transport on lossy links or for short, interactive flows, and rule out congestion on the underlying network.

Collecting debug information

When you open a support ticket, attach the output of:

arctic version
systemctl status arctic
journalctl -u arctic -n 100
ip addr show
ip route show
nft list table inet arctic
nft list table inet arctic_iptun
arctic peers list
arctic services list
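One way to gather all of that output in a single file is a small wrapper like the one below. It is a sketch, not an official tool: collect_debug and the default file name arctic-debug.txt are made up for this example, and the command list is copied from the list above.

```shell
# Hypothetical helper: run each diagnostic command and append its output,
# under a header line, to one file that can be attached to a ticket.
collect_debug() (
  out="${1:-arctic-debug.txt}"
  : > "$out"
  while IFS= read -r cmd; do
    [ -n "$cmd" ] || continue
    printf '===== %s =====\n' "$cmd" >> "$out"
    # Keep going when a command fails so one error does not hide the rest.
    eval "$cmd" </dev/null >> "$out" 2>&1 ||
      printf '(exit status %s)\n' "$?" >> "$out"
  done <<'EOF'
arctic version
systemctl status arctic
journalctl -u arctic -n 100
ip addr show
ip route show
nft list table inet arctic
nft list table inet arctic_iptun
arctic peers list
arctic services list
EOF
  printf '%s\n' "$out"
)
```

Run it as root so the nft commands can read the tables. Review the file before sending it, since logs and addresses may contain details you do not want to share.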

Handshake failures

How a handshake works

When you add a peer, the two agents run a challenge-response handshake before they trust each other:

  1. The initiator sends a 32-byte random challenge.
  2. Both sides sign a message built from the challenge, the peer IDs, and the license ID.
  3. Each side verifies the other's signature against the public key it holds.

Only after this succeeds do the peers accept each other's gossip. A failed handshake is retried up to three times, after which the peer is marked unreachable until something changes. See Clustering for the full trust model.

Common errors

Connection refused

Error: handshake failed: connection refused

The TCP connection to the remote agent could not be made. Confirm the remote agent is up (curl http://REMOTE_IP:8080/livez), that the host is reachable (ping REMOTE_IP), and that TCP 8080 is open.

Connection timeout

Error: handshake failed: connection timeout

A path exists but the connection does not complete. Look for a firewall dropping packets, a NAT in the way, or the remote agent listening on a different interface than the one you are reaching.

License mismatch

Error: handshake failed: license mismatch

The two agents were bootstrapped with licenses that carry different customer identities, so they refuse to join the same cluster. Compare the license on each agent:

# Local agent
arctic license status

# Remote agent
arctic license status --url http://REMOTE_IP:8080

If they differ, re-bootstrap one agent with the correct license.

Invalid signature

Error: handshake failed: invalid signature

The peer's signature did not verify against the public key this agent holds for it. This points to a corrupted or replaced peer key. Re-bootstrap the affected agent, and contact support if it recurs.

Peer already exists

Error: peer already exists in cluster

The peer is already in the cluster. List peers to confirm:

arctic peers list

If you genuinely need to re-add it, delete it first:

arctic peers delete PEER_ID --yes

Node limit exceeded

Error: handshake failed: node limit exceeded

The license caps the number of nodes and the cluster is at that cap. Check the limit:

arctic license status

Remove unused peers to free a slot, or switch to a license with a higher node limit.

Debugging steps

Run the failing command with debug output, or trace the HTTP exchange:

arctic peers add REMOTE_IP:8080 --debug
arctic peers add REMOTE_IP:8080 --trace

Watch the logs on both agents while the handshake runs:

# Local agent
journalctl -u arctic -f

# Remote agent
ssh user@REMOTE_IP journalctl -u arctic -f

Read the remote agent's cluster identity. This endpoint needs no authentication, which makes it a quick way to confirm what cluster the remote agent thinks it belongs to:

curl http://REMOTE_IP:8080/v1/cluster/identity

A typical response looks like this:

{
  "peer_id": "peer_01HXYZ...",
  "public_key": "base64...",
  "license_id": "lic_...",
  "cluster_id": "clu_01HABC...",
  "version": "v1.4.0"
}

Confirm license_id matches the rest of your cluster. A handshake needs traffic in both directions, so test the reverse path too:

# Local to remote
curl http://REMOTE_IP:8080/livez

# Remote to local
ssh user@REMOTE_IP curl http://LOCAL_IP:8080/livez
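The license_id comparison can also be scripted against the unauthenticated identity endpoint. The sketch below parses the JSON with sed so that it does not depend on extra tools. It assumes the response has the shape shown above, and the helper names identity_field and same_license are hypothetical.

```shell
# Print the value of one top-level string field from JSON read on stdin.
identity_field() (
  field="$1"
  sed -n 's/.*"'"$field"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    head -n 1
)

# Succeed when two agents report the same, non-empty license_id.
same_license() (
  a=$(curl -fsS "http://$1:8080/v1/cluster/identity" | identity_field license_id)
  b=$(curl -fsS "http://$2:8080/v1/cluster/identity" | identity_field license_id)
  [ -n "$a" ] && [ "$a" = "$b" ]
)
```

For example, same_license LOCAL_IP REMOTE_IP || echo "license_id differs". The sed pattern handles only simple string values, which is enough for the identifiers in this response.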

Firewall requirements

Port    Protocol  Direction      Purpose
8080    TCP       Bidirectional  Operator API and peer handshake
51840   UDP       Bidirectional  IP tunnel (non-TCP traffic)

If agents sit behind NAT, forward TCP 8080 to each agent, give the public address when you add the peer, and keep the mapping stable.
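On a host that uses firewalld, the two ports could be opened as in the sketch below. It assumes the default zone is the one facing your peers; open_arctic_ports is a hypothetical helper name. Hosts that use plain nftables or iptables need the equivalent rules in their own rule set.

```shell
# Open the two Arctic ports on a firewalld host and make the change permanent.
# Assumption: the default zone is the one that faces the other agents.
open_arctic_ports() {
  firewall-cmd --permanent --add-port=8080/tcp &&
    firewall-cmd --permanent --add-port=51840/udp &&
    firewall-cmd --reload
}
```

Run it as root, then confirm the result with firewall-cmd --list-ports.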

Recovery steps

If handshakes keep failing after the checks above, restart both agents:

systemctl restart arctic

As a last resort, re-bootstrap an agent. This drops its local database and all state stored only on that node:

systemctl stop arctic
rm /opt/tillered/arctic.db
systemctl start arctic
arctic bootstrap --url http://localhost:8080 --license-file license.json

If the problem survives a re-bootstrap, contact support.
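Because a re-bootstrap discards the local database, you may want a copy of it first, for example to hand to support. The sketch below stops the agent and copies the file aside. backup_arctic_db is a hypothetical helper, and the sketch assumes the database is the single file at the path used above. If SQLite sidecar files sit next to it, copy those as well.

```shell
# Hypothetical helper: stop the agent and copy its database aside before a
# re-bootstrap removes it. Prints the path of the copy.
backup_arctic_db() (
  db="${1:-/opt/tillered/arctic.db}"
  backup="${db}.$(date +%Y%m%d%H%M%S).bak"
  # Stop first so the file is not being written while it is copied.
  systemctl stop arctic || return 1
  cp -p "$db" "$backup" || return 1
  echo "$backup"
)
```

The helper leaves the original file in place and the agent stopped, so you can continue with the rm, start, and bootstrap commands exactly as listed.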

Configuration not applied

How configuration is applied

A change you make through the API or arctic compose apply does not touch the kernel directly. It flows like this:

  1. The change is written to the agent's SQLite database.
  2. The write fires an event, and the relevant reconciler wakes up.
  3. The reconciler computes the desired state and applies it: the network reconciler manages MACVLAN interfaces, the firewall reconciler commits nftables rules over netlink, and the TProxy and IP-tunnel reconcilers push fresh config into their in-process engines.

There are no generated config files and nothing to reload by hand. If applied state drifts from the database, the fix is to get the reconciler to run again, not to edit a file.

Symptoms

  • A service was created but traffic is not routed.
  • Routes were updated but the old routing still applies.
  • A bandwidth limit is not taking effect.
  • A requires_interface service has no interface.

Diagnosis

Force a cluster sync, either from the CLI or with the equivalent API call, and give it a few seconds:

arctic cluster sync
curl -X POST http://AGENT_IP:8080/v1/cluster/sync \
  -H "Authorization: Bearer $TOKEN"

Read the reconciler logs. Each reconciler tags its log lines with a reconciler field, so you can filter to the one you care about:

journalctl -u arctic | grep -E 'reconciler=(network|firewall|tproxy|iptun)'
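A quick summary of which reconcilers have logged anything can show whether one of them never ran. This sketch relies only on the reconciler=NAME tag described above and makes no assumption about the rest of the log line. reconciler_counts is a hypothetical helper name.

```shell
# Read agent log lines on stdin and print how many lines each reconciler wrote.
reconciler_counts() {
  grep -oE 'reconciler=(network|firewall|tproxy|iptun)' |
    sort | uniq -c | awk '{ print $2, $1 }'
}
```

Usage: journalctl -u arctic --since "10 minutes ago" | reconciler_counts. A reconciler that is missing from the output wrote nothing in that window, so start with its log.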

Inspect the kernel state the agent should have produced:

# Firewall classification and tunnel marking
nft list table inet arctic
nft list table inet arctic_iptun

# Service interfaces
ip link show

Common issues

Firewall rules missing

Symptoms: nft list table inet arctic does not show the rules you expect.

The agent owns these tables and rewrites them on every reconcile; you do not load them yourself. Look for an error in the firewall reconciler's log, then restart the agent to force a full rebuild:

journalctl -u arctic | grep 'reconciler=firewall'
systemctl restart arctic

TProxy engine not applying config

Symptoms: TCP routing reflects old service definitions.

Check the TProxy reconciler and engine log, then restart to re-apply from a clean state:

journalctl -u arctic | grep 'reconciler=tproxy'
systemctl restart arctic

IP tunnel not applying config

Symptoms: tunnels for non-TCP traffic are not established.

The tunnel runs inside the agent, so there is no separate interface or service to inspect. Confirm UDP 51840 is reachable between the peers, then read the reconciler log:

nc -u PEER_IP 51840
journalctl -u arctic | grep 'reconciler=iptun'

Restart the agent if the log shows the engine failing to start or apply.

MACVLAN interface missing

Symptoms: a requires_interface service has no interface.

journalctl -u arctic | grep 'reconciler=network'

Confirm the parent interface exists and that the interface name does not collide with an existing device.

Applied state does not match the database

Sometimes the database holds the right data but the kernel does not reflect it. Confirm what the database actually contains:

arctic services list --json
arctic routes list --service SERVICE_ID --json

If the data is correct, the reconciler either errored or never ran. Restart the agent to force every reconciler through a full pass:

systemctl restart arctic

How long changes take

A change normally applies within a second or two: the database write fires an event and the reconciler runs immediately. As a backstop, every core reconciler also resyncs on a 60-second timer, so a dropped event still self-corrects within a minute. Cluster-wide changes additionally need a gossip round to reach other peers; arctic cluster sync forces that round instead of waiting for the next heartbeat.
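Given those timings, a script that makes a change can poll the kernel state instead of sleeping for a fixed time. The following sketch waits for a string, such as a route CIDR, to appear in the arctic table. wait_for_rule is a hypothetical helper, and the 70-second default is an assumption chosen to sit just past the 60-second resync.

```shell
# Hypothetical helper: poll the arctic table until a fixed string shows up.
#   $1 = string to look for, $2 = timeout in seconds, $3 = poll interval
wait_for_rule() (
  pattern="$1"
  timeout="${2:-70}"
  interval="${3:-2}"
  waited=0
  while [ "$waited" -le "$timeout" ]; do
    if nft list table inet arctic 2>/dev/null | grep -qF -- "$pattern"; then
      echo "found after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "not found within ${timeout}s" >&2
  return 1
)
```

For example, wait_for_rule 10.1.0.0/16 after creating a route for that CIDR. If the call times out, the backstop resync has also passed, so check the firewall reconciler's log rather than waiting longer.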

Collecting debug information

For a configuration problem, attach the output of:

journalctl -u arctic --since "10 minutes ago"
arctic services list --json
arctic routes list --service SERVICE_ID --json
nft list table inet arctic
nft list table inet arctic_iptun
systemctl status arctic

See also

  • Upgrades - upgrading agents and the CLI
  • Recovery - break-glass access and clearing a stuck cluster lock
  • Clustering - the trust and gossip model behind handshakes
