Kona

What's under the hood

If you run a rescue on your own server and are thinking about moving to Kona, this page is for you. It's a plain accounting of the infrastructure the site runs on, the redundancy layers that sit behind it, and the gaps that are still being worked on. No marketing numbers. Written for people who already know what half of these words mean.

Compute

The site runs on a 3-node Kubernetes cluster (k3s) with embedded etcd. All three nodes are control-plane plus etcd members, so losing any one of them leaves the cluster writable and the apps running. They live on three separate physical hosts, two of which are Proxmox hypervisors. The third is a bare-metal box that also hosts a local LLM for the AI-assisted features. Each node is sized to run the full public and hub workload on its own, so failover from either of the other two doesn't degrade service beyond brief rescheduling latency.

Postgres is pinned to the primary node by default via nodeAffinity, which keeps the hot write path off the network and on local NVMe. That's a latency choice, not a capacity constraint: if the primary is taken offline for maintenance or a hardware event, Postgres reschedules onto another node and the site keeps serving. Web pods float by default. Traefik runs as a DaemonSet-style service across the cluster so the front-door proxy isn't pinned to any one node either.
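
For the curious, a minimal sketch of what that pin looks like, assuming a soft (preferred) nodeAffinity keyed on a hypothetical node label; the names here are placeholders, not the actual manifest:

```yaml
# Sketch only: prefer the primary node, but stay schedulable elsewhere
# so Postgres can fail over when that node is down. The label
# kona.example/db-primary and all names are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kona.example/db-primary
                    operator: In
                    values: ["true"]
      containers:
        - name: postgres
          image: postgres:16
```

The preferred (rather than required) form is what makes the failover automatic: a hard requirement would leave the pod pending instead of letting it land on another node.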

Local AI

Kona's AI-assisted features (drafting adoption bios from photos and foster notes, writing alt-text photo descriptions, reviewing applicant free-text answers and surfacing red flags, drafting interview questions tailored to a specific applicant and a specific dog, proposing social-media post copy, extracting structured fields from uploaded vet documents, and the in-app chat assistant that can reason against an individual dog's record) all run on local GPUs on the same rack as the rest of the stack. No adopter data, no applicant answers, and no dog photos are sent to a cloud AI API. That's an intentional constraint. Model weights and the inference runtime are both on-site.

Inference is routed through a priority- and queue-aware dispatcher across a small GPU fleet; today the dedicated tier is a single RTX 5070 Ti.

Scaling this layer is a promotion path, not a rewrite. When the 5070 Ti is no longer enough on its own, the next step is to move the rack-hosted RTX 5090 off standby and dedicate it to Kona full-time. If that ceiling is reached, the RTX PRO 6000 Blackwell follows on a temporary basis while longer-term capacity decisions get made. The scheduler, the failover, and the health checks that drive all of this are already in place; growing the dedicated tier is a configuration change, not a project.

Storage

Postgres data lives on local NVMe on the primary node, not on NFS. The commit-path fsync latency matters more than the bytes-per-second number on a rescue workload, and pulling NFS out of that path cut per-commit fsync from roughly 12 ms to roughly 3 ms. Consumer NVMe plus full-disk encryption floors at around that 3 ms, so there's no deeper well to drill without enterprise drives with power-loss protection. It hasn't been the bottleneck in practice.

Shared state that needs to outlive a single node (media uploads, backup artifacts, the in-cluster object store) lives on a ZFS pool on a separate NAS, served over NFS. The pool is RAIDZ2 (double parity), so any two drives in a vdev can fail before data is at risk, and the dataset is encrypted at rest. Kubernetes consumes it via static PersistentVolumes plus the nfs-csi driver.
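
As a rough illustration of the static-provisioning side, assuming the stock csi-driver-nfs layout (server, share, and sizes below are placeholders):

```yaml
# Sketch: a statically provisioned PV served by the NAS over NFS,
# consumed through the nfs.csi.k8s.io driver. Server, share, and
# size are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: uploads-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: nfs.csi.k8s.io
    volumeHandle: nas.internal/tank/uploads   # any cluster-unique string
    volumeAttributes:
      server: nas.internal
      share: /tank/uploads
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uploads-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""        # bind to the static PV above, no dynamic provisioning
  volumeName: uploads-pv
  resources:
    requests:
      storage: 500Gi
```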

The read path for uploaded images gets two layers of caching stacked on top of that NFS backing. Every cluster node runs the Linux kernel's FS-Cache layer with cachefilesd, mounting the upload share with fsc,nosharecache so that the first read of a given file on a node pays the NFS round-trip and every subsequent read on the same node serves from that node's local NVMe transparently. Web pods spread across nodes via soft anti-affinity so each tenant's upload mount gets its own per-node cache. In front of that, Cloudflare holds /uploads/* at the edge for thirty days under a cache rule that respects the origin's Cache-Control header, which means the common case for public photo traffic never reaches the origin at all. When the edge does miss and the node cache is cold, the NFS pull is what the visitor waits on; every layer behind that is measured and bounded.
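
In Kubernetes terms, the fsc and nosharecache flags go in the PV's mountOptions list (same shape as the volume sketched above, with cachefilesd running on each node), and the spreading half is a soft anti-affinity on the web Deployment, roughly like this (labels and image are placeholders):

```yaml
# Sketch: preferred anti-affinity so web pod replicas spread across nodes,
# which also spreads the per-node FS-Cache working set. Placeholders only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kona-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kona-web
  template:
    metadata:
      labels:
        app: kona-web
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: kona-web
      containers:
        - name: web
          image: ghcr.io/example/kona-web:latest   # placeholder image
```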

Network

The public internet only reaches the cluster through a Cloudflare Tunnel. There are no open inbound ports on the edge router, and the tunnel daemon runs as two pod replicas in the cluster so that losing one node doesn't drop the site. DNS, TLS termination, WAF, bot filtering, and rate-limiting all happen at Cloudflare before a request is ever brokered into the cluster.
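
A hedged sketch of the in-cluster half of that, assuming the token-based cloudflared setup (secret name, image tag, and the rest are placeholders):

```yaml
# Sketch: two cloudflared replicas so the tunnel survives losing a node.
# In practice they should also be spread across nodes (anti-affinity or
# topology spread). All names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cloudflared
  template:
    metadata:
      labels:
        app: cloudflared
    spec:
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest
          args: ["tunnel", "--no-autoupdate", "run"]
          env:
            - name: TUNNEL_TOKEN     # cloudflared reads the tunnel token from this env var
              valueFrom:
                secretKeyRef:
                  name: cloudflared-token
                  key: token
```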

Internally, the network is Ubiquiti gear. Each cluster node has a primary 10G SFP uplink and a backup 1G RJ45 uplink landing on a separate switch, so a switch failure or a cable pull drops the node to a degraded path rather than offline. Two gateways run as a shadow-mode failover pair across two independent WAN circuits: a symmetric fiber line as primary and a 5G connection as failover. An ISP outage, a fiber cut, or a gateway failure all cut over without manual intervention, at the cost of some bandwidth during the 5G fallback window.

Known gap: every switch in the rack currently uplinks to both gateways through a single SFP aggregation switch. A software update on that aggregation switch drops traffic for the length of its reboot. Two things soften the impact today: maintenance windows are announced in advance via the in-app notification system so rescues know when to expect a brief interruption, and the off-site standby server described below will continue to serve the core non-AI workflows during any aggregation-switch downtime. A plan to remove the aggregation switch from the critical path is the next meaningful change on the network side.

Power

The rack is on a UPS, which gives enough runtime to ride out the shorter outages that make up the great majority of utility interruptions. Longer outages are covered by a backup generator. The next incremental step is a second UPS on a separate circuit with an automatic transfer switch in front of the load, so that a UPS electronics failure is no longer a single event that takes the rack down. That hardware is in the queue; the circuit changes to accept it are the main dependency.

Backups and recovery

Three independent backup layers overlap.

The Velero path and the Postgres dump path have both been end-to-end restore-tested on this cluster: a namespace wipe, a fresh Velero restore, a pg_restore on top, and the pods come back and serve traffic.
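
For context, the Velero half of that drill is just a Restore object pointed at an existing backup; something like this, with placeholder names:

```yaml
# Sketch: restore a wiped namespace from an existing Velero backup.
# Backup name and namespace are placeholders.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: kona-restore-test
  namespace: velero
spec:
  backupName: kona-nightly-example
  includedNamespaces:
    - kona
  restorePVs: true
```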

A separate off-site backup target, synced over Syncthing to a NAS at a different location, is the next addition on this side. That NAS doesn't run a web server or take traffic; it's just cold storage for the SQL dumps and the kopia archives. After that, a small off-site standby server will come online that can serve the core rescue workflows (applications, medical records, photo uploads) at reduced capacity if the primary site is offline. It won't run the AI pieces, so AI-assisted features fall back to a simpler path there.

Security

Every storage device in the rack is encrypted at rest at the block or dataset level. Server NVMe drives use LUKS2 full-disk encryption; the NAS pool uses native ZFS dataset encryption. A drive pulled out of the rack for RMA or disposal is ciphertext without the key, and the key material lives in places that don't leave the premises.

On top of that, Kubernetes Secrets are additionally encrypted at rest in etcd (AES-CBC). The etcd encryption key is held in two places: on each k3s server, and in a sealed copy on the NAS that's independent of the cluster, so a total cluster wipe followed by an etcd-snapshot-restore can still decrypt the recovered secrets.
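
For reference, the config k3s manages when secrets encryption is on has the standard Kubernetes EncryptionConfiguration shape (k3s keeps its own copy on each server; the key below is a placeholder, a real one is 32 random bytes, base64-encoded):

```yaml
# Sketch: Secrets encrypted at rest in etcd with AES-CBC. The key here
# is a placeholder; identity stays last as the read fallback for anything
# written before encryption was enabled.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: MDEyMzQ1Njc4OWFiY2RlZjAxMjM0NTY3ODlhYmNkZWY=   # placeholder, not a real key
      - identity: {}
```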

Runtime security monitoring is handled by Falco on every node, with alerts on anything at critical priority or above forwarded through a small webhook translator to a notification channel. Each application namespace has its own NetworkPolicies so that cross-tenant traffic and unexpected egress paths get dropped at the CNI level rather than relying on application-level trust.
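
The namespace policies follow the usual default-deny-then-allow pattern; a trimmed sketch (namespace and labels are placeholders), with egress getting the same treatment for the paths each app actually needs:

```yaml
# Sketch: deny all ingress by default in the namespace, then allow
# traffic from pods in the same namespace. Names are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: rescue-a
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: rescue-a
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}    # any pod in this same namespace
  policyTypes:
    - Ingress
```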

Secrets that back the application (encryption keys for PII columns, OAuth client secrets, SMTP credentials) are stored in Kubernetes Secrets, never checked into git, and rotated if they're ever exposed. The public repository for the app itself is scanned with gitleaks before any commit that's meant to land publicly.

Observability

Prometheus and Grafana run in-cluster, scraping the kubelet, the k3s control plane where it exposes metrics, cAdvisor, and the application pods. Alertmanager routes pageable alerts through the same webhook translator mentioned above; noisy built-in rules like Watchdog are explicitly dropped so the channel only fires when something actually wants attention. Grafana's own state (users, datasources, dashboards) sits on a retained PVC that survived the most recent cluster rebuild without manual restoration.
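
The Watchdog drop is the usual Alertmanager trick of routing it to a receiver that goes nowhere; roughly this shape, with a placeholder URL and receiver names:

```yaml
# Sketch: everything pageable goes to the webhook translator; the
# always-firing Watchdog is sunk into a null receiver. URL is a placeholder.
route:
  receiver: webhook-translator
  routes:
    - matchers:
        - alertname = "Watchdog"
      receiver: "null"
receivers:
  - name: webhook-translator
    webhook_configs:
      - url: http://alert-translator.monitoring.svc:8080/alert
  - name: "null"
```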

What this setup isn't

It isn't a colocation deployment in a tier-IV data center. It isn't multi-region. It currently isn't even multi-site. Two of the three cluster nodes are VMs on two different hypervisors in the same rack, and the third node is a bare-metal box in the same rack. The network and power redundancy buys availability against the failure modes that actually happen at this scale (drives, power supplies, UPS batteries, NICs, single-circuit outages), not against catastrophic site loss. The off-site pieces called out above are what turn that corner.

If you're running a rescue on a small VPS somewhere and you're wondering whether this is better than that: probably, for the things most rescues actually need. If you're running on a well-maintained managed platform with a real SRE team, this isn't going to out-SRE them. The honest answer to most "is this as good as X" questions is "no, but it doesn't need to be for what we're doing here."