Kona

What's under the hood

If you run a rescue on your own server and are thinking about moving to Kona, this page is for you. It's a plain accounting of the infrastructure the site runs on, the redundancy layers that sit behind it, and the gaps that are still being worked on. No marketing numbers. Written for people who already know what half of these words mean.

Compute

The site runs on a 3-node Kubernetes cluster (k3s) with embedded etcd. All three nodes are control-plane plus etcd members, so losing any one of them leaves the cluster writable and the apps running. The nodes live on three separate physical hosts: two are VMs on separate Proxmox hypervisors, and the third is a bare-metal box that also carries the GPU dedicated to the AI-assisted features. Each node is sized to run the full public and hub workload on its own, so failing over from either of the others doesn't degrade service beyond brief rescheduling latency.
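The "losing any one of them" claim is ordinary majority-quorum arithmetic for a Raft-based store like etcd. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of k3s or etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting etcd members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)}")
```

Three members need two for quorum, so one can be lost. A two-member cluster tolerates none, which is why the third node matters even though each node can carry the workload alone.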

Postgres runs under the CloudNativePG operator, one independent cluster per rescue tenant. Each cluster is three instances (a primary and two streaming standbys) placed on different physical nodes, on local NVMe, with synchronous replication: an acknowledged write is on at least two nodes before the client sees the commit, so a primary failure loses no committed data, rather than merely failing over fast. If the node holding a primary dies or is taken down for maintenance, the operator promotes one of the standbys and the app reconnects through the operator-managed service; no human is in that loop. Web pods float by default. Traefik runs as a DaemonSet-style service across the cluster so the front-door proxy isn't pinned to any one node either.
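The "on at least two nodes before the client sees the commit" rule can be shown with a toy model. This is a deliberately simplified sketch, not CloudNativePG or Postgres internals: the `Instance` class, the `commit` function, and the choice to refuse a write (real Postgres would wait instead) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    node: str
    up: bool = True
    log: list = field(default_factory=list)


def commit(primary: Instance, standbys: list, record: str, min_sync: int = 1) -> bool:
    """Acknowledge only once the record is on the primary and `min_sync` standbys."""
    reachable = [s for s in standbys if s.up]
    if not primary.up or len(reachable) < min_sync:
        return False  # never acknowledge a write that only one node would hold
    primary.log.append(record)
    for standby in reachable:
        standby.log.append(record)
    return True


if __name__ == "__main__":
    primary = Instance("node-a")
    standbys = [Instance("node-b"), Instance("node-c")]
    acked = commit(primary, standbys, "INSERT application 1")
    primary.up = False  # the node holding the primary dies
    print(acked, [s.node for s in standbys if "INSERT application 1" in s.log])
```

Every acknowledged record already exists on a standby, so a promoted standby has it.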

Local AI

Kona's AI-assisted features (drafting adoption bios from photos and foster notes, writing alt-text photo descriptions, reviewing applicant free-text answers and surfacing red flags, drafting interview questions tailored to a specific applicant and a specific dog, proposing social-media post copy, extracting structured fields from uploaded vet documents, and the in-app chat assistant that can reason against an individual dog's record) all run on local GPUs on the same rack as the rest of the stack. No adopter data, no applicant answers, and no dog photos are sent to a cloud AI API. That's an intentional constraint. Model weights and the inference runtime are both on-site.

Inference is routed through a priority- and queue-aware dispatcher across a small GPU fleet:

Scaling this layer is a promotion path, not a rewrite. When the dedicated 5090 is no longer enough on its own, the next step is to dedicate the RTX PRO 6000 Blackwell on a temporary basis while longer-term capacity decisions get made. The scheduler, the failover, and the health checks that drive all of this are already in place; growing the dedicated tier is a configuration change, not a project.
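To make "priority plus queue-aware" concrete, here is a minimal sketch of one way such a dispatcher could choose a GPU and order its work. It is not Kona's scheduler: the `Gpu` class, the tier numbering, and the `max_depth` spill-over threshold are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Gpu:
    name: str
    tier: int                 # 0 = dedicated, larger = fallback (assumed scheme)
    healthy: bool = True
    queue: list = field(default_factory=list)  # heap of (priority, seq, job)


_seq = count()  # tie-breaker so equal-priority jobs stay first-in, first-out


def dispatch(fleet: list, job: str, priority: int, max_depth: int = 4) -> Gpu:
    """Queue `job` on a healthy GPU: prefer one under `max_depth`, then the
    lowest tier, then the shortest queue."""
    candidates = [g for g in fleet if g.healthy]
    if not candidates:
        raise RuntimeError("no healthy GPU available")
    target = min(
        candidates,
        key=lambda g: (len(g.queue) >= max_depth, g.tier, len(g.queue)),
    )
    heapq.heappush(target.queue, (priority, next(_seq), job))
    return target


def next_job(gpu: Gpu) -> str:
    """Pop the most urgent job (lowest priority number) from a GPU's queue."""
    return heapq.heappop(gpu.queue)[2]
```

Under this scheme, promoting another card into the dedicated tier is a change to its `tier` value, which is the sense in which growing the tier is configuration rather than new code.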

Kona also runs a self-hosted image-embedding model (CLIP-class) on the same GPUs, turning each approved dog photo into a vector stored in Postgres through the pgvector extension: visual dog matching done entirely on-site, with no third-party vision API in the path. Two features build on it: a vet-initiated care-team flow, where a treating vet can find a specific rescue dog (by photo, among other non-photo signals) and open a consent-gated link to its record and a sanctioned line of communication; and a lost-and-found search that matches an uploaded photo against the current dogs. The matching is always advisory and score-graded. It surfaces likely candidates for a human to confirm, and never asserts that two dogs are the same animal.
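The "advisory and score-graded" behavior can be sketched as ranking by cosine similarity and attaching a grade instead of a verdict. The thresholds, grade labels, and function names below are illustrative assumptions. In practice the distance computation would normally happen inside Postgres via pgvector; plain Python is used here only to show the arithmetic.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def grade(score: float) -> str:
    """Map a similarity score to an advisory grade; thresholds are illustrative."""
    if score >= 0.90:
        return "strong candidate"
    if score >= 0.75:
        return "possible candidate"
    return "unlikely"


def candidates(query: list, dogs: dict, limit: int = 5) -> list:
    """Rank stored embeddings against a query photo's embedding.

    Returns (name, score, grade) tuples for a human to review; nothing here
    asserts that two photos show the same animal.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in dogs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(name, round(score, 3), grade(score)) for name, score in scored[:limit]]
```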

Storage

Postgres data lives on local NVMe on the nodes that host each cluster's instances, not on NFS. The commit-path fsync latency matters more than the bytes-per-second number on a rescue workload, and pulling NFS out of that path cut per-commit fsync from roughly 12 ms to roughly 3 ms. Consumer NVMe plus full-disk encryption floors at around that 3 ms, so there's no deeper well to drill without enterprise drives with power-loss protection. It hasn't been the bottleneck in practice.
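For anyone who wants to compare their own storage against those figures, a rough probe is easy to write. The sketch below times a small write plus fsync and reports the median. It is an approximation, not the method used to get the numbers above; the 8 KiB payload and sample count are arbitrary choices. Postgres ships pg_test_fsync for a more faithful measurement.

```python
import os
import statistics
import tempfile
import time


def fsync_latency_ms(directory: str, samples: int = 50, payload: bytes = b"x" * 8192) -> float:
    """Median time, in milliseconds, for one write followed by fsync in `directory`."""
    timings = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # block until the OS reports the data durable
            timings.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
        os.unlink(path)
    return statistics.median(timings)


if __name__ == "__main__":
    print(f"median fsync: {fsync_latency_ms(tempfile.gettempdir()):.2f} ms")
```

Point `directory` at the filesystem that will actually hold the database. A temp directory backed by RAM will report numbers that mean nothing for this purpose.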

Shared state that needs to outlive a single node (media uploads, backup artifacts, the in-cluster object store) lives on a ZFS pool on a separate NAS, served over NFS. The pool is three-way RAIDZ2, so any two drives in a vdev can fail before data is at risk, and the dataset is encrypted at rest. Kubernetes consumes it via static PersistentVolumes plus the nfs-csi driver.

The read path for uploaded images gets two layers of caching stacked on top of that NFS backing. Every cluster node runs the Linux kernel's FS-Cache layer with cachefilesd, mounting the upload share with fsc,nosharecache so that the first read of a given file on a node pays the NFS round-trip and every subsequent read on the same node serves from that node's local NVMe transparently. Web pods spread across nodes via soft anti-affinity so each tenant's upload mount gets its own per-node cache. In front of that, Cloudflare holds /uploads/* at the edge for thirty days under a cache rule that respects the origin's Cache-Control header, which means the common case for public photo traffic never reaches the origin at all. When the edge does miss and the node cache is cold, the NFS pull is what the visitor waits on; every layer behind that is measured and bounded.
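The order in which those layers are consulted can be modelled in a few lines. This is a sketch of the lookup order only: dictionaries stand in for the Cloudflare edge and the per-node FS-Cache, and `fetch_nfs` is a hypothetical callable standing in for the NFS read. Expiry and Cache-Control handling are left out.

```python
from typing import Callable, Tuple


def read_upload(path: str, edge: dict, node_cache: dict,
                fetch_nfs: Callable[[str], bytes]) -> Tuple[bytes, str]:
    """Return (data, layer), where layer names the tier that served the read."""
    if path in edge:
        return edge[path], "edge"            # common case: origin never sees it
    if path in node_cache:
        data, layer = node_cache[path], "node-cache"
    else:
        data, layer = fetch_nfs(path), "nfs"  # the only read the visitor waits on
        node_cache[path] = data               # later reads on this node stay local
    edge[path] = data                         # the edge keeps the response it relayed
    return data, layer
```

Only the first read of a file on a given node with a cold edge reaches NFS. Every later read is served from the edge or from that node's local cache.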

Network

The public internet only reaches the cluster through a Cloudflare Tunnel. There are no open inbound ports on the edge router, and the tunnel daemon runs as two pod replicas in the cluster so that losing one node doesn't drop the site. DNS, TLS termination, WAF, bot filtering, and rate-limiting all happen at Cloudflare before a request is ever brokered into the cluster.

Internally, the network is Ubiquiti gear. Each cluster node has a primary 10G SFP uplink and a backup 1G RJ45 uplink landing on a separate switch, so a switch failure or a cable pull drops the node to a degraded path rather than offline. Two gateways run as a shadow-failover pair on two independent WAN circuits: a symmetric fiber line as primary and a 5G connection as failover. An ISP outage, a fiber cut, or a gateway failure all cut over without manual intervention, at the cost of some bandwidth during the 5G fallback window.

Known gap: every switch in the rack currently uplinks to both gateways through a single SFP aggregation switch. A software update on that aggregation switch drops traffic for the length of its reboot. One thing softens the impact today: maintenance windows are announced in advance via the in-app notification system, so rescues know when to expect a brief interruption. Once the off-site standby server described below is online, it will also serve the core non-AI workflows during any aggregation-switch downtime. A plan to remove the aggregation switch from the critical path is the next meaningful change on the network side.

Power

The rack is on a UPS, which gives enough runtime to ride out the shorter outages that make up the great majority of utility interruptions. Longer outages are covered by a backup generator. The next incremental step is a second UPS on a separate circuit with an automatic transfer switch in front of the load, so that a UPS electronics failure is no longer a single event that takes the rack down. That hardware is in the queue; the circuit changes to accept it are the main dependency.

Backups and recovery

Four independent backup layers overlap.

The in-cluster object store that holds the first and third layers (the WAL archive and the Velero snapshots) is itself replicated, server-side, to a second independent object store on a different physical host, so losing the host that runs the primary store doesn't take the backups with it. That replica is monitored for replication failures, not just for being reachable. A replica that is up but failing to receive copies pages, not only one that's outright down.
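The difference between "reachable" and "actually receiving copies" is the whole point of that check, and it fits in one function. A minimal sketch, assuming hypothetical field names and a hypothetical staleness limit; the real monitor's inputs are not described here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ReplicaStatus:
    reachable: bool
    failed_operations: int          # replication failures since the last check
    seconds_since_last_copy: float  # age of the newest object received


def should_page(status: ReplicaStatus, max_lag_seconds: float = 3600.0) -> Tuple[bool, str]:
    """Decide whether the backup replica's state warrants a page, and why."""
    if not status.reachable:
        return True, "replica unreachable"
    if status.failed_operations > 0:
        return True, "replica up but replication is failing"
    if status.seconds_since_last_copy > max_lag_seconds:
        # Assumed extra check: silence can also mean copies have stopped.
        return True, "replica up but no recent copy received"
    return False, "ok"
```

A reachability-only probe would return "ok" for the second and third cases, which are exactly the ones that quietly leave the backups unprotected.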

On the NAS itself, the underlying ZFS pool also takes periodic snapshots of the datasets that hold the dumps, the MinIO bucket, and the upload shares: cheap copy-on-write rollback even if a backup job wrote a bad archive. Every one of these restore paths has been exercised on this cluster, not left as a runbook assumption: a point-in-time recovery from the WAL archive, a standalone pg_restore from a nightly dump into an empty database, a pull of the SQL dumps back from the separate off-cluster restic server, and a full namespace rebuild from Velero with a pg_restore on top. In each case the data comes back and the pods serve traffic.

What's still missing is distance. Every copy above lives on the premises. True off-site replication, a copy that survives the building, is the next addition on this side. After that, a small off-site standby server will come online that can serve the core rescue workflows (applications, medical records, photo uploads) at reduced capacity if the primary site is offline. It won't run the AI pieces, so AI-assisted features fall back to a simpler path there.

Security

Every storage device in the rack is encrypted at rest at the block or dataset level. Server NVMe drives use LUKS2 full-disk encryption; the NAS pool uses native ZFS dataset encryption. A drive pulled out of the rack for RMA or disposal is ciphertext without the key, and the key material lives in places that don't leave the premises.

On top of that, Kubernetes Secrets are encrypted at rest in etcd (AES-CBC). The etcd encryption key is held in two places: on each k3s server, and in a sealed copy on the NAS that's independent of the cluster, so a total cluster wipe followed by an etcd-snapshot-restore can still decrypt the recovered secrets.

Runtime security monitoring is handled by Falco on every node, with alerts on anything at critical priority or above forwarded through a small webhook translator to a notification channel. Each application namespace has its own NetworkPolicies so that cross-tenant traffic and unexpected egress paths get dropped at the CNI level rather than relying on application-level trust.
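"Critical priority or above" depends on Falco's priority ordering, which runs from emergency (most severe) down to debug. A sketch of the filter a webhook translator might apply is below; the function is illustrative and is not the translator's actual code.

```python
# Falco rule priorities, most severe first.
FALCO_PRIORITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def should_forward(priority: str, threshold: str = "critical") -> bool:
    """True when `priority` is at least as severe as `threshold`."""
    rank = {name: i for i, name in enumerate(FALCO_PRIORITIES)}
    return rank[priority.lower()] <= rank[threshold.lower()]


if __name__ == "__main__":
    for p in ("Emergency", "Critical", "Error", "Notice"):
        print(p, should_forward(p))
```

With the threshold at critical, only emergency, alert, and critical events reach the channel; error and below stay in the local Falco output.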

Secrets that back the application (encryption keys for PII columns, OAuth client secrets, SMTP credentials) are stored in Kubernetes Secrets, never checked into git, and rotated if they are ever exposed. The public repository for the app itself is scanned with gitleaks before any commit that's meant to land publicly.

Observability

Prometheus and Grafana run in-cluster, scraping the kubelet, the k3s control plane where it exposes metrics, cAdvisor, and the application pods. Alertmanager routes pageable alerts through the same webhook translator mentioned above; noisy built-in rules like Watchdog are explicitly dropped so the channel only fires when something actually wants attention. Grafana's own state (users, datasources, dashboards) sits on a retained PVC that survived the most recent cluster rebuild without manual restoration.

Alerts don't go straight to a page. They first pass through a triage layer that classifies each one, deduplicates it against whatever is already firing, and decides where it should go. Routine, well-understood conditions (a pod that needs a restart, a transient probe blip, a service already recovering on its own) are handled by a small model running locally on the same on-site GPUs as the rest of the AI stack, which can apply a fixed, pre-approved set of corrective actions and then verify the fix actually held. Anything it can't safely resolve, or anything above a severity threshold, escalates instead of being touched.
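The shape of that decision (deduplicate, check severity, consult a fixed list, otherwise escalate) can be sketched without the model. Everything named below is hypothetical: the alert names, the action names, and the severity scale are placeholders, and a plain lookup stands in for the local model's classification.

```python
from dataclasses import dataclass

# Hypothetical reviewed allowlist: alert name -> the one action permitted for it.
APPROVED_ACTIONS = {"PodCrashLooping": "restart_pod"}
SEVERITY_ORDER = ["info", "warning", "critical"]


@dataclass(frozen=True)
class Alert:
    name: str
    target: str
    severity: str


def triage(alert: Alert, firing: set, max_auto_severity: str = "warning") -> str:
    """Return 'duplicate', 'auto:<action>', or 'escalate' for one incoming alert."""
    key = (alert.name, alert.target)
    if key in firing:
        return "duplicate"          # already being handled; don't notify twice
    firing.add(key)
    too_severe = (SEVERITY_ORDER.index(alert.severity)
                  > SEVERITY_ORDER.index(max_auto_severity))
    action = APPROVED_ACTIONS.get(alert.name)
    if too_severe or action is None:
        return "escalate"           # not on the list, or too serious to touch
    return f"auto:{action}"
```

Verifying that a fix held would be a second step after the action runs; it is omitted here because it depends on what each action checks.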

Escalation is tiered by severity. Most conditions are routed to a human as a notification with enough context attached to act on quickly. A narrow band of harder infrastructure problems can be handed up to a more capable frontier model for deeper diagnosis. That path is deliberately scoped to the infrastructure and the application's own behavior (logs, metrics, cluster state, service health) and explicitly not to any adopter, applicant, or animal-record data; it reasons about the platform, never about the people using it. Every automated action is logged, and the actions the automated tiers are allowed to take are a fixed, reviewed list, not an open-ended license to change things.

This layer was walked up deliberately rather than switched on at full autonomy: it began observe-only (classifying alerts and taking no action), and actions were enabled narrowly only after the routing had proven correct over time. The standing bias is to escalate when uncertain rather than to act, so that a real or early-warning signal is never written off as noise and a symptom is never patched in a way that hides its cause. The intent is to widen what a lone operator would otherwise miss, not to substitute for one.

Proving it

Redundancy you haven't tested is a guess. The failure modes above are exercised deliberately: the failure is injected on the live cluster, under a synthetic load against the public site, and what the site actually does is measured. Each one is a repeatable script, not a one-time stunt, so the same failure can be re-run after any change that might have quietly regressed it.
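A repeatable drill has a simple skeleton: keep probing, inject the failure partway through, and record which requests failed. The sketch below shows that skeleton only. `probe` and `inject` are hypothetical callables the caller supplies (for example, an HTTP check against the public site and a script that takes a node down); it is not the actual drill tooling.

```python
import time
from typing import Callable


def run_drill(probe: Callable[[], bool], inject: Callable[[], None],
              requests: int = 100, inject_at: int = 20, pause: float = 0.0) -> dict:
    """Send `requests` probes, inject the failure at `inject_at`, report failures."""
    failures = []
    for i in range(requests):
        if i == inject_at:
            inject()
        try:
            ok = probe()
        except Exception:
            ok = False  # a raised error counts as a failed request
        if not ok:
            failures.append(i)
        time.sleep(pause)
    return {
        "requests": requests,
        "failed": len(failures),
        "first_failure": failures[0] if failures else None,
        "last_failure": failures[-1] if failures else None,
    }
```

Because the result is a plain dictionary, the same drill can be re-run after a change and the two results compared directly.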

What this setup isn't

It isn't a colocation deployment in a Tier IV data center. It isn't multi-region. It currently isn't even multi-site. Two of the three cluster nodes are VMs on two different hypervisors in the same rack, and the third node is a bare-metal box in the same rack. The network and power redundancy buys availability against the failure modes that actually happen at this scale (drives, power supplies, UPS batteries, NICs, single-circuit outages), not against catastrophic site loss. The off-site pieces called out above are what turn that corner.

If you're running a rescue on a small VPS somewhere and you're wondering whether this is better than that: probably, for the things most rescues actually need. If you're running on a well-maintained managed platform with a real SRE team, this isn't going to out-SRE them. The honest answer to most "is this as good as X" questions is "no, but it doesn't need to be for what we're doing here."