A ground-up platform engineering project for Ghana School of Law — designing and operating a production-grade hybrid Nomad cluster that eliminated manual deployments, removed all secrets from repositories, and cut build times by 16×. The platform runs 10+ services across two physical locations with zero open ports.
My Role
Sole Platform & DevOps Engineer
Duration
3 months · 2025
Context
Ghana School of Law
Outcome
10+ workloads · 2 locations · zero open ports · 16× faster builds
Stack
Context
Ghana School of Law had no formal infrastructure. Applications ran on ad-hoc servers, deployments were manual SSH sessions, and secrets were hardcoded in environment files or committed directly to repositories.
The Pain
Every deployment was a manual SSH session. Secrets lived in plaintext files. There was no way to know a service was down until users complained. A single hardware failure would have taken every digital service offline with no recovery plan.
Why It Mattered
The school runs 10+ production services — student portal, HR system, records management, and internal tools — serving hundreds of students and staff daily. Downtime directly impacts enrollment, examinations, and day-to-day administration.
Technical Goals
Constraints
A hybrid two-location Nomad cluster with a shared HashiCorp control plane (Nomad + Consul + Vault), zero-trust ingress via Cloudflare tunnels, and GitOps-style deployments through GitHub Actions. No port is ever opened on either server — all traffic flows outbound through Cloudflare. The Mac Studio runs the warm standby; OVH bare-metal runs active production.
HashiCorp Nomad
Job scheduler — runs all 10+ production workloads as Docker containers with health-checked routing
HashiCorp Consul
Service discovery and health checking — Traefik reads the service registry dynamically, no manual routing config
HashiCorp Vault
Centralized secrets engine — KV v2 for static secrets, JWT workload auth so containers never hold static credentials
Traefik
Reverse proxy — dynamically routes traffic based on Consul service registrations and handles TLS termination
Cloudflare Zero Trust Tunnel
Public ingress — outbound-only persistent connection, no inbound ports open on either server
GitHub Actions + GHCR
CI/CD — builds multi-arch Docker images, pushes to GHCR, triggers Nomad job updates on merge to main
OrbStack (Mac Studio)
Type-2 hypervisor on macOS — runs 7 Ubuntu ARM64 VMs mirroring production exactly as a warm standby
Proxmox VE (OVH)
Type-1 hypervisor on bare metal — runs 7 Ubuntu x86_64 VMs as the active production environment
→ Nomad over Kubernetes
Single-engineer operation requires simplicity. Nomad has no etcd and a far simpler control plane. A K8s cluster would take days to recover from scratch; Nomad takes hours. The operational overhead difference is night and day at this scale.
→ Cloudflare Zero Trust over VPN or open ports
Zero public IP means zero attack surface. Traditional setups require open ports or a cloud load balancer — both add cost and risk. Cloudflare tunnels are free, outbound-only, and survive NAT and firewalls. Perfect for a hybrid setup where the on-prem node sits behind a home router.
→ HashiCorp Vault over cloud secrets managers
Self-hosted means no AWS dependency for secrets. Vault's JWT workload auth means containers never see a static credential — they authenticate at startup and receive exactly what they need for that run. Eliminates the entire class of 'credentials leaked in logs' incidents.
→ OrbStack for on-premises VMs
Mac Studio + OrbStack gives ARM64 VMs with near-native performance at zero cost. OrbStack is dramatically faster and lighter than VMware Fusion. The warm standby mirrors production exactly — same VM count, same job specs, same config.
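The zero-trust ingress decision above comes down to a small cloudflared configuration. A minimal sketch, assuming illustrative hostnames, tunnel ID, and file paths (the real values live with the deployment, not here):

```yaml
# Hypothetical /etc/cloudflared/config.yml — tunnel ID and hostnames are placeholders.
tunnel: 6f9c2a1e-0000-0000-0000-placeholder
credentials-file: /etc/cloudflared/tunnel-creds.json

ingress:
  # All public hostnames land on the local Traefik instance, which fans
  # traffic out using Consul service registrations. The connection to
  # Cloudflare is outbound-only, so no inbound port is ever opened.
  - hostname: "*.gsl.example.edu"
    service: http://localhost:80
  # cloudflared requires a catch-all rule as the last entry.
  - service: http_status:404
```

Because the daemon dials out to Cloudflare's edge, the same config works identically behind the home router on-prem and on the OVH bare metal.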
All Nomad job specs, Consul service definitions, and Vault policies are version-controlled as HCL. The entire cluster state is reproducible from the repository.
job "api-server" {
  datacenters = ["prod"]
  type        = "service"

  group "app" {
    task "server" {
      driver = "docker"

      config {
        image = "ghcr.io/org/api:${var.image_tag}"
      }

      vault {
        policies = ["api-server-policy"]
      }

      template {
        data        = <<EOH
{{ with secret "kv/data/api-server" }}
DB_PASSWORD={{ .Data.data.db_password }}
REDIS_URL={{ .Data.data.redis_url }}
{{ end }}
EOH
        destination = "secrets/.env"
        env         = true
      }
    }
  }
}

GitHub Actions builds multi-architecture Docker images on self-hosted runners, pushes to GHCR, and triggers Nomad job updates on merge to main — zero manual steps.
name: Build & Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: [self-hosted, linux, ARM64]
    steps:
      - uses: actions/checkout@v4
      - name: Build & push image
        run: |
          docker build -t ghcr.io/org/api:${{ github.sha }} .
          docker push ghcr.io/org/api:${{ github.sha }}
      - name: Deploy to Nomad
        run: |
          nomad job run \
            -var="image_tag=${{ github.sha }}" \
            jobs/api.nomad

All public traffic routes through Cloudflare Zero Trust Tunnels — no port is open on either server. Traefik handles TLS termination and dynamic routing populated from the Consul service registry.
HashiCorp Vault eliminated all plaintext secrets. Containers authenticate via JWT at startup and receive exactly the credentials they need — nothing more.
Prometheus and Grafana provide cluster-wide metrics visibility. Automated nightly PostgreSQL dumps ship encrypted to Backblaze B2 via Nomad periodic jobs.
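The nightly backup flow can be sketched as a Nomad periodic batch job. This is illustrative, not the production spec: the job name, schedule, policy name, and the `upload-to-b2` step are assumptions, and the B2 credentials would be injected from Vault rather than written into the spec.

```hcl
job "pg-backup" {
  datacenters = ["prod"]
  type        = "batch"

  periodic {
    cron             = "0 2 * * *"   # nightly at 02:00
    prohibit_overlap = true
  }

  group "backup" {
    task "dump-and-ship" {
      driver = "docker"

      config {
        image   = "postgres:16"
        command = "/bin/sh"
        # upload-to-b2 stands in for the actual upload step (e.g. an rclone
        # invocation); the dump is encrypted before it ever leaves the host.
        args = ["-c", "pg_dump \"$DATABASE_URL\" | gzip | gpg --batch --symmetric --passphrase \"$BACKUP_KEY\" | upload-to-b2"]
      }

      vault {
        policies = ["backup-policy"]
      }
    }
  }
}
```

`prohibit_overlap` ensures a slow dump never races the next night's run.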
The Problem
The Mac Studio uses Apple Silicon (ARM64). The OVH bare-metal server is x86_64. Docker images built for one architecture won't run on the other. Initial CI builds either failed on the wrong runner or took 4+ hours using QEMU emulation for cross-compilation.
The Fix
Added a self-hosted GitHub Actions runner on each machine — ARM64 on Mac Studio, x86_64 on OVH. Each runner builds images only for its native architecture. Introduced a shared base-image strategy to eliminate redundant dependency installation. Build times dropped from 4+ hours to under 15 minutes.
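The native-runner split described above can be sketched as a build matrix plus a manifest-merge job. Runner labels and image names are illustrative, not the actual workflow:

```yaml
jobs:
  build:
    strategy:
      matrix:
        include:
          - arch: arm64
            labels: [self-hosted, ARM64]   # Mac Studio runner
          - arch: amd64
            labels: [self-hosted, X64]     # OVH runner
    runs-on: ${{ matrix.labels }}
    steps:
      - uses: actions/checkout@v4
      # Each runner builds only its native architecture, so no QEMU emulation.
      - run: |
          docker build -t ghcr.io/org/api:${{ github.sha }}-${{ matrix.arch }} .
          docker push ghcr.io/org/api:${{ github.sha }}-${{ matrix.arch }}

  manifest:
    needs: build
    runs-on: [self-hosted]
    steps:
      # Stitch the per-arch images into one multi-arch tag.
      - run: |
          docker manifest create ghcr.io/org/api:${{ github.sha }} \
            ghcr.io/org/api:${{ github.sha }}-arm64 \
            ghcr.io/org/api:${{ github.sha }}-amd64
          docker manifest push ghcr.io/org/api:${{ github.sha }}
```

Nomad then pulls the single manifest tag, and each node's Docker daemon resolves the correct architecture automatically.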
The Problem
Vault needs to be initialized and unsealed before any service can fetch secrets. But the unseal keys are themselves sensitive. During initial bootstrap, a single wrong step can lock you out of the entire secrets engine — and recovering from a sealed Vault in production is a stressful, manual process.
The Fix
Documented the bootstrap sequence step-by-step with explicit checkpoints. Stored encrypted unseal key shards in a secure offline location separate from the cluster. Wrote a health-check script that alerts within 60 seconds if Vault becomes sealed unexpectedly. Bootstrap is now a 15-minute repeatable process any engineer can follow.
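The seal watchdog can be sketched as a small script run from cron every minute. The function name and the webhook variable are hypothetical; the real check feeds `vault status -format=json` into the function, shown here against a canned payload so the sketch is self-contained:

```shell
#!/usr/bin/env bash
# Hypothetical seal watchdog sketch — vault_sealed and ALERT_WEBHOOK are
# illustrative names. In production this runs from cron every minute:
#   vault status -format=json | check-vault-sealed.sh

# Succeed (exit 0) when the status JSON on stdin reports a sealed Vault.
vault_sealed() {
  grep -q '"sealed"[[:space:]]*:[[:space:]]*true'
}

# Demo against a canned status payload instead of a live Vault:
sample='{"type":"shamir","sealed":true,"t":3,"n":5}'
if printf '%s' "$sample" | vault_sealed; then
  echo "ALERT: vault sealed"
  # curl -fsS -X POST -d '{"text":"Vault is SEALED"}' "$ALERT_WEBHOOK"
fi
```

Keeping the check to a single grep over the CLI's JSON output means the watchdog has no dependency on Vault being unsealed — it still works precisely when Vault is in the state being alerted on.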
The Problem
During rolling deploys, Traefik would briefly route traffic to containers that hadn't finished starting up, causing intermittent 502 errors. Consul health checks were registering services as healthy before the application was truly ready to serve traffic.
The Fix
Added startup health-check grace periods in Nomad job specs and tightened Consul check intervals from 30s to 5s with a 3-failure deregistration threshold. Implemented TCP health checks alongside HTTP — a service must pass both before Traefik routes to it. 502 errors dropped to zero on the next deploy.
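Inside a Nomad job spec, that tightened readiness gate looks roughly like the following sketch. The endpoint path, port label, and exact timings are illustrative; the `check_restart` stanza is Nomad's mechanism for acting on consecutive check failures:

```hcl
service {
  name = "api-server"
  port = "http"

  # HTTP check: the app must answer its readiness endpoint.
  check {
    type     = "http"
    path     = "/healthz"
    interval = "5s"
    timeout  = "2s"

    check_restart {
      limit = 3       # act after 3 consecutive failures
      grace = "10s"   # startup grace period before failures count
    }
  }

  # TCP check: the socket must accept connections at all.
  check {
    type     = "tcp"
    interval = "5s"
    timeout  = "2s"
  }
}
```

Traefik only routes to instances that Consul reports as passing, so requiring both checks keeps half-started containers out of the load-balancing pool.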
The platform transformed Ghana School of Law from ad-hoc manual deployments to a fully automated, observable, and secure production infrastructure — all operated by a single engineer.
Before → After
Build Time
Deployment Process
Secrets Exposure
Public Attack Surface
Database Backups
Recovery Time
Business Outcome
The school ships updates to production in under 15 minutes with confidence. A hardware failure on either location can be recovered from in under 30 minutes by promoting the standby. The platform has been running without incident since launch.
Would Do Differently
Key Takeaways