Platform Engineering

Production Nomad Cluster

Production Nomad Cluster architecture diagram

Summary

(8 min read)

A ground-up platform engineering project for Ghana School of Law — designing and operating a production-grade hybrid Nomad cluster that eliminated manual deployments, removed all secrets from repositories, and cut build times by 16×. The platform runs 10+ services across two physical locations with zero open ports.

Project Snapshot

My Role

Sole Platform & DevOps Engineer

Duration

3 months · 2025

Context

Ghana School of Law

Outcome

10+ workloads · 2 locations · zero open ports · 16× faster builds

Stack

Nomad · Consul · Vault · Traefik · Cloudflare Zero Trust · GitHub Actions · Docker · OVH Cloud · OrbStack

The Problem

Context

Ghana School of Law had no formal infrastructure. Applications ran on ad-hoc servers, deployments were manual SSH sessions, and secrets were hardcoded in environment files or committed directly to repositories.

The Pain

Every deployment was a manual SSH session. Secrets lived in plaintext files. There was no way to know a service was down until users complained. A single hardware failure would have taken every digital service offline with no recovery plan.

Why It Mattered

The school runs 10+ production services — student portal, HR system, records management, and internal tools — serving hundreds of students and staff daily. Downtime directly impacts enrollment, examinations, and day-to-day administration.

Goals & Requirements

Technical Goals

  • Zero public IP exposure — all traffic through Cloudflare Zero Trust tunnels
  • Zero secrets in repositories, environment files, or Docker images
  • Automated deployments on merge to main — no manual SSH required
  • Daily off-site database backups with verified restore capability
  • Service health visibility via dashboards
  • Warm standby across two physical locations — RTO under 30 minutes

Constraints

  • No budget for managed cloud — must run on existing Mac Studio + low-cost bare metal
  • Single engineer — must be fully operable and recoverable by one person
  • Existing applications must migrate with zero downtime
  • ARM64 on-premises (Mac Studio) vs x86_64 cloud — multi-architecture builds required

Architecture Design

A hybrid two-location Nomad cluster with a shared HashiCorp control plane (Nomad + Consul + Vault), zero-trust ingress via Cloudflare tunnels, and GitOps-style deployments through GitHub Actions. No port is ever opened on either server — all traffic flows outbound through Cloudflare. The Mac Studio runs the warm standby; OVH bare-metal runs active production.

Architecture Diagram


Component Breakdown

HashiCorp Nomad

Job scheduler — runs all 10+ production workloads as Docker containers with health-checked routing

HashiCorp Consul

Service discovery and health checking — Traefik reads the service registry dynamically, no manual routing config

HashiCorp Vault

Centralized secrets engine — KV v2 for static secrets, JWT workload auth so containers never hold static credentials

Traefik

Reverse proxy — dynamically routes traffic based on Consul service registrations and handles TLS termination

Cloudflare Zero Trust Tunnel

Public ingress — outbound-only persistent connection, no inbound ports open on either server

GitHub Actions + GHCR

CI/CD — builds multi-arch Docker images, pushes to GHCR, triggers Nomad job updates on merge to main

OrbStack (Mac Studio)

Lightweight virtualization on macOS (built on Apple's Virtualization framework) — runs 7 Ubuntu ARM64 VMs mirroring the production topology as a warm standby

Proxmox VE (OVH)

Type-1 hypervisor on bare metal — runs 7 Ubuntu x86_64 VMs as the active production environment

Key Design Decisions

Nomad over Kubernetes

Single-engineer operation requires simplicity. Nomad is a single binary with built-in Raft consensus: no etcd, no sprawling control plane. A K8s cluster would take days to recover from scratch; Nomad takes hours. The operational overhead difference is night and day at this scale.

Cloudflare Zero Trust over VPN or open ports

Zero public IP means zero attack surface. Traditional setups require open ports or a cloud load balancer — both add cost and risk. Cloudflare tunnels are free, outbound-only, and survive NAT and firewalls. Perfect for a hybrid setup where the on-prem node sits behind a home router.

HashiCorp Vault over cloud secrets managers

Self-hosted means no AWS dependency for secrets. Vault's JWT workload auth means containers never see a static credential — they authenticate at startup and receive exactly what they need for that run. Eliminates the entire class of 'credentials leaked in logs' incidents.

OrbStack for on-premises VMs

Mac Studio + OrbStack gives ARM64 VMs with near-native performance at zero cost. OrbStack is dramatically faster and lighter than VMware Fusion. The warm standby mirrors production exactly — same VM count, same job specs, same config.

Implementation Breakdown

01 · Infrastructure as Code

All Nomad job specs, Consul service definitions, and Vault policies are version-controlled as HCL. The entire cluster state is reproducible from the repository.

  • One Nomad HCL job spec per service — parameterized with an image tag variable
  • Vault policies defined as code — least-privilege scoped per workload identity
  • Consul service definitions co-located with job specs
  • Bootstrap Bash scripts for new node provisioning — full cluster from scratch in under 2 hours
Nomad job with Vault secret injection (HCL)
variable "image_tag" {
  type = string
}

job "api-server" {
  datacenters = ["prod"]
  type        = "service"

  group "app" {
    task "server" {
      driver = "docker"

      config {
        image = "ghcr.io/org/api:${var.image_tag}"
      }

      vault { policies = ["api-server-policy"] }

      template {
        data        = <<EOH
{{ with secret "kv/data/api-server" }}
DB_PASSWORD={{ .Data.data.db_password }}
REDIS_URL={{ .Data.data.redis_url }}
{{ end }}
EOH
        destination = "secrets/.env"
        env         = true
      }
    }
  }
}
02 · CI/CD Pipeline

GitHub Actions builds multi-architecture Docker images on self-hosted runners, pushes to GHCR, and triggers Nomad job updates on merge to main — zero manual steps.

  • Multi-stage Docker builds reduced image sizes 60–80% and eliminated bloated layers
  • Shared base-image strategy cut redundant dependency installation across every service
  • Self-hosted ARM64 runner on Mac Studio for on-prem builds
  • Self-hosted x86_64 runner on OVH for production builds
  • GitOps: merge to main → automatic deploy, no manual SSH, ever
GitHub Actions build & deploy workflow (YAML)
name: Build & Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: [self-hosted, linux, ARM64]
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build & push image
        run: |
          docker build -t ghcr.io/org/api:${{ github.sha }} .
          docker push ghcr.io/org/api:${{ github.sha }}

      - name: Deploy to Nomad
        run: |
          nomad job run \
            -var="image_tag=${{ github.sha }}" \
            jobs/api.nomad
03 · Zero Trust Networking

All public traffic routes through Cloudflare Zero Trust Tunnels — no port is open on either server. Traefik handles TLS termination and dynamic routing populated from the Consul service registry.

  • cloudflared runs as a Nomad service on the router VM in each environment
  • Traefik reads routes from Consul — no manual config needed when a new service starts
  • All *.gslaw.school domains route through Cloudflare — zero direct IP exposure
  • Internal service-to-service traffic uses Consul service mesh — no hardcoded IPs
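As a sketch of how the first bullet can be wired up, a cloudflared Nomad job might look like this (the job name, policy name, and Vault path are illustrative, not the actual specs; cloudflared reads its tunnel token from the TUNNEL_TOKEN environment variable):

```hcl
# Illustrative sketch: cloudflared as a Nomad service.
# Job name, Vault path, and policy name are assumptions.
job "cloudflared" {
  datacenters = ["prod"]
  type        = "service"

  group "tunnel" {
    task "cloudflared" {
      driver = "docker"

      config {
        image = "cloudflare/cloudflared:latest"
        # "tunnel run" picks up TUNNEL_TOKEN from the environment.
        args  = ["tunnel", "run"]
      }

      vault { policies = ["cloudflared-policy"] }

      # Inject the tunnel token from Vault so it never appears in the spec.
      template {
        data        = <<EOH
{{ with secret "kv/data/cloudflared" }}TUNNEL_TOKEN={{ .Data.data.token }}{{ end }}
EOH
        destination = "secrets/.env"
        env         = true
      }
    }
  }
}
```

Because the tunnel is outbound-only, this job needs no exposed ports in its network stanza.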
04 · Secrets Management

HashiCorp Vault eliminated all plaintext secrets. Containers authenticate via JWT at startup and receive exactly the credentials they need — nothing more.

  • KV v2 engine for static secrets — database passwords, API keys, third-party tokens
  • JWT workload auth — each Nomad job carries a unique identity token scoped to its policy
  • Vault policies enforce least-privilege — api-server cannot read database admin secrets
  • Zero secrets in .env files, Docker images, CI/CD logs, or git history
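The least-privilege bullet can be sketched as a Vault policy; Vault is deny-by-default, so granting only the workload's own KV v2 path is enough (the path and secret names here are illustrative):

```hcl
# Illustrative policy for the api-server workload: read access to its own
# KV v2 secrets only. Anything not listed (e.g. database admin paths)
# is implicitly denied.
path "kv/data/api-server" {
  capabilities = ["read"]
}
```

Note that reads against a KV v2 mount require the `data/` segment in the policy path, even though the Vault CLI hides it.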
05 · Observability & Backups

Prometheus and Grafana provide cluster-wide metrics visibility. Automated nightly PostgreSQL dumps ship encrypted to Backblaze B2 via Nomad periodic jobs.

  • Prometheus scrapes Nomad, Consul, and application health metrics
  • Grafana dashboards for cluster health, job status, and resource utilization
  • Nomad periodic job runs pg_dump nightly, encrypts output, ships to Backblaze B2
  • Every production database covered — restore tested and verified monthly
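The nightly backup bullet might translate into a Nomad periodic batch job roughly like this (the image, schedule, and secret paths are assumptions, and the upload to Backblaze B2 is elided):

```hcl
# Illustrative sketch of the nightly dump-and-encrypt step.
# The B2 upload (e.g. via rclone) is omitted for brevity.
job "pg-backup" {
  datacenters = ["prod"]
  type        = "batch"

  periodic {
    cron             = "0 2 * * *"  # every night at 02:00
    prohibit_overlap = true
  }

  group "backup" {
    task "dump" {
      driver = "docker"

      config {
        image   = "postgres:16"
        command = "/bin/sh"
        # Dump, compress, and symmetrically encrypt before anything leaves the box.
        args    = ["-c", "pg_dump \"$DATABASE_URL\" | gzip | gpg --symmetric --batch --passphrase \"$BACKUP_KEY\" -o /alloc/data/backup.sql.gz.gpg"]
      }

      vault { policies = ["pg-backup-policy"] }

      template {
        data        = <<EOH
{{ with secret "kv/data/pg-backup" }}
DATABASE_URL={{ .Data.data.database_url }}
BACKUP_KEY={{ .Data.data.backup_key }}
{{ end }}
EOH
        destination = "secrets/.env"
        env         = true
      }
    }
  }
}
```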

Challenges & Solutions

#1 · Multi-architecture builds: ARM64 on-prem vs x86_64 in cloud

The Problem

The Mac Studio uses Apple Silicon (ARM64). The OVH bare-metal server is x86_64. Docker images built for one architecture won't run on the other. Initial CI builds either failed on the wrong runner or took 4+ hours using QEMU emulation for cross-compilation.

The Fix

Added a self-hosted GitHub Actions runner on each machine — ARM64 on Mac Studio, x86_64 on OVH. Each runner builds images only for its native architecture. Introduced a shared base-image strategy to eliminate redundant dependency installation. Build times dropped from 4+ hours to under 15 minutes.

#2 · Vault bootstrap: the chicken-and-egg unseal problem

The Problem

Vault needs to be initialized and unsealed before any service can fetch secrets. But the unseal keys are themselves sensitive. During initial bootstrap, a single wrong step can lock you out of the entire secrets engine — and recovering from a sealed Vault in production is a stressful, manual process.

The Fix

Documented the bootstrap sequence step-by-step with explicit checkpoints. Stored encrypted unseal key shards in a secure offline location separate from the cluster. Wrote a health-check script that alerts within 60 seconds if Vault becomes sealed unexpectedly. Bootstrap is now a 15-minute repeatable process any engineer can follow.
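The seal alert can be driven by a Consul check against Vault's own health endpoint, which returns HTTP 503 while sealed, so the check goes critical within one interval of an unexpected seal (the service name and interval here are illustrative, and the alert routing itself isn't shown):

```hcl
# Illustrative Consul agent service definition.
# Vault's /v1/sys/health responds 200 when active and 503 when sealed,
# so this HTTP check flips to critical as soon as Vault seals.
service {
  name = "vault-health"
  port = 8200

  check {
    id       = "vault-seal-check"
    http     = "http://127.0.0.1:8200/v1/sys/health"
    interval = "15s"
    timeout  = "5s"
  }
}
```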

#3 · Consul health check race conditions on rolling deploy

The Problem

During rolling deploys, Traefik would briefly route traffic to containers that hadn't finished starting up, causing intermittent 502 errors. Consul health checks were registering services as healthy before the application was truly ready to serve traffic.

The Fix

Added startup health-check grace periods in Nomad job specs and tightened Consul check intervals from 30s to 5s with a 3-failure deregistration threshold. Implemented TCP health checks alongside HTTP — a service must pass both before Traefik routes to it. 502 errors dropped to zero on the next deploy.
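The fix translates roughly into a service stanza like the following (the port name, health path, and thresholds are illustrative):

```hcl
# Illustrative: dual TCP + HTTP checks with a startup grace period.
group "app" {
  network {
    port "http" { to = 8080 }
  }

  service {
    name = "api-server"
    port = "http"

    # The container must accept connections...
    check {
      type     = "tcp"
      interval = "5s"
      timeout  = "2s"
    }

    # ...and answer its readiness probe before Traefik routes to it.
    check {
      type     = "http"
      path     = "/healthz"
      interval = "5s"
      timeout  = "2s"

      # Give the task a warm-up window before failed checks count.
      check_restart {
        limit = 3
        grace = "30s"
      }
    }
  }
}
```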

Results & Impact

The platform transformed Ghana School of Law from ad-hoc manual deployments to a fully automated, observable, and secure production infrastructure — all operated by a single engineer.

Before → After

Build Time

4+ hours → < 15 min
16× faster

Deployment Process

Manual SSH → Merge to main
Fully automated

Secrets Exposure

Plaintext in repos → Vault-only
Zero exposure

Public Attack Surface

Open ports → Zero open ports
Eliminated

Database Backups

None → Nightly automated
Full coverage

Recovery Time

Unknown / days → < 30 min
Warm standby

Business Outcome

The school ships updates to production in under 15 minutes with confidence. A hardware failure on either location can be recovered from in under 30 minutes by promoting the standby. The platform has been running without incident since launch.

Reflections

Would Do Differently

  • 01 · Automate the Vault unseal process from day one — the manual unseal step is the only part of the cluster not fully automated, and it's a single-engineer dependency I want to eliminate
  • 02 · Add Nomad Autoscaler earlier — scaling is currently manual, which means keeping capacity headroom in my head at all times
  • 03 · Implement structured JSON logging from day one rather than retrofitting — adding log aggregation to services not built with it is significantly harder

Key Takeaways

  • 01 · Zero Trust networking is not a 'nice to have' for a single-engineer shop — it's the only way to secure a multi-location cluster without a dedicated network team or a cloud VPN budget
  • 02 · Nomad's simplicity is a genuine engineering advantage, not a compromise — a midnight cluster recovery drill confirmed that a self-hosted Nomad cluster is recoverable by one person in under an hour
  • 03 · The shared base-image strategy was the single highest-leverage CI optimization — one change simultaneously cut build times across every service in the cluster

Next Project

Airflow ETL & Backup Pipeline

Data Infrastructure

Thanks for Reading