Data Infrastructure

Airflow ETL & Backup Pipeline

Summary

(6 min read)

A production data platform built on HashiCorp Nomad, running Apache Airflow with Vault-backed connection strings. ETL DAGs ingest from HR, Records, and Service Desk databases into a central warehouse surfaced through Metabase. Automated nightly backups ship to Backblaze B2 — covering every production database.

Project Snapshot

My Role

DevOps & Data Platform Engineer

Duration

6 weeks · 2025

Context

Ghana School of Law

Outcome

Analytics platform live · 100% database backup coverage · zero credentials in DAG code

Stack

Apache Airflow · Nomad · PostgreSQL · Vault · Metabase · Backblaze B2 · Bash · Redis

The Problem

Context

The school had multiple PostgreSQL databases (HR, records, service desk) with no analytics layer and no backup strategy. Business decisions were made without data.

The Pain

No visibility into school operations. Manual reporting took days. Any database server failure would result in permanent, unrecoverable data loss.

Why It Mattered

Student enrollment data, exam records, and HR information with no disaster recovery — a single drive failure away from catastrophic institutional data loss.

Goals & Requirements

Technical Goals

  • Deploy Airflow on the existing Nomad cluster — no new infrastructure
  • ETL from 3 PostgreSQL source systems into a central analytics warehouse
  • Automated nightly backups to off-site cloud storage
  • Zero credentials in DAG code or environment files
  • Analyst team manages DAGs independently from infrastructure changes

Constraints

  • Must run on existing cluster — zero new infrastructure budget
  • Analyst-owned DAG repository must stay isolated from infrastructure concerns

Architecture Design

Airflow deployed as Nomad services (scheduler, webserver, workers) with Vault-backed connection URIs. A pull-based DAG sync model isolates analyst workflows from infrastructure. Backup jobs run as Nomad periodic tasks — one per database, independent failure domains.

Architecture Diagram

Component Breakdown

Apache Airflow

DAG orchestration — schedules and executes ETL and backup tasks

HashiCorp Vault

Supplies database connection strings at task execution time — zero credentials stored in Airflow or DAG files

PostgreSQL Sources

HR, Records Management, and Service Desk source databases feeding the ETL pipeline

Metabase

BI dashboard layer surfacing warehouse data to school leadership and operations staff

Backblaze B2

Off-site backup destination — encrypted nightly dumps from every production database

Key Design Decisions

Pull-based DAG sync from a separate analyst repository

Analysts own the DAG repository independently. A cron job on the Airflow worker pulls from the analyst repo — analysts can update and deploy DAGs without ever needing cluster access.
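
A minimal sketch of the sync step, assuming the analyst repo is already cloned into the worker's DAG folder; the script name and path here are illustrative, not the production values.

  # dag_sync.py: pull-based DAG sync, invoked from cron on the Airflow worker.
  # Scheduled roughly as: */5 * * * * python3 /opt/airflow/dag_sync.py
  import subprocess
  import sys

  DAGS_REPO = "/opt/airflow/dags"  # working clone of the analyst-owned DAG repo (hypothetical path)

  def sync() -> int:
      # --ff-only refuses to merge, so a diverged or force-pushed branch fails
      # loudly instead of silently rewriting the worker's DAG folder.
      result = subprocess.run(
          ["git", "-C", DAGS_REPO, "pull", "--ff-only"],
          capture_output=True, text=True,
      )
      if result.returncode != 0:
          print(result.stderr, file=sys.stderr)
      return result.returncode

  if __name__ == "__main__":
      sys.exit(sync())

A failed pull simply leaves the last good checkout in place on the worker, so a broken push on the analyst side never propagates a half-updated DAG folder.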

Vault-backed connection URIs over Airflow's connections UI

Airflow's connections UI stores credentials in its metadata database. Vault injection means credentials are fetched at task run time — not persisted anywhere in Airflow, not visible in logs.
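
A rough sketch of what that looks like from a task's point of view, assuming the Nomad task's Vault template renders the URI into an environment variable; the variable name, table, and query are placeholders, and the syntax assumes the Airflow 2.x TaskFlow API.

  import os

  from airflow.decorators import task

  @task
  def hr_employee_count() -> int:
      """Example task: connects with a URI injected from Vault at run time."""
      import psycopg2

      # Rendered into the task environment by the Nomad Vault template just before
      # the worker task starts; never stored in Airflow's metadata DB or in this file.
      uri = os.environ["HR_DB_URI"]  # hypothetical variable name

      with psycopg2.connect(uri) as conn, conn.cursor() as cur:
          cur.execute("SELECT count(*) FROM employees")
          return cur.fetchone()[0]

Airflow also resolves connections from AIRFLOW_CONN_<CONN_ID> environment variables, so the same template injection can back operators that expect a named connection without anything landing in the metadata database.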

Implementation Breakdown

01

Airflow on Nomad

Airflow scheduler, webserver, and Celery workers deployed as separate Nomad services with Vault-injected secrets.

  • Separate Nomad jobs for scheduler, webserver, and Celery workers — independent scaling
  • Vault policy grants Airflow read-only access to connection string secrets
  • Redis deployed as a Nomad service for the Celery broker
  • Airflow metadata PostgreSQL runs on the existing db-01 node

02

ETL DAGs

Incremental ETL DAGs pulling from source databases into a central warehouse schema daily; a condensed DAG sketch follows the list below.

  • Daily incremental loads with full-refresh fallback for schema changes
  • Schema validation step before loading to catch upstream structural changes
  • Alerting on DAG failure via webhook notification
  • Pull-based DAG sync — cron job pulls from analyst repo to worker every 5 minutes
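
A condensed sketch of how one of these DAGs could be shaped, assuming a recent Airflow 2.x TaskFlow API; the source table, column set, staging target, and webhook variable are placeholders rather than the production code.

  import os
  from datetime import datetime

  import requests
  from airflow.decorators import dag, task


  def notify_failure(context):
      """on_failure_callback: post the failed DAG and task ids to an ops webhook."""
      requests.post(
          os.environ["ALERT_WEBHOOK_URL"],  # hypothetical variable
          json={"dag": context["dag"].dag_id, "task": context["task_instance"].task_id},
          timeout=10,
      )


  @dag(
      schedule="@daily",
      start_date=datetime(2025, 1, 1),
      catchup=False,
      default_args={"on_failure_callback": notify_failure},
  )
  def hr_incremental_load():
      @task
      def validate_schema() -> list[str]:
          """Fail fast if upstream columns changed, before anything is loaded."""
          import psycopg2

          expected = {"id", "department", "hired_on", "updated_at"}
          with psycopg2.connect(os.environ["HR_DB_URI"]) as conn, conn.cursor() as cur:
              cur.execute(
                  "SELECT column_name FROM information_schema.columns "
                  "WHERE table_name = 'employees'"
              )
              missing = expected - {row[0] for row in cur.fetchall()}
          if missing:
              raise ValueError(f"Upstream schema drift, missing columns: {missing}")
          return sorted(expected)

      @task
      def incremental_load(columns: list[str]) -> None:
          """Copy only rows touched since the start of this data interval."""
          import psycopg2
          from airflow.operators.python import get_current_context

          since = get_current_context()["data_interval_start"]
          cols = ", ".join(columns)
          marks = ", ".join(["%s"] * len(columns))
          with psycopg2.connect(os.environ["HR_DB_URI"]) as src, \
                  psycopg2.connect(os.environ["WAREHOUSE_DB_URI"]) as dst:
              with src.cursor() as read, dst.cursor() as write:
                  read.execute(
                      f"SELECT {cols} FROM employees WHERE updated_at >= %s", (since,)
                  )
                  for row in read:
                      # A production DAG would upsert changed columns; DO NOTHING keeps the sketch short.
                      write.execute(
                          f"INSERT INTO staging.hr_employees ({cols}) VALUES ({marks}) "
                          "ON CONFLICT (id) DO NOTHING",
                          row,
                      )

      incremental_load(validate_schema())


  hr_incremental_load()

The schema check feeding the load step mirrors the validate-before-load ordering described above, and the callback handles the webhook alerting on failure.
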
03

Backup Pipeline

Nomad periodic jobs run pg_dump on a nightly schedule, encrypt, and ship to Backblaze B2; a sketch of one job's nightly payload follows the list below.

  • One periodic Nomad job per database — independent failure domains, separate logs
  • GPG encryption before upload — backups are encrypted at rest in B2
  • Restore procedure tested against a separate PostgreSQL instance monthly
  • B2 lifecycle rules retain 30 days of daily backups
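
A stripped-down sketch of what a single job's nightly payload might look like; the bucket, GPG recipient, environment variable names, and exact CLI flags are assumptions for illustration, not the production script.

  # Illustrative nightly backup step for one database, run by its Nomad periodic job.
  # Assumes the B2 CLI is already authorized via its application-key environment variables.
  import datetime
  import os
  import subprocess

  DB_NAME = os.environ["BACKUP_DB_NAME"]  # e.g. "records" (hypothetical variable)
  DB_URI = os.environ["BACKUP_DB_URI"]    # connection string injected from Vault
  BUCKET = "school-db-backups"            # hypothetical B2 bucket

  workdir = os.environ.get("NOMAD_ALLOC_DIR", "/tmp")
  stamp = datetime.date.today().isoformat()
  dump_path = f"{workdir}/{DB_NAME}-{stamp}.dump"
  enc_path = dump_path + ".gpg"

  # 1. Dump in pg_dump's custom format (compressed, restorable with pg_restore).
  subprocess.run(["pg_dump", "--format=custom", "--file", dump_path, DB_URI], check=True)

  # 2. Encrypt before anything leaves the node, so the object stored in B2 is ciphertext.
  subprocess.run(
      ["gpg", "--batch", "--yes", "--encrypt",
       "--recipient", "backups@example.edu",  # hypothetical key
       "--output", enc_path, dump_path],
      check=True,
  )

  # 3. Ship to Backblaze B2; the bucket's lifecycle rules handle the 30-day retention.
  subprocess.run(
      ["b2", "upload-file", BUCKET, enc_path, f"{DB_NAME}/{os.path.basename(enc_path)}"],
      check=True,
  )

  os.remove(dump_path)

With check=True, any failing step exits nonzero and fails only that database's allocation, which is the independent-failure-domain behaviour the per-database jobs are meant to give.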

Challenges & Solutions

#1 Airflow DB migration race condition on Nomad cluster restart

The Problem

Airflow runs database migrations on startup. On Nomad, after a cluster restart, the scheduler and webserver can start simultaneously — both attempt the migration and one fails with a lock conflict.

The Fix

Added a dedicated pre-start migration Nomad job that runs to completion before the main Airflow services. Nomad lifecycle prestart hooks ensure the migration finishes before any Airflow process starts.

Results & Impact

Gave the school its first analytics platform and eliminated data loss risk across all production databases.

Before → After

Database Backup Coverage

0% → 100%
Full coverage

Analytics Reporting

Manual, days → Automated dashboards
Live insights

Credentials in DAG code

Plaintext → Vault-only
Zero exposure

Business Outcome

School leadership has live dashboards for enrollment, HR, and service desk operations. Any database can be restored to the previous night within 30 minutes.

Reflections

Would Do Differently

  • 01 Add data quality checks (row counts, null rates) into the ETL pipeline from day one
  • 02 Use Airflow's Vault backend plugin instead of template injection for a cleaner secrets integration

Key Takeaways

  • 01 Running Airflow on Nomad requires careful attention to startup ordering — lifecycle prestart hooks are the right tool, not sleep timers
  • 02 The pull-based DAG sync model was the right call — analysts iterate freely without ever needing cluster access or infrastructure knowledge

Next Project

Multi-Region DR Platform

Solutions Architecture

Thanks for Reading