Data Infrastructure

Airflow ETL & Backup Pipeline

Summary

(6 min read)

A production data platform built on HashiCorp Nomad, running Apache Airflow with Vault-backed connection strings. ETL DAGs ingest from HR, Records, and Service Desk databases into a central warehouse surfaced through Metabase. Automated nightly backups ship to Backblaze B2 — covering every production database.

Project Snapshot

My Role

DevOps & Data Platform Engineer

Duration

6 weeks · 2025

Context

Ghana School of Law

Outcome

Analytics platform live · 100% database backup coverage · zero credentials in DAG code

Stack

Apache Airflow · Nomad · PostgreSQL · Vault · Metabase · Backblaze B2 · Bash · Redis

The Problem

Context

The school had multiple PostgreSQL databases (HR, records, service desk) with no analytics layer and no backup strategy. Business decisions were made without data.

The Pain

No visibility into school operations. Manual reporting took days. Any database server failure would result in permanent, unrecoverable data loss.

Why It Mattered

Student enrollment data, exam records, and HR information with no disaster recovery — a single drive failure away from catastrophic institutional data loss.

Goals & Requirements

Technical Goals

  • Deploy Airflow on the existing Nomad cluster — no new infrastructure
  • ETL from 3 PostgreSQL source systems into a central analytics warehouse
  • Automated nightly backups to off-site cloud storage
  • Zero credentials in DAG code or environment files
  • Analyst team manages DAGs independently from infrastructure changes

Constraints

  • Must run on existing cluster — zero new infrastructure budget
  • Analyst-owned DAG repository must stay isolated from infrastructure concerns

Architecture Design

Airflow deployed as Nomad services (scheduler, webserver, workers) with Vault-backed connection URIs. A pull-based DAG sync model isolates analyst workflows from infrastructure. Backup jobs run as Nomad periodic tasks — one per database, independent failure domains.

Architecture Diagram

Component Breakdown

Apache Airflow

DAG orchestration — schedules and executes ETL and backup tasks

HashiCorp Vault

Supplies database connection strings at task execution time — zero credentials stored in Airflow or DAG files

PostgreSQL Sources

HR, Records Management, and Service Desk source databases feeding the ETL pipeline

Metabase

BI dashboard layer surfacing warehouse data to school leadership and operations staff

Backblaze B2

Off-site backup destination — encrypted nightly dumps from every production database

Key Design Decisions

Pull-based DAG sync from a separate analyst repository

Analysts own the DAG repository independently. A cron job on the Airflow worker pulls from the analyst repo — analysts can update and deploy DAGs without ever needing cluster access.
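
A minimal sketch of the sync step, assuming the analyst repo is already cloned into the worker's DAG folder; the script name and path here are illustrative, not the production values.

  # dag_sync.py: pull-based DAG sync, invoked from cron on the Airflow worker.
  # Scheduled roughly as: */5 * * * * python3 /opt/airflow/dag_sync.py
  import subprocess
  import sys

  DAGS_REPO = "/opt/airflow/dags"  # working clone of the analyst-owned DAG repo (hypothetical path)

  def sync() -> int:
      # --ff-only refuses to merge, so a diverged or force-pushed branch fails
      # loudly instead of silently rewriting the worker's DAG folder.
      result = subprocess.run(
          ["git", "-C", DAGS_REPO, "pull", "--ff-only"],
          capture_output=True, text=True,
      )
      if result.returncode != 0:
          print(result.stderr, file=sys.stderr)
      return result.returncode

  if __name__ == "__main__":
      sys.exit(sync())

A failed pull simply leaves the last good checkout in place on the worker, so a broken push on the analyst side never propagates a half-updated DAG folder.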

Vault-backed connection URIs over Airflow's connections UI

Airflow's connections UI stores credentials in its metadata database. Vault injection means credentials are fetched at task run time — not persisted anywhere in Airflow, not visible in logs.
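
A rough sketch of what that looks like from a task's point of view, assuming the Nomad task's Vault template renders the URI into an environment variable; the variable name, table, and query are placeholders, and the syntax assumes the Airflow 2.x TaskFlow API.

  import os

  from airflow.decorators import task

  @task
  def hr_employee_count() -> int:
      """Example task: connects with a URI injected from Vault at run time."""
      import psycopg2

      # Rendered into the task environment by the Nomad Vault template just before
      # the worker task starts; never stored in Airflow's metadata DB or in this file.
      uri = os.environ["HR_DB_URI"]  # hypothetical variable name

      with psycopg2.connect(uri) as conn, conn.cursor() as cur:
          cur.execute("SELECT count(*) FROM employees")
          return cur.fetchone()[0]

Airflow also resolves connections from AIRFLOW_CONN_<CONN_ID> environment variables, so the same template injection can back operators that expect a named connection without anything landing in the metadata database.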

Implementation Breakdown

01

Airflow on Nomad

Airflow scheduler, webserver, and Celery workers deployed as separate Nomad services with Vault-injected secrets.

  • Separate Nomad jobs for scheduler, webserver, and Celery workers — independent scaling
  • Vault policy grants Airflow read-only access to connection string secrets
  • Redis deployed as a Nomad service for the Celery broker
  • Airflow metadata PostgreSQL runs on the existing db-01 node

02

ETL DAGs

Incremental ETL DAGs pulling from source databases into a central warehouse schema daily; a condensed DAG sketch follows the list below.

  • Daily incremental loads with full-refresh fallback for schema changes
  • Schema validation step before loading to catch upstream structural changes
  • Alerting on DAG failure via webhook notification
  • Pull-based DAG sync — cron job pulls from analyst repo to worker every 5 minutes
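
A condensed sketch of how one of these DAGs could be shaped, assuming a recent Airflow 2.x TaskFlow API; the source table, column set, staging target, and webhook variable are placeholders rather than the production code.

  import os
  from datetime import datetime

  import requests
  from airflow.decorators import dag, task


  def notify_failure(context):
      """on_failure_callback: post the failed DAG and task ids to an ops webhook."""
      requests.post(
          os.environ["ALERT_WEBHOOK_URL"],  # hypothetical variable
          json={"dag": context["dag"].dag_id, "task": context["task_instance"].task_id},
          timeout=10,
      )


  @dag(
      schedule="@daily",
      start_date=datetime(2025, 1, 1),
      catchup=False,
      default_args={"on_failure_callback": notify_failure},
  )
  def hr_incremental_load():
      @task
      def validate_schema() -> list[str]:
          """Fail fast if upstream columns changed, before anything is loaded."""
          import psycopg2

          expected = {"id", "department", "hired_on", "updated_at"}
          with psycopg2.connect(os.environ["HR_DB_URI"]) as conn, conn.cursor() as cur:
              cur.execute(
                  "SELECT column_name FROM information_schema.columns "
                  "WHERE table_name = 'employees'"
              )
              missing = expected - {row[0] for row in cur.fetchall()}
          if missing:
              raise ValueError(f"Upstream schema drift, missing columns: {missing}")
          return sorted(expected)

      @task
      def incremental_load(columns: list[str]) -> None:
          """Copy only rows touched since the start of this data interval."""
          import psycopg2
          from airflow.operators.python import get_current_context

          since = get_current_context()["data_interval_start"]
          cols = ", ".join(columns)
          marks = ", ".join(["%s"] * len(columns))
          with psycopg2.connect(os.environ["HR_DB_URI"]) as src, \
                  psycopg2.connect(os.environ["WAREHOUSE_DB_URI"]) as dst:
              with src.cursor() as read, dst.cursor() as write:
                  read.execute(
                      f"SELECT {cols} FROM employees WHERE updated_at >= %s", (since,)
                  )
                  for row in read:
                      # A production DAG would upsert changed columns; DO NOTHING keeps the sketch short.
                      write.execute(
                          f"INSERT INTO staging.hr_employees ({cols}) VALUES ({marks}) "
                          "ON CONFLICT (id) DO NOTHING",
                          row,
                      )

      incremental_load(validate_schema())


  hr_incremental_load()

The schema check feeding the load step mirrors the validate-before-load ordering described above, and the callback handles the webhook alerting on failure.
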
03

Backup Pipeline

Nomad periodic jobs run pg_dump on a nightly schedule, encrypt, and ship to Backblaze B2; a sketch of one job's nightly payload follows the list below.

  • One periodic Nomad job per database — independent failure domains, separate logs
  • GPG encryption before upload — backups are encrypted at rest in B2
  • Restore procedure tested against a separate PostgreSQL instance monthly
  • B2 lifecycle rules retain 30 days of daily backups
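
A stripped-down sketch of what a single job's nightly payload might look like; the bucket, GPG recipient, environment variable names, and exact CLI flags are assumptions for illustration, not the production script.

  # Illustrative nightly backup step for one database, run by its Nomad periodic job.
  # Assumes the B2 CLI is already authorized via its application-key environment variables.
  import datetime
  import os
  import subprocess

  DB_NAME = os.environ["BACKUP_DB_NAME"]  # e.g. "records" (hypothetical variable)
  DB_URI = os.environ["BACKUP_DB_URI"]    # connection string injected from Vault
  BUCKET = "school-db-backups"            # hypothetical B2 bucket

  workdir = os.environ.get("NOMAD_ALLOC_DIR", "/tmp")
  stamp = datetime.date.today().isoformat()
  dump_path = f"{workdir}/{DB_NAME}-{stamp}.dump"
  enc_path = dump_path + ".gpg"

  # 1. Dump in pg_dump's custom format (compressed, restorable with pg_restore).
  subprocess.run(["pg_dump", "--format=custom", "--file", dump_path, DB_URI], check=True)

  # 2. Encrypt before anything leaves the node, so the object stored in B2 is ciphertext.
  subprocess.run(
      ["gpg", "--batch", "--yes", "--encrypt",
       "--recipient", "backups@example.edu",  # hypothetical key
       "--output", enc_path, dump_path],
      check=True,
  )

  # 3. Ship to Backblaze B2; the bucket's lifecycle rules handle the 30-day retention.
  subprocess.run(
      ["b2", "upload-file", BUCKET, enc_path, f"{DB_NAME}/{os.path.basename(enc_path)}"],
      check=True,
  )

  os.remove(dump_path)

With check=True, any failing step exits nonzero and fails only that database's allocation, which is the independent-failure-domain behaviour the per-database jobs are meant to give.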

Challenges & Solutions

#1 Airflow DB migration race condition on Nomad cluster restart

The Problem

Airflow runs database migrations on startup. On Nomad, after a cluster restart, the scheduler and webserver can start simultaneously — both attempt the migration and one fails with a lock conflict.

The Fix

Added a dedicated pre-start migration Nomad job that runs to completion before the main Airflow services. Nomad lifecycle prestart hooks ensure the migration finishes before any Airflow process starts.

Results & Impact

Gave the school its first analytics platform and eliminated data loss risk across all production databases.

Before → After

Database Backup Coverage

0% → 100%
Full coverage

Analytics Reporting

Manual, days → Automated dashboards
Live insights

Credentials in DAG code

Plaintext → Vault-only
Zero exposure

Business Outcome

School leadership has live dashboards for enrollment, HR, and service desk operations. Any database can be restored to the previous night within 30 minutes.

Reflections

Would Do Differently

  • 01 Add data quality checks (row counts, null rates) into the ETL pipeline from day one
  • 02 Use Airflow's Vault backend plugin instead of template injection for a cleaner secrets integration

Key Takeaways

  • 01 Running Airflow on Nomad requires careful attention to startup ordering — lifecycle prestart hooks are the right tool, not sleep timers
  • 02 The pull-based DAG sync model was the right call — analysts iterate freely without ever needing cluster access or infrastructure knowledge

Next Project

Multi-Region DR Platform

Solutions Architecture

Thanks for Reading