Building an African Economic Data Pipeline with Python, DuckDB & World Bank API

The article describes the AfriData Pipeline, a production-grade ETL system built with Python, DuckDB, and the free World Bank API that extracts and processes economic data for all 54 African countries. The pipeline handles 13,500 data points in under 50 seconds, computes year-over-year changes, and includes a data quality framework that scored 95.8/100. The system exports data to JSON files for a static interactive dashboard and runs automatically via a daily GitHub Actions workflow.

Every data engineer knows the struggle: finding a project that's both technically impressive and genuinely useful. Today I'll walk you through AfriData Pipeline, a production-grade ETL system that extracts economic data for all 54 African countries, loads it into a DuckDB analytical warehouse, and serves an interactive dashboard.

No paid APIs. No cloud services required. Just Python, DuckDB, and free public data.

Why This Project?

Africa's economy is growing fast, but finding clean, consolidated economic data is surprisingly hard. The World Bank has an amazing free API with 16,000+ indicators, but raw API responses need serious engineering to become useful.

This project demonstrates:

- ETL pipeline design with proper error handling and retries
- Dimensional modeling (star schema) in DuckDB
- Data quality engineering: automated checks for completeness, validity, and freshness
- Full-stack delivery: from raw API to interactive dashboard

Architecture Overview

```
World Bank API (v2) → Extract (httpx) → Transform (Python) → Load (DuckDB)
                                                                  ↓
                                            Export (JSON) → Static Dashboard (Vercel)
```

The pipeline processes 13,500 data points (54 countries × 10 indicators × 25 years) in under 50 seconds.
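The Transform stage in the diagram is the step that turns nested API records into flat rows. Here is a minimal sketch of what that flattening could look like. The input field names (`countryiso3code`, `date`, `value`, and the nested `country` and `indicator` objects) follow the World Bank API v2 JSON format. The function names and output column names are assumptions for illustration and may not match the repository, and the sample values are made up.

```python
from typing import Optional


def flatten_record(record: dict) -> Optional[dict]:
    """Turn one raw World Bank observation into a flat row, or None if unusable."""
    country = record.get("country") or {}
    indicator = record.get("indicator") or {}
    try:
        year = int(record.get("date", ""))
    except (TypeError, ValueError):
        return None  # skip anything that is not a plain annual date
    return {
        "country_code": record.get("countryiso3code") or country.get("id"),
        "country_name": country.get("value"),
        "indicator_code": indicator.get("id"),
        "year": year,
        "value": record.get("value"),  # stays None when the data point is missing
    }


def flatten_records(records: list[dict]) -> list[dict]:
    """Flatten a list of observations, dropping rows that could not be parsed."""
    rows = (flatten_record(r) for r in records)
    return [row for row in rows if row is not None]


if __name__ == "__main__":
    # Illustrative input only; the numbers are not real World Bank figures.
    sample = [
        {
            "indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
            "country": {"id": "NG", "value": "Nigeria"},
            "countryiso3code": "NGA",
            "date": "2023",
            "value": 1000.0,
        }
    ]
    print(flatten_records(sample))
```

Missing values are kept as `None` rather than dropped, so a later completeness check can still count them against the expected 13,500 data points.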
The Data: 10 Key Indicators

I selected indicators that tell a comprehensive economic story:

| Indicator | Category | Why It Matters |
|---|---|---|
| GDP (US$) | Economy | Total economic output |
| GDP Growth (%) | Economy | Economic momentum |
| Population | Demographics | Scale context |
| Inflation (CPI) | Economy | Cost of living pressure |
| Unemployment | Labor | Job market health |
| Life Expectancy | Health | Quality of life proxy |
| Internet Users (%) | Technology | Digital readiness |
| Electricity Access (%) | Infrastructure | Development foundation |
| Literacy Rate (%) | Education | Human capital |
| FDI Inflows (% GDP) | Investment | External confidence |

Building the Extract Layer

The World Bank API v2 is beautifully simple: no auth required, JSON responses, and you can batch multiple countries in one request:

```python
import httpx
import time

WB_BASE = "https://api.worldbank.org/v2"
MAX_RETRIES = 3


def extract_indicator(
    client: httpx.Client, indicator_code: str, country_codes: str
) -> list[dict]:
    # country_codes is a semicolon-separated string, e.g. "NGA;KEN;ZAF"
    url = (
        f"{WB_BASE}/country/{country_codes}/indicator/{indicator_code}"
        f"?format=json&date=2000:2024&per_page=10000"
    )
    for attempt in range(MAX_RETRIES):
        try:
            resp = client.get(url, timeout=60)
            resp.raise_for_status()
            data = resp.json()
            # World Bank returns [metadata, records]
            if isinstance(data, list) and len(data) == 2:
                return data[1] or []
        except (httpx.HTTPStatusError, httpx.ReadTimeout):
            # Exponential backoff: 2s, 4s, 8s
            delay = 2 * (2 ** attempt)
            time.sleep(delay)
    # All attempts failed or returned an unexpected payload
    return []
```

Key design decisions:

- Exponential backoff on failures (2s, 4s, 8s)
- Single request per indicator: semicolon-separated country codes let us fetch all 54 countries at once
- 60-second timeout, because some indicators return large payloads
- 0.5s delay between indicators, to respect the free API

The Star Schema

DuckDB is perfect for this: blazing fast analytics, zero configuration, and a single portable file.
```
dim_country ◄──── fact_indicators ────► dim_indicator
     │                                  │
     └────────── dim_date ──────────────┘
```

```python
import duckdb


def create_schema(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS fact_indicators (
            country_key INTEGER,
            indicator_key INTEGER,
            date_key INTEGER,
            value DOUBLE,
            yoy_change DOUBLE,
            extracted_at TIMESTAMP DEFAULT current_timestamp,
            PRIMARY KEY (country_key, indicator_key, date_key)
        )
    """)
    # Plus dim_country (54 rows), dim_indicator (10 rows), dim_date (25 rows)
```

The transform layer also computes year-over-year change for every data point:

```python
def calculate_yoy(current, previous):
    # Percentage change versus the previous year; None when it cannot be computed
    if current is not None and previous is not None and previous != 0:
        return round((current - previous) / abs(previous) * 100, 2)
    return None
```

Data Quality Framework

This is what separates a toy project from a production one. The quality framework scores three dimensions:

1. Completeness: What percentage of expected data points are non-null?
   - Literacy Rate: only 18% complete (data is sparse)
   - Population: 100% complete (every country, every year)
2. Validity: Are values within expected ranges?
   - Life expectancy: 25-95 years ✅
   - GDP: $1M - $10T ✅
   - Inflation: -30% to 10,000% (yes, hyperinflation happens) ✅
3. Freshness: How recent is the latest data?
   - GDP: 2024 ✅
   - Literacy: 2021 ⚠️ (surveys are infrequent)

The final score: 95.8/100, with completeness dragging slightly due to sparse literacy data (expected for survey-based indicators).
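The three dimensions above can be sketched in a few lines of plain Python. This is a rough illustration rather than the repository's implementation: the function names and dictionary keys are assumptions, the ranges are the ones quoted above, and how the three dimensions are weighted into the final score is not covered here.

```python
from typing import Optional

# Ranges quoted in the article; the dictionary keys are hypothetical labels.
VALID_RANGES = {
    "life_expectancy": (25.0, 95.0),
    "gdp_usd": (1e6, 1e13),            # $1M to $10T
    "inflation_pct": (-30.0, 10_000.0),
}


def completeness(values: list[Optional[float]]) -> float:
    """Percentage of expected data points that are non-null."""
    if not values:
        return 0.0
    present = sum(v is not None for v in values)
    return round(100.0 * present / len(values), 1)


def validity(indicator: str, values: list[Optional[float]]) -> float:
    """Percentage of non-null values that fall inside the expected range."""
    low, high = VALID_RANGES[indicator]
    observed = [v for v in values if v is not None]
    if not observed:
        return 100.0  # nothing observed, so nothing can be out of range
    in_range = sum(low <= v <= high for v in observed)
    return round(100.0 * in_range / len(observed), 1)


def freshness_year(rows: list[tuple[int, Optional[float]]]) -> Optional[int]:
    """Most recent year that has a non-null value, or None if there is none."""
    years = [year for year, value in rows if value is not None]
    return max(years) if years else None


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(completeness([61.2, None, 62.0, None]))
    print(validity("life_expectancy", [61.2, 120.0, None]))
    print(freshness_year([(2020, 61.2), (2021, 62.0), (2022, None)]))
```

Keeping each dimension as its own function makes it straightforward to report them separately, which is what lets a sparse indicator like literacy show up as a completeness issue instead of disappearing into a single number.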
Interactive Dashboard

The dashboard is a static site (HTML + Tailwind CSS + Chart.js + Leaflet.js) that loads pre-exported JSON files.

Features:

- 🗺️ Choropleth map: click any African country, toggle between indicators
- 📈 Country comparison: compare up to 6 countries over 25 years
- 🏆 Rankings table: sortable by any indicator
- 🌙 Dark mode: full theme support
- 📱 Responsive: works on mobile

The dashboard reads four JSON files exported by the pipeline:

- country_profiles.json: all data per country (897KB)
- rankings.json: pre-sorted rankings per indicator
- summary_stats.json: aggregate statistics
- quality_report.json: transparency on data quality

Automated Daily Refresh

A GitHub Actions workflow runs the pipeline daily at 6 AM UTC:

```yaml
name: Daily ETL Pipeline
on:
  schedule:
    - cron: '0 6 * * *'
  workflow_dispatch:
jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt
      - run: python -m pipeline.main all
      - run: |
          git config user.name "github-actions[bot]"
          git add dashboard/data/
          git diff --cached --quiet || git commit -m "chore: update data"
          git push
```

Fresh data → committed JSON → Vercel auto-deploys. Zero manual intervention.

Key Takeaways

Free APIs are underrated: the World Bank API has incredible depth. No auth, no rate limits worth worrying about, and 25+ years of history.

DuckDB is a game-changer for small-to-medium analytical workloads. Zero setup, single file, and it handles 13K+ rows with analytical queries in milliseconds.

Data quality isn't optional: even with a trusted source like the World Bank, you'll find missing data, sparse indicators, and surprises. Build quality checks into the pipeline, not as an afterthought.

Static dashboards scale: by pre-computing JSON at ETL time, the dashboard is just a static site. No backend, no database connection, no server costs. Deploy to Vercel for free.
Star schemas still matter: even in a world of data lakes and denormalized tables, dimensional modeling makes your data queryable and understandable.

Try It Yourself

The entire project is open source:

- GitHub: [hajirufai/afridata-pipeline](https://github.com/hajirufai/afridata-pipeline)
- Stack: Python 3.12, httpx, DuckDB, Chart.js, Leaflet.js, Tailwind CSS

```bash
git clone https://github.com/hajirufai/afridata-pipeline.git
cd afridata-pipeline
pip install -r requirements.txt
python -m pipeline.main all
cd dashboard && python -m http.server 8080
```

Data engineering doesn't have to be about massive Spark clusters and cloud bills. Sometimes the best projects start with a free API and a clear question.

What economic indicators would you add? Drop a comment below.