# Building an African Economic Data Pipeline with Python, DuckDB & World Bank API

> Source: <https://dev.to/hajirufai/building-an-african-economic-data-pipeline-with-python-duckdb-world-bank-api-3al0>
> Published: 2026-05-23 12:41:33+00:00

Every data engineer knows the struggle: finding a project that's both technically impressive and genuinely useful. Today I'll walk you through **AfriData Pipeline** — a production-grade ETL system that extracts economic data for all 54 African countries, loads it into a DuckDB analytical warehouse, and serves an interactive dashboard.

No paid APIs. No cloud services required. Just Python, DuckDB, and free public data.

## Why This Project?

Africa's economy is growing fast, but finding clean, consolidated economic data is surprisingly hard. The World Bank has an amazing free API with 16,000+ indicators — but raw API responses need serious engineering to become useful.

This project demonstrates:

- **ETL pipeline design** with proper error handling and retries
- **Dimensional modeling** (star schema) in DuckDB
- **Data quality engineering**: automated checks for completeness, validity, and freshness
- **Full-stack delivery**: from raw API to interactive dashboard

## Architecture Overview

```
World Bank API v2 → Extract (httpx) → Transform (Python) → Load (DuckDB)
                                                               ↓
                                            Export JSON → Static Dashboard (Vercel)
```

The pipeline processes **13,500 data points** (54 countries × 10 indicators × 25 years) in under 50 seconds.

## The Data: 10 Key Indicators

I selected indicators that tell a comprehensive economic story:

| Indicator | Category | Why It Matters |
|---|---|---|
| GDP (US$) | Economy | Total economic output |
| GDP Growth (%) | Economy | Economic momentum |
| Population | Demographics | Scale context |
| Inflation (CPI) | Economy | Cost of living pressure |
| Unemployment | Labor | Job market health |
| Life Expectancy | Health | Quality of life proxy |
| Internet Users (%) | Technology | Digital readiness |
| Electricity Access (%) | Infrastructure | Development foundation |
| Literacy Rate (%) | Education | Human capital |
| FDI Inflows (% GDP) | Investment | External confidence |

## Building the Extract Layer

The World Bank API v2 is beautifully simple — no auth required, JSON responses, and you can batch multiple countries in one request:

``` python
import time

import httpx

WB_BASE = "https://api.worldbank.org/v2"
MAX_RETRIES = 3

def extract_indicator(client: httpx.Client, indicator_code: str,
                      country_codes: str) -> list[dict]:
    """Fetch one indicator for semicolon-separated country codes (2000-2024)."""
    url = (f"{WB_BASE}/country/{country_codes}/indicator/{indicator_code}"
           f"?format=json&date=2000:2024&per_page=10000")

    for attempt in range(MAX_RETRIES):
        try:
            resp = client.get(url, timeout=60)
            resp.raise_for_status()
            data = resp.json()
            # World Bank returns [metadata, records]; records is null when empty
            if isinstance(data, list) and len(data) == 2:
                return data[1] or []
            # Any other shape (e.g. an API error message) is unlikely to
            # improve on an immediate retry, so stop here
            return []
        except (httpx.HTTPStatusError, httpx.TimeoutException):
            # Exponential backoff: 2s, 4s, 8s
            time.sleep(2 * (2 ** attempt))
    return []
```

Key design decisions:

- **Exponential backoff** on failures (2s, 4s, 8s)
- **Single request per indicator**: semicolon-separated country codes let us fetch all 54 countries at once
- **60-second timeout**: some indicators return large payloads
- **0.5s delay between indicators**: respect the free API
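The last two decisions live in the code that calls the extractor rather than in the extractor itself. Here is a minimal sketch of such a driver loop. The `run_extraction` helper and its injected `fetch` callable are hypothetical names introduced for illustration (not the project's actual API), chosen so the pacing logic can be tried without network access:

```python
import time
from typing import Callable

def run_extraction(fetch: Callable[[str, str], list],
                   indicator_codes: list,
                   country_codes: list,
                   delay: float = 0.5) -> dict:
    """Fetch every indicator once, pausing between requests.

    `fetch(indicator_code, joined_countries)` stands in for the real
    extractor; it is an assumption of this sketch.
    """
    # The World Bank API accepts several country codes separated by ';'
    joined = ";".join(country_codes)
    results = {}
    for i, code in enumerate(indicator_codes):
        if i > 0:
            # Pause between indicators, but not before the first one
            time.sleep(delay)
        results[code] = fetch(code, joined)
    return results
```

In the real pipeline, `fetch` could be something like `lambda ind, cc: extract_indicator(client, ind, cc)`, with indicator codes such as `NY.GDP.MKTP.CD` (GDP, current US$) or `SP.POP.TOTL` (total population).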

## The Star Schema

DuckDB is perfect for this: blazing fast analytics, zero configuration, and a single portable file.

```
dim_country ◄──── fact_indicators ────► dim_indicator
     │                  │
     └────────── dim_date ──────────────┘
```

``` python
import duckdb

def create_schema(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS fact_indicators (
            country_key  INTEGER,
            indicator_key INTEGER,
            date_key     INTEGER,
            value        DOUBLE,
            yoy_change   DOUBLE,
            extracted_at TIMESTAMP DEFAULT current_timestamp,
            PRIMARY KEY (country_key, indicator_key, date_key)
        )
    """)
    # Plus dim_country (54 rows), dim_indicator (10 rows), dim_date (25 rows)
```

The transform layer also computes **year-over-year change** for each data point that has a non-null, non-zero prior value; otherwise the change is left as null:

``` python
def calculate_yoy(current, previous):
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None
```
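To show how this helper might be applied to raw API records, here is a small sketch. It relies on the `countryiso3code`, `date`, and `value` fields of World Bank records. The `to_fact_rows` function and its output keys are hypothetical, and comparing only against the immediately preceding calendar year is an assumption of this sketch rather than confirmed pipeline behavior:

```python
from collections import defaultdict

def calculate_yoy(current, previous):
    # Same helper as above, repeated so the sketch is self-contained
    if current is not None and previous is not None and previous != 0:
        return round(((current - previous) / abs(previous)) * 100, 2)
    return None

def to_fact_rows(records: list) -> list:
    """Flatten World Bank records and attach a YoY change per country."""
    by_country = defaultdict(dict)
    for rec in records:
        # "date" arrives as a string year such as "2023"
        by_country[rec["countryiso3code"]][int(rec["date"])] = rec["value"]

    rows = []
    for iso3, series in by_country.items():
        for year in sorted(series):
            rows.append({
                "iso3": iso3,
                "year": year,
                "value": series[year],
                # Missing or null prior year yields None instead of a gap-spanning change
                "yoy_change": calculate_yoy(series[year], series.get(year - 1)),
            })
    return rows
```

With this choice, a null year breaks the chain: the year after a gap gets no `yoy_change` instead of a misleading multi-year jump.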

## Data Quality Framework

This is what separates a toy project from a production one. The quality framework scores three dimensions:

**1. Completeness** — What percentage of expected data points are non-null?

```
Literacy Rate: only 18% complete (data is sparse)
Population: 100% complete (every country, every year)
```

**2. Validity** — Are values within expected ranges?

```
Life expectancy: 25-95 years ✅
GDP: $1M - $10T ✅
Inflation: -30% to 10,000% (yes, hyperinflation happens) ✅
```

**3. Freshness** — How recent is the latest data?

```
GDP: 2024 ✅
Literacy: 2021 ⚠️ (surveys are infrequent)
```

The final score: **95.8/100** — with completeness dragging slightly due to sparse literacy data (expected for survey-based indicators).
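The three dimensions can each be expressed as a small function. The sketch below is one possible reading of them, not the project's actual scoring code: the function names are hypothetical, and how the pipeline weights the dimensions into a single score is not specified here, so no combined score is computed.

```python
from typing import Optional

def completeness(values: list) -> float:
    """Percent of expected data points that are non-null."""
    if not values:
        return 0.0
    return round(100 * sum(v is not None for v in values) / len(values), 1)

def validity(values: list, low: float, high: float) -> float:
    """Percent of non-null values that fall inside [low, high]."""
    present = [v for v in values if v is not None]
    if not present:
        return 100.0  # nothing present, so nothing can be out of range
    return round(100 * sum(low <= v <= high for v in present) / len(present), 1)

def latest_year(series: dict) -> Optional[int]:
    """Most recent year that has a non-null value (freshness)."""
    years = [year for year, value in series.items() if value is not None]
    return max(years) if years else None
```

For example, `validity(values, 25, 95)` applies the life-expectancy range listed above, and `latest_year` returns the year that would be flagged if it lags too far behind the current one.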

## Interactive Dashboard

The dashboard is a static site (HTML + Tailwind CSS + Chart.js + Leaflet.js) that loads pre-exported JSON files:

**Features:**

- 🗺️ **Choropleth map**: click any African country, toggle between indicators
- 📈 **Country comparison**: compare up to 6 countries over 25 years
- 🏆 **Rankings table**: sortable by any indicator
- 🌙 **Dark mode**: full theme support
- 📱 **Responsive**: works on mobile

The dashboard reads four JSON files exported by the pipeline:

- `country_profiles.json`: all data per country (897KB)
- `rankings.json`: pre-sorted rankings per indicator
- `summary_stats.json`: aggregate statistics
- `quality_report.json`: transparency on data quality
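As an illustration of the pre-computation idea, here is a sketch of how a rankings file could be built and written at ETL time. The JSON layout (`rank`, `country`, `value`) and both function names are assumptions made for this example; the real files may be structured differently:

```python
import json
from pathlib import Path

def build_rankings(latest: dict) -> dict:
    """`latest` maps indicator -> {country: value}; countries are sorted high to low."""
    rankings = {}
    for indicator, by_country in latest.items():
        ordered = sorted(by_country.items(), key=lambda kv: kv[1], reverse=True)
        rankings[indicator] = [
            {"rank": n, "country": country, "value": value}
            for n, (country, value) in enumerate(ordered, start=1)
        ]
    return rankings

def export_json(payload: dict, path: Path) -> None:
    """Write the payload as UTF-8 JSON, creating the target directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Because the sorting happens once in the pipeline, the browser only has to fetch the file and render it.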

## Automated Daily Refresh

A GitHub Actions workflow runs the pipeline daily at 6 AM UTC:

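``` yaml
name: Daily ETL Pipeline
on:
  schedule:
    - cron: '0 6 * * *'    # every day at 06:00 UTC
  workflow_dispatch:        # allow manual runs

jobs:
  etl:
    runs-on: ubuntu-latest
    permissions:
      contents: write       # needed for the push step if the default token is read-only
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt
      - run: python -m pipeline.main all
      - run: |
          git config user.name "github-actions[bot]"
          # git needs an email as well as a name to create the commit
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git add dashboard/data/
          git diff --cached --quiet || git commit -m "chore: update data"
          git push
```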

Fresh data → committed JSON → Vercel auto-deploys. Zero manual intervention.

## Key Takeaways

- **Free APIs are underrated.** The World Bank API has incredible depth: no auth, no rate limits worth worrying about, and 25+ years of history.
- **DuckDB is a game-changer** for small-to-medium analytical workloads. Zero setup, single file, and it handles 13K+ rows with analytical queries in milliseconds.
- **Data quality isn't optional.** Even with a trusted source like the World Bank, you'll find missing data, sparse indicators, and surprises. Build quality checks into the pipeline, not as an afterthought.
- **Static dashboards scale.** By pre-computing JSON at ETL time, the dashboard is just a static site. No backend, no database connection, no server costs. Deploy to Vercel for free.
- **Star schemas still matter.** Even in a world of data lakes and denormalized tables, dimensional modeling makes your data *queryable* and *understandable*.

## Try It Yourself

The entire project is open source:

- **GitHub:** [hajirufai/afridata-pipeline](https://github.com/hajirufai/afridata-pipeline)
- **Stack:** Python 3.12, httpx, DuckDB, Chart.js, Leaflet.js, Tailwind CSS

``` bash
git clone https://github.com/hajirufai/afridata-pipeline.git
cd afridata-pipeline
pip install -r requirements.txt
python -m pipeline.main all
cd dashboard && python -m http.server 8080
```

Data engineering doesn't have to be about massive Spark clusters and cloud bills. Sometimes the best projects start with a free API and a clear question.

*What economic indicators would you add? Drop a comment below!*