In the previous article, I extended a small Python data quality ETL starter with AI-ready data preparation.
The important constraint was that the workflow did not call an LLM API, generate embeddings, or train a model. It prepared structured data assets such as schema profiles, data dictionaries, validation summaries, feature-ready CSV files, and manifest files.
Previous article:
Preparing AI-Ready Data Without Calling an LLM API
This follow-up focuses on the v0.7.0 update of the same project:
Data Quality ETL Starter on GitHub
The new goal is to move beyond synthetic demo data and show that the same data quality workflow can process a public retail/e-commerce-style dataset locally.
This is still not a big data platform, a production retail analytics system, a benchmark leaderboard, or a public dataset redistribution repository.
The goal is narrower and more practical:
manually downloaded public retail dataset
↓
prepare_real_dataset_demo.py
↓
normalized retail transaction CSV
↓
existing CLI validation and cleaning workflow
↓
quality reports + SQLite export
↓
run_real_dataset_benchmark.py
↓
benchmark report + summary CSV outputs
That is a useful next step for a portfolio project because it shows the workflow can handle a more realistic dataset while still keeping data handling, scope, and reproducibility clear.
Earlier versions of this project used small sample files and generated synthetic order data.
That is useful for testing and documentation, but it leaves one practical question:
Can the workflow handle a public dataset that was not designed specifically for this repository?
v0.7.0 adds an optional real dataset benchmark path to answer that question.
The workflow now demonstrates how to:
The key design choice is that the existing CLI remains the source of truth.
The real dataset path does not become a separate pipeline. It prepares the source data, then passes it through the same validation and cleaning workflow used by the rest of the project.
The default v0.7.0 dataset is the UCI Online Retail dataset.
Official source:
UCI Machine Learning Repository: Online Retail
Citation:
Chen, D. (2015). Online Retail [Dataset]. UCI Machine Learning Repository.
https://doi.org/10.24432/C5BW33
License note:
Creative Commons Attribution 4.0 International (CC BY 4.0)
The dataset is useful for this project because it is retail/e-commerce adjacent and transaction-shaped. It includes fields that map naturally into an invoice/order-style workflow:
InvoiceNo
StockCode
Description
Quantity
InvoiceDate
UnitPrice
CustomerID
Country
The project maps those source columns into normalized snake_case columns and adds derived fields.
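As a rough illustration of that mapping, here is a minimal sketch using only the standard library. The column names come from the lists in this article; the COLUMN_MAP dictionary, the rename_columns helper, and the sample row are hypothetical and are not taken from the project's real_dataset.py.

```python
# Source-to-normalized column names as listed in the article.
# The dict and helper below are illustrative, not the project's actual code.
COLUMN_MAP = {
    "InvoiceNo": "invoice_no",
    "StockCode": "stock_code",
    "Description": "description",
    "Quantity": "quantity",
    "InvoiceDate": "invoice_date",
    "UnitPrice": "unit_price",
    "CustomerID": "customer_id",
    "Country": "country",
}


def rename_columns(row: dict) -> dict:
    """Return a copy of one source row with snake_case keys; unknown keys are dropped."""
    return {COLUMN_MAP[name]: value for name, value in row.items() if name in COLUMN_MAP}


# Made-up sample row, used only to show the shape of the result.
sample = {"InvoiceNo": "A0001", "StockCode": "SKU-1", "Quantity": 6, "Country": "France"}
normalized = rename_columns(sample)
```

Keeping the mapping in one dictionary makes it easy to review next to the schema file.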
This part is important.
The repository does not redistribute the full raw UCI dataset. It also does not commit full normalized or cleaned real dataset outputs.
These paths are local-only:
data/external/
data/raw/public/
data/output/real_dataset/
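The repository keeps the scripts, the helper module, the expected schema, the documentation, and the tests.
It does not keep the full raw UCI dataset or the full normalized and cleaned real dataset outputs.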
This keeps the repository lightweight and avoids turning it into a dataset mirror.
The most relevant new files are:
scripts/prepare_real_dataset_demo.py
scripts/run_real_dataset_benchmark.py
src/dq_etl_starter/real_dataset.py
docs/data_sources.md
docs/real_dataset_benchmark.md
docs/limitations.md
data/expected/online_retail_schema.json
The real dataset helper module handles the project-specific mapping and summary logic.
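The two scripts provide a simple local workflow: prepare_real_dataset_demo.py writes the normalized CSV, and run_real_dataset_benchmark.py writes the benchmark report and the summary CSV files.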
The project now has a clearer path from messy input files to public-dataset benchmark evidence:
data-quality-etl-starter/
├── data/
│   ├── expected/
│   │   └── online_retail_schema.json
│   └── output/
├── docs/
│   ├── data_sources.md
│   ├── limitations.md
│   └── real_dataset_benchmark.md
├── screenshots/
├── scripts/
│   ├── prepare_real_dataset_demo.py
│   └── run_real_dataset_benchmark.py
├── src/dq_etl_starter/
│   ├── real_dataset.py
│   ├── cli.py
│   ├── clean.py
│   ├── report.py
│   └── validate.py
└── tests/
    ├── test_real_dataset.py
    └── test_real_dataset_benchmark.py
The real dataset path is optional. The default small sample workflows remain unchanged.
Clone the repository:
git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter
Create a virtual environment:
python -m venv .venv
Activate it on macOS or Linux:
source .venv/bin/activate
Activate it on Windows PowerShell:
.venv\Scripts\activate
Install dependencies and the local package:
pip install -r requirements.txt
pip install -e .
The editable install step is useful because the project uses a src/ layout.
Download the UCI Online Retail dataset from the official UCI Machine Learning Repository page.
Place the file here:
data/external/online_retail.xlsx
The project does not automatically download the dataset by default.
That is intentional.
For a public portfolio repository, I prefer to keep the data acquisition step explicit. It makes the source, license, citation, and local-only handling policy easier to review.
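One way to keep a manual download step friendly is to fail early with a clear hint when the file is missing. The sketch below shows that idea; the path is the one used in this article, while the missing_file_hint helper is hypothetical and may not match how the project's script reports the problem.

```python
from pathlib import Path
from typing import Optional

# Path from the article; everything else here is an illustrative assumption.
RAW_INPUT = Path("data/external/online_retail.xlsx")


def missing_file_hint(path: Path) -> Optional[str]:
    """Return a readable hint if the local file is absent, otherwise None."""
    if path.is_file():
        return None
    return (
        f"{path} not found. Download the dataset manually from the official "
        "UCI Machine Learning Repository page and place it at this path."
    )


hint = missing_file_hint(RAW_INPUT)
if hint is not None:
    print(hint)
```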
Run the preparation script.
macOS / Linux:
python scripts/prepare_real_dataset_demo.py \
--raw-input data/external/online_retail.xlsx \
--output data/output/real_dataset/online_retail_normalized.csv
Windows PowerShell:
python scripts/prepare_real_dataset_demo.py `
--raw-input data/external/online_retail.xlsx `
--output data/output/real_dataset/online_retail_normalized.csv
This step reads the local source file, validates expected source columns, maps UCI columns into project-friendly names, derives additional fields, and writes a normalized CSV.
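The column validation part of this step can be sketched as a simple header check. The expected names are the UCI source columns listed earlier; the missing_source_columns function is a hypothetical helper, not the project's actual implementation.

```python
# UCI source columns as listed in the article.
EXPECTED_SOURCE_COLUMNS = [
    "InvoiceNo",
    "StockCode",
    "Description",
    "Quantity",
    "InvoiceDate",
    "UnitPrice",
    "CustomerID",
    "Country",
]


def missing_source_columns(header: list) -> list:
    """Return expected columns that are absent from the file header, in expected order."""
    present = {str(name).strip() for name in header}
    return [name for name in EXPECTED_SOURCE_COLUMNS if name not in present]


# Made-up header with two columns missing, to show the result shape.
problems = missing_source_columns(
    ["InvoiceNo", "StockCode", "Description", "Quantity", "InvoiceDate", "UnitPrice"]
)
```

Checking the header before any transformation means a wrong or renamed file fails fast instead of producing a half-valid normalized CSV.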
The normalized output columns are:
invoice_no
stock_code
description
quantity
invoice_date
unit_price
customer_id
country
revenue
is_cancellation
source_dataset
The derived fields are simple but useful:
revenue is derived from quantity and unit price;
is_cancellation marks cancellation-style rows;
source_dataset records dataset lineage.
This preparation layer is deliberately small. It does not try to perform all business logic. It only converts the external dataset into a shape that the existing project workflow can validate and clean.
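The three derived fields can be sketched in a few lines. This version makes two assumptions that the project may handle differently: cancellation rows are detected by an invoice number starting with "C" (the convention described for the UCI dataset), and the lineage value reuses the uci_online_retail name from the benchmark command. The add_derived_fields helper is hypothetical.

```python
from decimal import Decimal

# Assumption: lineage label reuses the dataset name from the benchmark command.
DATASET_NAME = "uci_online_retail"


def add_derived_fields(row: dict) -> dict:
    """Return a copy of a normalized row with revenue, is_cancellation, source_dataset."""
    quantity = int(row["quantity"])
    # Decimal avoids binary floating-point drift when multiplying prices.
    unit_price = Decimal(str(row["unit_price"]))
    out = dict(row)
    out["revenue"] = quantity * unit_price
    # Assumption: invoice numbers starting with "C" mark cancellations.
    out["is_cancellation"] = str(row["invoice_no"]).upper().startswith("C")
    out["source_dataset"] = DATASET_NAME
    return out


# Made-up rows: one regular sale and one cancellation-style row.
sale = add_derived_fields({"invoice_no": "A0001", "quantity": "6", "unit_price": "2.55"})
cancel = add_derived_fields({"invoice_no": "C0002", "quantity": "-1", "unit_price": "27.50"})
```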
After preparation, the normalized CSV is passed into the existing CLI workflow.
macOS / Linux:
python -m dq_etl_starter.cli run \
--input data/output/real_dataset/online_retail_normalized.csv \
--input-type csv \
--schema data/expected/online_retail_schema.json \
--output-dir data/output/real_dataset/run \
--db-target sqlite \
--table-name cleaned_online_retail
Windows PowerShell:
python -m dq_etl_starter.cli run `
--input data/output/real_dataset/online_retail_normalized.csv `
--input-type csv `
--schema data/expected/online_retail_schema.json `
--output-dir data/output/real_dataset/run `
--db-target sqlite `
--table-name cleaned_online_retail
Expected local outputs:
data/output/real_dataset/run/cleaned_online_retail.csv
data/output/real_dataset/run/etl_output.sqlite
data/output/real_dataset/run/quality_report.md
data/output/real_dataset/run/quality_report.json
This is the most important design point in v0.7.0.
The real dataset path reuses the existing validation and cleaning workflow. It does not create a special one-off script that bypasses the project architecture.
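After the run, the SQLite export can be inspected with the standard library alone. The sketch below counts rows in a table; it runs against an in-memory database so it works without the real export. The count_rows helper and the two-column demo table are assumptions, and the real cleaned_online_retail table will have the full set of normalized columns.

```python
import sqlite3
from contextlib import closing


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Return the number of rows in one table."""
    # Table names cannot be bound as SQL parameters, so accept only plain identifiers.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return count


# Demo on an in-memory database with made-up rows.
# For the real export, connect to data/output/real_dataset/run/etl_output.sqlite instead.
with closing(sqlite3.connect(":memory:")) as demo:
    demo.execute("CREATE TABLE cleaned_online_retail (invoice_no TEXT, revenue REAL)")
    demo.executemany(
        "INSERT INTO cleaned_online_retail VALUES (?, ?)",
        [("A0001", 10.0), ("A0002", 5.5)],
    )
    demo_count = count_rows(demo, "cleaned_online_retail")
```

Comparing this count with the row count in the quality report is a quick sanity check that the export and the report describe the same run.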
The schema file is:
data/expected/online_retail_schema.json
It defines the expected normalized columns and validation rules for fields such as invoice number, stock code, quantity, invoice date, unit price, customer ID, country, revenue, cancellation flag, and source dataset.
The schema is not intended to certify the dataset as business-ready.
It is a practical contract for this starter workflow:
external retail columns
β
normalized project columns
β
expected schema rules
β
quality report
That is a useful handoff pattern because the next person can inspect both the mapping and the validation report.
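To make the idea of a schema as a contract more concrete, here is a minimal sketch of rule-based row validation. The rule format below is hypothetical: the real online_retail_schema.json may use different keys and richer rules, and the project's validate.py is the actual source of truth.

```python
# Hypothetical rule format for illustration; the real schema file may differ.
SCHEMA = {
    "invoice_no": {"type": str, "required": True},
    "quantity": {"type": int, "required": True},
    "unit_price": {"type": float, "required": True},
    "customer_id": {"type": str, "required": False},
}


def validate_row(row: dict, schema: dict) -> list:
    """Return a list of human-readable issues for one row (empty list means no issues)."""
    issues = []
    for column, rule in schema.items():
        value = row.get(column)
        if value is None or value == "":
            if rule["required"]:
                issues.append(f"{column}: missing required value")
            continue
        try:
            # A value passes if it can be parsed as the expected type.
            rule["type"](value)
        except (TypeError, ValueError):
            issues.append(f"{column}: cannot parse {value!r} as {rule['type'].__name__}")
    return issues


# Made-up rows: one clean, one with two problems.
ok_issues = validate_row(
    {"invoice_no": "A0001", "quantity": "6", "unit_price": "2.55", "customer_id": ""}, SCHEMA
)
bad_issues = validate_row({"invoice_no": "", "quantity": "six", "unit_price": "2.55"}, SCHEMA)
```

Returning issues instead of raising on the first failure is what lets a quality report summarize everything that was detected.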
The CLI workflow writes a Markdown report and a JSON report.
For the real dataset workflow, the Markdown report is written to:
data/output/real_dataset/run/quality_report.md
The report is useful because it records what the workflow found rather than only producing a cleaned file.
Typical report sections include:
For client-style work, this is important. A cleaned output file alone is not enough. The workflow should also explain what was detected and what still needs review.
After the CLI workflow finishes, generate a local benchmark report and summary outputs.
macOS / Linux:
python scripts/run_real_dataset_benchmark.py \
--normalized-input data/output/real_dataset/online_retail_normalized.csv \
--quality-report data/output/real_dataset/run/quality_report.json \
--output-dir data/output/real_dataset \
--dataset-name uci_online_retail
Windows PowerShell:
python scripts/run_real_dataset_benchmark.py `
--normalized-input data/output/real_dataset/online_retail_normalized.csv `
--quality-report data/output/real_dataset/run/quality_report.json `
--output-dir data/output/real_dataset `
--dataset-name uci_online_retail
Expected local outputs:
data/output/real_dataset/benchmark_report.md
data/output/real_dataset/summary/revenue_by_country.csv
data/output/real_dataset/summary/revenue_by_month.csv
data/output/real_dataset/summary/cancellation_summary.csv
data/output/real_dataset/summary/missing_customer_summary.csv
The benchmark report is not a universal performance claim.
It is local evidence for this machine, this dependency environment, and this dataset preparation flow.
That distinction matters. Runtime can change depending on CPU, disk speed, Python version, package versions, source file format, operating system, and local machine conditions.
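One way to keep a runtime number honest is to record it together with the environment it came from. The sketch below shows that pattern; the timed_run helper and the field names are assumptions and may not match what the project's benchmark report actually records.

```python
import platform
import time


def timed_run(func, *args, **kwargs):
    """Run func and return (result, context) where context describes this local run."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    context = {
        "elapsed_seconds": round(elapsed, 4),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    return result, context


# Stand-in workload; in practice this would be the preparation or CLI step.
total, run_context = timed_run(sum, range(1000))
```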
The benchmark script also writes lightweight summary CSV files.
The summary outputs are intentionally simple:
revenue_by_country.csv
revenue_by_month.csv
cancellation_summary.csv
missing_customer_summary.csv
They are not a full BI model.
They are small reporting-ready outputs that show how a cleaned retail transaction dataset can be summarized after validation.
For example:
revenue_by_country.csv supports country-level revenue inspection;
revenue_by_month.csv supports monthly trend inspection;
cancellation_summary.csv records cancellation and non-positive row counters;
missing_customer_summary.csv helps inspect where customer IDs are missing.
This is often enough for a first data workflow milestone.
The next version could load these into PostgreSQL, query them in DuckDB, or feed a local dashboard, but v0.7.0 intentionally keeps the real dataset path focused.
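The country and month summaries boil down to a group-and-sum over the cleaned rows. Here is a minimal standard-library sketch of that aggregation. It assumes invoice_date is an ISO-style string so the first seven characters give the year and month; the revenue_by helper and the sample rows are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def revenue_by(rows, key_func):
    """Sum revenue per group key and return a dict sorted by key."""
    totals = defaultdict(Decimal)  # Decimal() starts each group at 0
    for row in rows:
        totals[key_func(row)] += Decimal(str(row["revenue"]))
    return dict(sorted(totals.items()))


# Made-up cleaned rows, only to show the shape of the summaries.
rows = [
    {"country": "France", "invoice_date": "2011-01-05 10:00:00", "revenue": "10.00"},
    {"country": "France", "invoice_date": "2011-02-01 09:30:00", "revenue": "5.50"},
    {"country": "Germany", "invoice_date": "2011-01-20 14:15:00", "revenue": "-2.00"},
]

by_country = revenue_by(rows, lambda r: r["country"])
# Assumption: ISO-style timestamps, so the "YYYY-MM" prefix identifies the month.
by_month = revenue_by(rows, lambda r: r["invoice_date"][:7])
```

Each resulting dictionary maps directly onto a two-column CSV, which is all the summary files need to be.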
The benchmark report is designed to answer practical questions:
That makes the run easier to review later.
It also makes the project stronger as a portfolio asset because the workflow is not only described in prose. It leaves behind files, screenshots, reports, and commands that can be inspected.
The project could theoretically download the dataset automatically.
For this version, I chose not to do that.
Manual download keeps the workflow clearer:
For a small portfolio repository, this is a reasonable trade-off.
The project demonstrates how to process the dataset, not how to become a dataset distribution tool.
Run the compile checks and the full test suite:
python -m compileall -q src/dq_etl_starter
python -m compileall -q scripts
pytest
Run the v0.7-related tests:
pytest tests/test_real_dataset.py
pytest tests/test_real_dataset_benchmark.py
The tests focus on the reusable code paths rather than requiring the full external dataset to be committed.
That is another useful pattern for public repositories: test the transformation logic with small fixtures, and keep the large external dataset local-only.
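A fixture-based test in that spirit might look like the sketch below. The normalize_row function is a hypothetical stand-in for the project's transformation logic, and the fixture rows are made up; the actual tests in tests/test_real_dataset.py may be structured differently.

```python
def normalize_row(row: dict) -> dict:
    """Hypothetical stand-in for the project's transformation logic."""
    return {
        "invoice_no": str(row["InvoiceNo"]).strip(),
        "quantity": int(row["Quantity"]),
        "unit_price": float(row["UnitPrice"]),
    }


# A tiny in-memory fixture replaces the full external dataset.
SMALL_FIXTURE = [
    {"InvoiceNo": " A0001 ", "Quantity": "2", "UnitPrice": "1.50"},
    {"InvoiceNo": "C0002", "Quantity": "-1", "UnitPrice": "3.00"},
]


def test_normalize_row_on_small_fixture():
    normalized = [normalize_row(row) for row in SMALL_FIXTURE]
    assert normalized[0] == {"invoice_no": "A0001", "quantity": 2, "unit_price": 1.5}
    assert normalized[1]["quantity"] == -1
```

pytest collects the test_ function by name, so a file like this needs no dataset download to run in CI.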
The v0.7.0 real dataset benchmark does not add:
This project is still a small Python data workflow starter.
The v0.7.0 update proves a specific point: the workflow can be applied to a public retail transaction dataset locally, while keeping the data handling policy, validation steps, outputs, and limitations clear.
Possible next improvements include:
The main constraint remains the same:
Keep the project small, reproducible, inspectable, and easy to adapt.
GitHub repository:
https://github.com/OnerGit/data-quality-etl-starter
Previous article:
Preparing AI-Ready Data Without Calling an LLM API
This v0.7.0 update is a practical next step: from synthetic and generated demos to a local public retail dataset benchmark that reuses the same validation, cleaning, reporting, and handoff workflow.