Running a Real Retail Dataset Through a Python Data Quality Workflow

The Data Quality ETL Starter project v0.7.0 now processes the UCI Online Retail dataset, demonstrating its ability to handle real-world data. The workflow prepares the public dataset locally, runs it through the existing CLI validation and cleaning pipeline, and generates quality reports and benchmark summaries without redistributing the raw data.

In the previous article, I extended a small Python data quality ETL starter with AI-ready data preparation. The important constraint was that the workflow did not call an LLM API, generate embeddings, or train a model. It prepared structured data assets such as schema profiles, data dictionaries, validation summaries, feature-ready CSV files, and manifest files.

Previous article: Preparing AI-Ready Data Without Calling an LLM API
https://dev.to/bob_oner/preparing-ai-ready-data-without-calling-an-llm-api-5daf

This follow-up focuses on the v0.7.0 update of the same project: Data Quality ETL Starter on GitHub
https://github.com/OnerGit/data-quality-etl-starter

The new goal is to move beyond synthetic demo data and show that the same data quality workflow can process a public retail/e-commerce-style dataset locally. This is still not a big data platform, a production retail analytics system, a benchmark leaderboard, or a public dataset redistribution repository. The goal is narrower and more practical:

    manually downloaded public retail dataset
        ↓
    prepare_real_dataset_demo.py
        ↓
    normalized retail transaction CSV
        ↓
    existing CLI validation and cleaning workflow
        ↓
    quality reports + SQLite export
        ↓
    run_real_dataset_benchmark.py
        ↓
    benchmark report + summary CSV outputs

That is a useful next step for a portfolio project because it shows the workflow can handle a more realistic dataset while still keeping data handling, scope, and reproducibility clear.

Earlier versions of this project used small sample files and generated synthetic order data. That is useful for testing and documentation, but it leaves one practical question: can the workflow handle a public dataset that was not designed specifically for this repository? v0.7.0 adds an optional real dataset benchmark path to answer that question.

The workflow now demonstrates how to:

The key design choice is that the existing CLI remains the source of truth. The real dataset path does not become a separate pipeline.
It prepares the source data, then passes it through the same validation and cleaning workflow used by the rest of the project.

The default v0.7.0 dataset is the UCI Online Retail dataset.

Official source: UCI Machine Learning Repository: Online Retail
https://archive.ics.uci.edu/dataset/352/online%2Bretail

Citation: Chen, D. (2015). Online Retail [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5BW33

License note: Creative Commons Attribution 4.0 International (CC BY 4.0)

The dataset is useful for this project because it is retail/e-commerce adjacent and transaction-shaped. It includes fields that map naturally into an invoice/order-style workflow:

    InvoiceNo
    StockCode
    Description
    Quantity
    InvoiceDate
    UnitPrice
    CustomerID
    Country

The project maps those source columns into normalized snake_case columns and adds derived fields.

This part is important. The repository does not redistribute the full raw UCI dataset. It also does not commit full normalized or cleaned real dataset outputs. These paths are local-only:

    data/external/
    data/raw/public/
    data/output/real_dataset/

The repository keeps:

It does not keep:

This keeps the repository lightweight and avoids turning it into a dataset mirror.

The most relevant new files are:

    scripts/prepare_real_dataset_demo.py
    scripts/run_real_dataset_benchmark.py
    src/dq_etl_starter/real_dataset.py
    docs/data_sources.md
    docs/real_dataset_benchmark.md
    docs/limitations.md
    data/expected/online_retail_schema.json

The real dataset helper module handles the project-specific mapping and summary logic.
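To make the mapping step concrete, here is a minimal sketch of what such a helper could look like. It is not the project's actual code: the function name `normalize_row`, the `COLUMN_MAP` constant, and the default lineage value are hypothetical. The cancellation rule relies on the UCI dataset description, where an invoice number starting with "C" indicates a cancellation; whether the project uses exactly this rule is an assumption.

```python
# Hypothetical sketch of a column-mapping helper; names are illustrative only.

# UCI source headers mapped to normalized snake_case column names.
COLUMN_MAP = {
    "InvoiceNo": "invoice_no",
    "StockCode": "stock_code",
    "Description": "description",
    "Quantity": "quantity",
    "InvoiceDate": "invoice_date",
    "UnitPrice": "unit_price",
    "CustomerID": "customer_id",
    "Country": "country",
}


def normalize_row(raw_row, source_dataset="uci_online_retail"):
    """Map one raw UCI row (a dict of strings) to the normalized shape."""
    # Fail early if the source file does not have the expected columns.
    missing = [name for name in COLUMN_MAP if name not in raw_row]
    if missing:
        raise ValueError(f"missing source columns: {missing}")

    # Rename columns and trim surrounding whitespace; None becomes "".
    row = {
        new: str(raw_row[old] if raw_row[old] is not None else "").strip()
        for old, new in COLUMN_MAP.items()
    }

    # Derived field: revenue = quantity * unit price.
    row["revenue"] = round(int(row["quantity"]) * float(row["unit_price"]), 2)

    # Derived field (assumed rule): invoice numbers starting with "C"
    # are treated as cancellation-style rows.
    row["is_cancellation"] = row["invoice_no"].upper().startswith("C")

    # Derived field: dataset lineage.
    row["source_dataset"] = source_dataset
    return row
```

Keeping this step as a pure row-level function means it can be tested with a handful of in-memory rows, without the full external dataset.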
The two scripts provide a simple local workflow:

The project now has a clearer path from messy input files to public-dataset benchmark evidence:

    data-quality-etl-starter/
    ├── data/
    │   ├── expected/
    │   │   └── online_retail_schema.json
    │   └── output/
    ├── docs/
    │   ├── data_sources.md
    │   ├── limitations.md
    │   └── real_dataset_benchmark.md
    ├── screenshots/
    ├── scripts/
    │   ├── prepare_real_dataset_demo.py
    │   └── run_real_dataset_benchmark.py
    ├── src/dq_etl_starter/
    │   ├── real_dataset.py
    │   ├── cli.py
    │   ├── clean.py
    │   ├── report.py
    │   └── validate.py
    └── tests/
        ├── test_real_dataset.py
        └── test_real_dataset_benchmark.py

The real dataset path is optional. The default small sample workflows remain unchanged.

Clone the repository:

    git clone https://github.com/OnerGit/data-quality-etl-starter.git
    cd data-quality-etl-starter

Create a virtual environment:

    python -m venv .venv

Activate it on macOS or Linux:

    source .venv/bin/activate

Activate it on Windows PowerShell:

    .venv\Scripts\activate

Install dependencies and the local package:

    pip install -r requirements.txt
    pip install -e .

The editable install step is useful because the project uses a src/ layout.

Download the UCI Online Retail dataset from the official UCI Machine Learning Repository page. Place the file here:

    data/external/online_retail.xlsx

The project does not automatically download the dataset by default. That is intentional. For a public portfolio repository, I prefer to keep the data acquisition step explicit. It makes the source, license, citation, and local-only handling policy easier to review.

Run the preparation script.
macOS / Linux:

    python scripts/prepare_real_dataset_demo.py \
      --raw-input data/external/online_retail.xlsx \
      --output data/output/real_dataset/online_retail_normalized.csv

Windows PowerShell:

    python scripts/prepare_real_dataset_demo.py --raw-input data/external/online_retail.xlsx --output data/output/real_dataset/online_retail_normalized.csv

This step reads the local source file, validates expected source columns, maps UCI columns into project-friendly names, derives additional fields, and writes a normalized CSV.

The normalized output columns are:

    invoice_no
    stock_code
    description
    quantity
    invoice_date
    unit_price
    customer_id
    country
    revenue
    is_cancellation
    source_dataset

The derived fields are simple but useful:

    revenue is derived from quantity and unit price;
    is_cancellation marks cancellation-style rows;
    source_dataset records dataset lineage.

This preparation layer is deliberately small. It does not try to perform all business logic. It only converts the external dataset into a shape that the existing project workflow can validate and clean.

After preparation, the normalized CSV is passed into the existing CLI workflow.

macOS / Linux:

    python -m dq_etl_starter.cli run \
      --input data/output/real_dataset/online_retail_normalized.csv \
      --input-type csv \
      --schema data/expected/online_retail_schema.json \
      --output-dir data/output/real_dataset/run \
      --db-target sqlite \
      --table-name cleaned_online_retail

Windows PowerShell:

    python -m dq_etl_starter.cli run --input data/output/real_dataset/online_retail_normalized.csv --input-type csv --schema data/expected/online_retail_schema.json --output-dir data/output/real_dataset/run --db-target sqlite --table-name cleaned_online_retail

Expected local outputs:

    data/output/real_dataset/run/cleaned_online_retail.csv
    data/output/real_dataset/run/etl_output.sqlite
    data/output/real_dataset/run/quality_report.md
    data/output/real_dataset/run/quality_report.json

This is the most important design point in v0.7.0.
The real dataset path reuses the existing validation and cleaning workflow. It does not create a special one-off script that bypasses the project architecture.

The schema file is:

    data/expected/online_retail_schema.json

It defines the expected normalized columns and validation rules for fields such as invoice number, stock code, quantity, invoice date, unit price, customer ID, country, revenue, cancellation flag, and source dataset.

The schema is not intended to certify the dataset as business-ready. It is a practical contract for this starter workflow:

    external retail columns
        ↓
    normalized project columns
        ↓
    expected schema rules
        ↓
    quality report

That is a useful handoff pattern because the next person can inspect both the mapping and the validation report.

The CLI workflow writes a Markdown report and a JSON report. For the real dataset workflow, the Markdown report is written to:

    data/output/real_dataset/run/quality_report.md

The report is useful because it records what the workflow found rather than only producing a cleaned file. Typical report sections include:

For client-style work, this is important. A cleaned output file alone is not enough. The workflow should also explain what was detected and what still needs review.

After the CLI workflow finishes, generate a local benchmark report and summary outputs.
macOS / Linux:

    python scripts/run_real_dataset_benchmark.py \
      --normalized-input data/output/real_dataset/online_retail_normalized.csv \
      --quality-report data/output/real_dataset/run/quality_report.json \
      --output-dir data/output/real_dataset \
      --dataset-name uci_online_retail

Windows PowerShell:

    python scripts/run_real_dataset_benchmark.py --normalized-input data/output/real_dataset/online_retail_normalized.csv --quality-report data/output/real_dataset/run/quality_report.json --output-dir data/output/real_dataset --dataset-name uci_online_retail

Expected local outputs:

    data/output/real_dataset/benchmark_report.md
    data/output/real_dataset/summary/revenue_by_country.csv
    data/output/real_dataset/summary/revenue_by_month.csv
    data/output/real_dataset/summary/cancellation_summary.csv
    data/output/real_dataset/summary/missing_customer_summary.csv

The benchmark report is not a universal performance claim. It is local evidence for this machine, this dependency environment, and this dataset preparation flow. That distinction matters. Runtime can change depending on CPU, disk speed, Python version, package versions, source file format, operating system, and local machine conditions.

The benchmark script also writes lightweight summary CSV files. The summary outputs are intentionally simple:

    revenue_by_country.csv
    revenue_by_month.csv
    cancellation_summary.csv
    missing_customer_summary.csv

They are not a full BI model. They are small reporting-ready outputs that show how a cleaned retail transaction dataset can be summarized after validation. For example:

    revenue_by_country.csv supports country-level revenue inspection;
    revenue_by_month.csv supports monthly trend inspection;
    cancellation_summary.csv records cancellation and non-positive row counters;
    missing_customer_summary.csv helps inspect where customer IDs are missing.

This is often enough for a first data workflow milestone.
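As a rough illustration of how small these summaries can be, the sketch below aggregates normalized rows using only the standard library. It is not the project's implementation. The function name `summarize_rows` is hypothetical, and the sketch assumes that `invoice_date` is an ISO-style string starting with `YYYY-MM`, that `is_cancellation` is stored as a `True`/`False` string in the CSV, and that the "non-positive" counter refers to rows with a quantity of zero or less.

```python
from collections import defaultdict


def summarize_rows(rows):
    """Aggregate normalized rows (dicts of strings) into small summaries.

    Hypothetical sketch; field formats are assumptions, not the project's API.
    """
    revenue_by_country = defaultdict(float)
    revenue_by_month = defaultdict(float)
    missing_customer_by_country = defaultdict(int)
    cancellation_rows = 0
    non_positive_quantity_rows = 0

    for row in rows:
        revenue = float(row["revenue"])
        revenue_by_country[row["country"]] += revenue

        # Assumes an ISO-style date string, so the first 7 chars are YYYY-MM.
        revenue_by_month[row["invoice_date"][:7]] += revenue

        # Assumes the flag was written to CSV as "True"/"False" (or "1"/"0").
        if str(row["is_cancellation"]).strip().lower() in {"true", "1"}:
            cancellation_rows += 1

        # Assumed meaning of "non-positive": quantity <= 0.
        if int(row["quantity"]) <= 0:
            non_positive_quantity_rows += 1

        if not str(row["customer_id"]).strip():
            missing_customer_by_country[row["country"]] += 1

    return {
        "revenue_by_country": {k: round(v, 2) for k, v in revenue_by_country.items()},
        "revenue_by_month": {k: round(v, 2) for k, v in revenue_by_month.items()},
        "cancellation_summary": {
            "cancellation_rows": cancellation_rows,
            "non_positive_quantity_rows": non_positive_quantity_rows,
        },
        "missing_customer_summary": dict(missing_customer_by_country),
    }
```

Each entry in the returned dictionary corresponds to one of the summary files listed above and could be written out with `csv.DictWriter`. On the full dataset a DataFrame library would likely be more convenient; the plain-Python version is only meant to show the shape of the logic.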
The next version could load these summary files into PostgreSQL, query them in DuckDB, or feed a local dashboard, but v0.7.0 intentionally keeps the real dataset path focused.

The benchmark report is designed to answer practical questions:

That makes the run easier to review later. It also makes the project stronger as a portfolio asset because the workflow is not only described in prose. It leaves behind files, screenshots, reports, and commands that can be inspected.

The project could theoretically download the dataset automatically. For this version, I chose not to do that. Manual download keeps the workflow clearer:

For a small portfolio repository, this is a reasonable trade-off. The project demonstrates how to process the dataset, not how to become a dataset distribution tool.

Run the full test suite:

    python -m compileall -q src/dq_etl_starter
    python -m compileall -q scripts
    pytest

Run the v0.7-related tests:

    pytest tests/test_real_dataset.py
    pytest tests/test_real_dataset_benchmark.py

The tests focus on the reusable code paths rather than requiring the full external dataset to be committed. That is another useful pattern for public repositories: test the transformation logic with small fixtures, and keep the large external dataset local-only.

The v0.7.0 real dataset benchmark does not add:

This project is still a small Python data workflow starter. The v0.7.0 update demonstrates a specific point: the workflow can be applied to a public retail transaction dataset locally, while keeping the data handling policy, validation steps, outputs, and limitations clear.

Possible next improvements include:

The main constraint remains the same: keep the project small, reproducible, inspectable, and easy to adapt.
GitHub repository: https://github.com/OnerGit/data-quality-etl-starter

Previous article: Preparing AI-Ready Data Without Calling an LLM API
https://dev.to/bob_oner/preparing-ai-ready-data-without-calling-an-llm-api-5daf

This v0.7.0 update is a practical next step: from synthetic and generated demos to a local public retail dataset benchmark that reuses the same validation, cleaning, reporting, and handoff workflow.