{"slug": "running-a-real-retail-dataset-through-a-python-data-quality-workflow", "title": "Running a Real Retail Dataset Through a Python Data Quality Workflow", "summary": "The Data Quality ETL Starter project v0.7.0 now processes the UCI Online Retail dataset, demonstrating its ability to handle real-world data. The workflow prepares the public dataset locally, runs it through the existing CLI validation and cleaning pipeline, and generates quality reports and benchmark summaries without redistributing the raw data.", "body_md": "In the previous article, I extended a small Python data quality ETL starter with AI-ready data preparation.\n\nThe important constraint was that the workflow did not call an LLM API, generate embeddings, or train a model. It prepared structured data assets such as schema profiles, data dictionaries, validation summaries, feature-ready CSV files, and manifest files.\n\nPrevious article:\n\n[Preparing AI-Ready Data Without Calling an LLM API](https://dev.to/bob_oner/preparing-ai-ready-data-without-calling-an-llm-api-5daf)\n\nThis follow-up focuses on the v0.7.0 update of the same project:\n\n[Data Quality ETL Starter on GitHub](https://github.com/OnerGit/data-quality-etl-starter)\n\nThe new goal is to move beyond synthetic demo data and show that the same data quality workflow can process a public retail/e-commerce-style dataset locally.\n\nThis is still not a big data platform, a production retail analytics system, a benchmark leaderboard, or a public dataset redistribution repository.\n\nThe goal is narrower and more practical:\n\n```\nmanually downloaded public retail dataset\n        ↓\nprepare_real_dataset_demo.py\n        ↓\nnormalized retail transaction CSV\n        ↓\nexisting CLI validation and cleaning workflow\n        ↓\nquality reports + SQLite export\n        ↓\nrun_real_dataset_benchmark.py\n        ↓\nbenchmark report + summary CSV outputs\n```\n\nThat is a useful next step for a portfolio project because it shows the 
workflow can handle a more realistic dataset while still keeping data handling, scope, and reproducibility clear.\n\nEarlier versions of this project used small sample files and generated synthetic order data.\n\nThat is useful for testing and documentation, but it leaves one practical question:\n\nCan the workflow handle a public dataset that was not designed specifically for this repository?\n\nv0.7.0 adds an optional real dataset benchmark path to answer that question.\n\nThe workflow now demonstrates how to:\n\nThe key design choice is that the existing CLI remains the source of truth.\n\nThe real dataset path does not become a separate pipeline. It prepares the source data, then passes it through the same validation and cleaning workflow used by the rest of the project.\n\nThe default v0.7.0 dataset is the UCI Online Retail dataset.\n\nOfficial source:\n\n[UCI Machine Learning Repository: Online Retail](https://archive.ics.uci.edu/dataset/352/online%2Bretail)\n\nCitation:\n\n```\nChen, D. (2015). Online Retail [Dataset]. UCI Machine Learning Repository.\nhttps://doi.org/10.24432/C5BW33\n```\n\nLicense note:\n\n```\nCreative Commons Attribution 4.0 International (CC BY 4.0)\n```\n\nThe dataset is useful for this project because it is retail/e-commerce adjacent and transaction-shaped. It includes fields that map naturally into an invoice/order-style workflow:\n\n```\nInvoiceNo\nStockCode\nDescription\nQuantity\nInvoiceDate\nUnitPrice\nCustomerID\nCountry\n```\n\nThe project maps those source columns into normalized `snake_case` columns and adds derived fields.\n\nThis part is important.\n\nThe repository does not redistribute the full raw UCI dataset. 
It also does not commit full normalized or cleaned real dataset outputs.\n\nThese paths are local-only:\n\n```\ndata/external/\ndata/raw/public/\ndata/output/real_dataset/\n```\n\nThe repository keeps:\n\nIt does not keep:\n\nThis keeps the repository lightweight and avoids turning it into a dataset mirror.\n\nThe most relevant new files are:\n\n```\nscripts/prepare_real_dataset_demo.py\nscripts/run_real_dataset_benchmark.py\nsrc/dq_etl_starter/real_dataset.py\ndocs/data_sources.md\ndocs/real_dataset_benchmark.md\ndocs/limitations.md\ndata/expected/online_retail_schema.json\n```\n\nThe real dataset helper module handles the project-specific mapping and summary logic.\n\nThe two scripts provide a simple local workflow:\n\nThe project now has a clearer path from messy input files to public-dataset benchmark evidence:\n\n```\ndata-quality-etl-starter/\n├── data/\n│   ├── expected/\n│   │   └── online_retail_schema.json\n│   └── output/\n├── docs/\n│   ├── data_sources.md\n│   ├── limitations.md\n│   └── real_dataset_benchmark.md\n├── screenshots/\n├── scripts/\n│   ├── prepare_real_dataset_demo.py\n│   └── run_real_dataset_benchmark.py\n├── src/dq_etl_starter/\n│   ├── real_dataset.py\n│   ├── cli.py\n│   ├── clean.py\n│   ├── report.py\n│   └── validate.py\n└── tests/\n    ├── test_real_dataset.py\n    └── test_real_dataset_benchmark.py\n```\n\nThe real dataset path is optional. 
The default small sample workflows remain unchanged.\n\nClone the repository:\n\n```\ngit clone https://github.com/OnerGit/data-quality-etl-starter.git\ncd data-quality-etl-starter\n```\n\nCreate a virtual environment:\n\n```\npython -m venv .venv\n```\n\nActivate it on macOS or Linux:\n\n```\nsource .venv/bin/activate\n```\n\nActivate it on Windows PowerShell:\n\n```\n.venv\\Scripts\\activate\n```\n\nInstall dependencies and the local package:\n\n```\npip install -r requirements.txt\npip install -e .\n```\n\nThe editable install step is useful because the project uses a `src/` layout.\n\nDownload the UCI Online Retail dataset from the official UCI Machine Learning Repository page.\n\nPlace the file here:\n\n```\ndata/external/online_retail.xlsx\n```\n\nThe project does not automatically download the dataset by default.\n\nThat is intentional.\n\nFor a public portfolio repository, I prefer to keep the data acquisition step explicit. It makes the source, license, citation, and local-only handling policy easier to review.\n\nRun the preparation script.\n\nmacOS / Linux:\n\n```\npython scripts/prepare_real_dataset_demo.py \\\n  --raw-input data/external/online_retail.xlsx \\\n  --output data/output/real_dataset/online_retail_normalized.csv\n```\n\nWindows PowerShell:\n\n```\npython scripts/prepare_real_dataset_demo.py `\n  --raw-input data/external/online_retail.xlsx `\n  --output data/output/real_dataset/online_retail_normalized.csv\n```\n\nThis step reads the local source file, validates expected source columns, maps UCI columns into project-friendly names, derives additional fields, and writes a normalized CSV.\n\nThe normalized output columns are:\n\n```\ninvoice_no\nstock_code\ndescription\nquantity\ninvoice_date\nunit_price\ncustomer_id\ncountry\nrevenue\nis_cancellation\nsource_dataset\n```\n\nThe derived fields are simple but useful:\n\n- `revenue` is derived from quantity and unit price;\n- `is_cancellation` marks cancellation-style 
rows;\n- `source_dataset` records dataset lineage.\n\nThis preparation layer is deliberately small. It does not try to perform all business logic. It only converts the external dataset into a shape that the existing project workflow can validate and clean.\n\nAfter preparation, the normalized CSV is passed into the existing CLI workflow.\n\nmacOS / Linux:\n\n```\npython -m dq_etl_starter.cli run \\\n  --input data/output/real_dataset/online_retail_normalized.csv \\\n  --input-type csv \\\n  --schema data/expected/online_retail_schema.json \\\n  --output-dir data/output/real_dataset/run \\\n  --db-target sqlite \\\n  --table-name cleaned_online_retail\n```\n\nWindows PowerShell:\n\n```\npython -m dq_etl_starter.cli run `\n  --input data/output/real_dataset/online_retail_normalized.csv `\n  --input-type csv `\n  --schema data/expected/online_retail_schema.json `\n  --output-dir data/output/real_dataset/run `\n  --db-target sqlite `\n  --table-name cleaned_online_retail\n```\n\nExpected local outputs:\n\n```\ndata/output/real_dataset/run/cleaned_online_retail.csv\ndata/output/real_dataset/run/etl_output.sqlite\ndata/output/real_dataset/run/quality_report.md\ndata/output/real_dataset/run/quality_report.json\n```\n\nThis is the most important design point in v0.7.0.\n\nThe real dataset path reuses the existing validation and cleaning workflow. 
It does not create a special one-off script that bypasses the project architecture.\n\nThe schema file is:\n\n```\ndata/expected/online_retail_schema.json\n```\n\nIt defines the expected normalized columns and validation rules for fields such as invoice number, stock code, quantity, invoice date, unit price, customer ID, country, revenue, cancellation flag, and source dataset.\n\nThe schema is not intended to certify the dataset as business-ready.\n\nIt is a practical contract for this starter workflow:\n\n```\nexternal retail columns\n        ↓\nnormalized project columns\n        ↓\nexpected schema rules\n        ↓\nquality report\n```\n\nThat is a useful handoff pattern because the next person can inspect both the mapping and the validation report.\n\nThe CLI workflow writes a Markdown report and a JSON report.\n\nFor the real dataset workflow, the Markdown report is written to:\n\n```\ndata/output/real_dataset/run/quality_report.md\n```\n\nThe report is useful because it records what the workflow found rather than only producing a cleaned file.\n\nTypical report sections include:\n\nFor client-style work, this is important. A cleaned output file alone is not enough. 
The workflow should also explain what was detected and what still needs review.\n\nAfter the CLI workflow finishes, generate a local benchmark report and summary outputs.\n\nmacOS / Linux:\n\n```\npython scripts/run_real_dataset_benchmark.py \\\n  --normalized-input data/output/real_dataset/online_retail_normalized.csv \\\n  --quality-report data/output/real_dataset/run/quality_report.json \\\n  --output-dir data/output/real_dataset \\\n  --dataset-name uci_online_retail\n```\n\nWindows PowerShell:\n\n```\npython scripts/run_real_dataset_benchmark.py `\n  --normalized-input data/output/real_dataset/online_retail_normalized.csv `\n  --quality-report data/output/real_dataset/run/quality_report.json `\n  --output-dir data/output/real_dataset `\n  --dataset-name uci_online_retail\n```\n\nExpected local outputs:\n\n```\ndata/output/real_dataset/benchmark_report.md\ndata/output/real_dataset/summary/revenue_by_country.csv\ndata/output/real_dataset/summary/revenue_by_month.csv\ndata/output/real_dataset/summary/cancellation_summary.csv\ndata/output/real_dataset/summary/missing_customer_summary.csv\n```\n\nThe benchmark report is not a universal performance claim.\n\nIt is local evidence for this machine, this dependency environment, and this dataset preparation flow.\n\nThat distinction matters. 
Runtime can change depending on CPU, disk speed, Python version, package versions, source file format, operating system, and local machine conditions.\n\nThe benchmark script also writes lightweight summary CSV files.\n\nThe summary outputs are intentionally simple:\n\n```\nrevenue_by_country.csv\nrevenue_by_month.csv\ncancellation_summary.csv\nmissing_customer_summary.csv\n```\n\nThey are not a full BI model.\n\nThey are small reporting-ready outputs that show how a cleaned retail transaction dataset can be summarized after validation.\n\nFor example:\n\n- `revenue_by_country.csv` supports country-level revenue inspection;\n- `revenue_by_month.csv` supports monthly trend inspection;\n- `cancellation_summary.csv` records cancellation and non-positive row counters;\n- `missing_customer_summary.csv` helps inspect where customer IDs are missing.\n\nThis is often enough for a first data workflow milestone.\n\nThe next version could load these into PostgreSQL, query them in DuckDB, or feed a local dashboard, but v0.7.0 intentionally keeps the real dataset path focused.\n\nThe benchmark report is designed to answer practical questions:\n\nThat makes the run easier to review later.\n\nIt also makes the project stronger as a portfolio asset because the workflow is not only described in prose. 
It leaves behind files, screenshots, reports, and commands that can be inspected.\n\nThe project could theoretically download the dataset automatically.\n\nFor this version, I chose not to do that.\n\nManual download keeps the workflow clearer:\n\nFor a small portfolio repository, this is a reasonable trade-off.\n\nThe project demonstrates how to process the dataset, not how to become a dataset distribution tool.\n\nRun the full test suite:\n\n```\npython -m compileall -q src/dq_etl_starter\npython -m compileall -q scripts\npytest\n```\n\nRun the v0.7-related tests:\n\n```\npytest tests/test_real_dataset.py\npytest tests/test_real_dataset_benchmark.py\n```\n\nThe tests focus on the reusable code paths rather than requiring the full external dataset to be committed.\n\nThat is another useful pattern for public repositories: test the transformation logic with small fixtures, and keep the large external dataset local-only.\n\nThe v0.7.0 real dataset benchmark does not add:\n\nThis project is still a small Python data workflow starter.\n\nThe v0.7.0 update proves a specific point: the workflow can be applied to a public retail transaction dataset locally, while keeping the data handling policy, validation steps, outputs, and limitations clear.\n\nPossible next improvements include:\n\nThe main constraint remains the same:\n\nKeep the project small, reproducible, inspectable, and easy to adapt.\n\nGitHub repository:\n\n[https://github.com/OnerGit/data-quality-etl-starter](https://github.com/OnerGit/data-quality-etl-starter)\n\nPrevious article:\n\n[Preparing AI-Ready Data Without Calling an LLM API](https://dev.to/bob_oner/preparing-ai-ready-data-without-calling-an-llm-api-5daf)\n\nThis v0.7.0 update is a practical next step: from synthetic and generated demos to a local public retail dataset benchmark that reuses the same validation, cleaning, reporting, and handoff workflow.", "url": 
"https://wpnews.pro/news/running-a-real-retail-dataset-through-a-python-data-quality-workflow", "canonical_source": "https://dev.to/bob_oner/running-a-real-retail-dataset-through-a-python-data-quality-workflow-490b", "published_at": "2026-06-16 02:54:17+00:00", "updated_at": "2026-06-16 03:17:26.780160+00:00", "lang": "en", "topics": ["developer-tools"], "entities": ["Data Quality ETL Starter", "UCI Machine Learning Repository", "Online Retail dataset", "Chen, D.", "GitHub", "OnerGit"], "alternates": {"html": "https://wpnews.pro/news/running-a-real-retail-dataset-through-a-python-data-quality-workflow", "markdown": "https://wpnews.pro/news/running-a-real-retail-dataset-through-a-python-data-quality-workflow.md", "text": "https://wpnews.pro/news/running-a-real-retail-dataset-through-a-python-data-quality-workflow.txt", "jsonld": "https://wpnews.pro/news/running-a-real-retail-dataset-through-a-python-data-quality-workflow.jsonld"}}