DenisMilanov/receipts-analysis


Receipts Analytics — End-to-End Project

SoftUni · Data Science (Exam Project)
Author: Denis Milanov
Aim: Demonstrate a reproducible, end-to-end receipts analytics pipeline — ingest → cleaning → entity resolution → offline geocoding → holiday & CPI enrichment → merchant categorization (rules + optional ML) → analytics report.
Notes: Educational use only; no external APIs; fully local/offline workflow.

Context & Academic Integrity

  • Built as part of the SoftUni Data Science track to show practical mastery of data wrangling, feature engineering, conservative ML, and reporting.
  • Results are reproducible from the notebooks; all transformations are logged and saved as intermediate Parquet/CSV/PNG/JSON artifacts.

Data Sources & Credits

Can’t run the pipeline right now?

Put data.zip at the repo root and unzip it there. It will create the data/ folder with ready-made outputs so you can open the HTML report immediately.
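If you prefer to unzip from Python rather than a file manager, the step above can be sketched as follows. The function name `extract_data_zip` is illustrative and not part of the repository; the sketch assumes only that data.zip sits at the repo root and contains a top-level data/ folder, as described above.

```python
import zipfile
from pathlib import Path


def extract_data_zip(repo_root="."):
    """Unzip <repo_root>/data.zip in place and return the path to data/.

    Hypothetical helper: assumes the archive holds a top-level data/ folder.
    """
    root = Path(repo_root)
    archive = root / "data.zip"
    if not archive.is_file():
        raise FileNotFoundError(f"Expected {archive} at the repo root")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(root)  # extracts next to the archive
    return root / "data"
```

Calling `extract_data_zip()` from the repo root should leave a data/ folder beside the notebooks/ folder.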

Environment

  • Python 3.11

Install (example on Windows):

py -3.11 -m venv .venv
. .venv\Scripts\Activate.ps1
python -m pip install --upgrade pip wheel setuptools 
pip install -r requirements.txt
jupyter lab

Project structure

project/
├─ data/
│  ├─ labels.zip, MY.zip, cpi_2d_state.csv
│  ├─ *.parquet (created by notebooks)
│  └─ outputs/ (PNGs/CSVs/JSON created by 07)
├─ notebooks/
│  ├─ 01_ingest_labels.ipynb
│  ├─ 02_clean_normalize.ipynb
│  ├─ 03_entity_resolution.ipynb
│  ├─ 04_geocode_offline.ipynb
│  ├─ 05_enrich_holidays_cpi.ipynb
│  ├─ 06_categorize_rules.ipynb
│  └─ 07_analytics.ipynb
└─ project_report.html  (reads charts/tables from data/outputs/)

How to run (step‑by‑step)

  1. 01 – Ingest → creates data/receipts_raw.parquet.
  2. 02 – Clean → creates data/receipts_clean.parquet.
  3. 03 – Entity Resolution → creates data/receipts_with_merchants.parquet (+ ER sweep).
  4. 04 – Geocode (offline) → creates data/receipts_geocoded.parquet.
  5. 05 – Enrich (Holidays + CPI) → creates data/receipts_enriched.parquet + enrich_metrics.json.
  6. 06 – Categorize → creates data/receipts_categorized.parquet + coverage JSON/CSVs.
  7. 07 – Analytics & Report → charts/CSVs into data/outputs/ and a summary JSON.
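Because each notebook reads the previous notebook's output, it can help to check which step to run next. The sketch below is not part of the repository; `next_step_to_run` is a hypothetical helper that takes the notebook names and output paths from the lists above and assumes a step counts as done once its main output exists.

```python
from pathlib import Path

# (notebook, main output) pairs in run order, taken from the steps above.
PIPELINE = [
    ("01_ingest_labels.ipynb", "data/receipts_raw.parquet"),
    ("02_clean_normalize.ipynb", "data/receipts_clean.parquet"),
    ("03_entity_resolution.ipynb", "data/receipts_with_merchants.parquet"),
    ("04_geocode_offline.ipynb", "data/receipts_geocoded.parquet"),
    ("05_enrich_holidays_cpi.ipynb", "data/receipts_enriched.parquet"),
    ("06_categorize_rules.ipynb", "data/receipts_categorized.parquet"),
    ("07_analytics.ipynb", "data/outputs"),
]


def next_step_to_run(repo_root="."):
    """Return the first notebook whose main output is missing, or None.

    Assumption: the presence of the output path means the step was run.
    """
    root = Path(repo_root)
    for notebook, output in PIPELINE:
        if not (root / output).exists():
            return notebook
    return None
```

A return value of `None` means every listed output is present; it does not verify that the files are up to date.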

Report

Run 07 to refresh charts in data/outputs/. Then open project_report.html for a narrative view.

Quality gates

  • Uniqueness of receipt_id preserved (01/05).
  • ER: conservative threshold, threshold sweep chart (03).
  • CPI join: state → national → global fallback; real_total computed safely (05).
  • Categorization: rules (06).
  • Analytics: coverage, anomalies, YoY, MA3/MA6, Pareto/HHI, stacked categories, weekday/holiday, geo (07).
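To make a few of these gates concrete, here is a minimal sketch in plain Python. It is not the notebooks' code. All function names, the `(state, period)` lookup keys, the CPI base of 100, and the 0 to 1 HHI scale are assumptions made for illustration, and "computed safely" is read here as "return None instead of dividing by a missing or non-positive CPI".

```python
from collections import Counter


def duplicate_ids(receipt_ids):
    """Return receipt_id values seen more than once (empty list = gate passes)."""
    counts = Counter(receipt_ids)
    return sorted(rid for rid, n in counts.items() if n > 1)


def lookup_cpi(state, period, state_cpi, national_cpi, global_cpi):
    """CPI with state -> national -> global fallback; also reports the level used."""
    if (state, period) in state_cpi:
        return state_cpi[(state, period)], "state"
    if period in national_cpi:
        return national_cpi[period], "national"
    return global_cpi, "global"


def real_total(total, cpi, base=100.0):
    """Deflate a nominal total; None when an input is missing or CPI <= 0."""
    if total is None or cpi is None or cpi <= 0:
        return None
    return total * base / cpi


def hhi(spend_by_merchant):
    """Herfindahl-Hirschman index: sum of squared spend shares (0 to 1 scale)."""
    total = sum(spend_by_merchant.values())
    if total <= 0:
        return None
    return sum((v / total) ** 2 for v in spend_by_merchant.values())


def moving_average(values, window):
    """Trailing mean (window=3 for MA3, 6 for MA6); None until the window fills."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out
```

For example, `hhi({"a": 50, "b": 50})` gives 0.5: two merchants with equal shares. Returning the fallback level from `lookup_cpi` makes it possible to report how many receipts fell back to national or global CPI.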
