SoftUni · Data Science (Exam Project)
Author: Denis Milanov
Aim: Demonstrate a reproducible, end-to-end receipts analytics pipeline — ingest → cleaning → entity resolution → offline geocoding → holiday & CPI enrichment → merchant categorization (rules + optional ML) → analytics report.
Notes: Educational use only; no external APIs; fully local/offline workflow.
- Built as part of the SoftUni Data Science track to show practical mastery of data wrangling, feature engineering, conservative ML, and reporting.
- Results are reproducible from the notebooks; all transformations are logged and saved as intermediate Parquet/CSV/PNG/JSON artifacts.
- Receipts labels: SROIE dataset v2 (Kaggle), labels.zip, educational use. https://www.kaggle.com/datasets/urbikn/sroie-datasetv2
- Postcodes (Malaysia): GeoNames; download the country postal code file MY.zip. https://www.geonames.org/
- Consumer Price Index (Malaysia): DOSM via data.gov.my (state-level CPI), cpi_2d_state.csv. https://data.gov.my/data-catalogue/cpi_state
- Public holidays (Malaysia): holidays Python package. https://pypi.org/project/holidays/
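As an illustration of how the GeoNames postcode file can back an offline geocoder, here is a minimal sketch that turns the tab-separated postal code rows into a postcode lookup. It is not the notebook's code: the function name load_postcodes, the column labels, and the two sample rows (with made-up coordinates) are assumptions for the example. The column order follows the GeoNames postal code readme.

```python
import csv
import io

# Column order as described in the GeoNames postal code readme.
GEONAMES_COLUMNS = [
    "country_code", "postal_code", "place_name",
    "admin_name1", "admin_code1",
    "admin_name2", "admin_code2",
    "admin_name3", "admin_code3",
    "latitude", "longitude", "accuracy",
]


def load_postcodes(text):
    """Build a postcode -> {place, state, lat, lon} lookup (hypothetical helper)."""
    reader = csv.DictReader(
        io.StringIO(text),
        fieldnames=GEONAMES_COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    lookup = {}
    for row in reader:
        # Postcodes stay strings so leading zeros are never lost.
        code = (row["postal_code"] or "").strip()
        if not code or not row["latitude"] or not row["longitude"]:
            continue  # skip rows without a usable code or coordinates
        # A postcode may cover several places; keep the first one seen.
        lookup.setdefault(code, {
            "place": row["place_name"],
            "state": row["admin_name1"],
            "lat": float(row["latitude"]),
            "lon": float(row["longitude"]),
        })
    return lookup


# Illustrative rows only; the coordinates are invented for the example.
sample = (
    "MY\t50000\tKuala Lumpur\tKuala Lumpur\tKUL\t\t\t\t\t3.1412\t101.6865\t4\n"
    "MY\t81100\tJohor Bahru\tJohor\tJHR\t\t\t\t\t1.4927\t103.7414\t4\n"
)
postcodes = load_postcodes(sample)
print(postcodes["81100"]["state"])
```

With the real file, the text would presumably come from the archive itself, for example by reading the MY.txt member of MY.zip with the standard zipfile module and decoding it as UTF-8. The member name is an assumption and should be checked against the downloaded archive.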
Put data.zip at the repo root and unzip it there. It will create the data/ folder with ready-made outputs so you can open the HTML report immediately.
- Python 3.11
Install (example on Windows, PowerShell):
py -3.11 -m venv .venv
. .venv\Scripts\Activate.ps1
python -m pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
jupyter lab
project/
├─ data/
│ ├─ labels.zip, MY.zip, cpi_2d_state.csv
│ ├─ *.parquet (created by notebooks)
│ └─ outputs/ (PNGs/CSVs/JSON created by 07)
├─ notebooks/
│ ├─ 01_ingest_labels.ipynb
│ ├─ 02_clean_normalize.ipynb
│ ├─ 03_entity_resolution.ipynb
│ ├─ 04_geocode_offline.ipynb
│ ├─ 05_enrich_holidays_cpi.ipynb
│ ├─ 06_categorize_rules.ipynb
│ └─ 07_analytics.ipynb
└─ project_report.html (reads charts/tables from data/outputs/)
- 01 – Ingest → creates data/receipts_raw.parquet.
- 02 – Clean → data/receipts_clean.parquet.
- 03 – Entity Resolution → data/receipts_with_merchants.parquet (+ ER sweep).
- 04 – Geocode (offline) → data/receipts_geocoded.parquet.
- 05 – Enrich (Holidays + CPI) → data/receipts_enriched.parquet + enrich_metrics.json.
- 06 – Categorize → data/receipts_categorized.parquet + coverage JSON/CSVs.
- 07 – Analytics & Report → charts/CSVs into data/outputs/ and a summary JSON.
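Because each notebook consumes the Parquet file written by the previous one, it can help to check which stage to run next. The sketch below is a hypothetical helper, not part of the notebooks. It uses only the Parquet file names listed above and assumes they all sit directly in data/.

```python
from pathlib import Path

# Parquet outputs named in the step list, in pipeline order.
STAGE_OUTPUTS = [
    ("01", "receipts_raw.parquet"),
    ("02", "receipts_clean.parquet"),
    ("03", "receipts_with_merchants.parquet"),
    ("04", "receipts_geocoded.parquet"),
    ("05", "receipts_enriched.parquet"),
    ("06", "receipts_categorized.parquet"),
]


def missing_stages(data_dir):
    """Return (notebook, filename) pairs whose output is not on disk yet."""
    data_dir = Path(data_dir)
    return [
        (notebook, name)
        for notebook, name in STAGE_OUTPUTS
        if not (data_dir / name).is_file()
    ]


def next_notebook(data_dir):
    """First notebook whose output is missing, or "07" when all six exist."""
    missing = missing_stages(data_dir)
    return missing[0][0] if missing else "07"


# Assumes the script is run from the repo root, next to data/.
print("Next notebook to run:", next_notebook("data"))
```

The check only looks at file presence, so it does not detect stale outputs; re-running an upstream notebook still means re-running everything after it.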
Run 07 to refresh charts in data/outputs/. Then open project_report.html for a narrative view.
- Uniqueness of receipt_id preserved (01/05).
- ER: conservative threshold, threshold sweep chart (03).
- CPI join: state → national → global fallback; real_total computed safely (05).
- Categorization: rules (06).
- Analytics: coverage, anomalies, YoY, MA3/MA6, Pareto/HHI, stacked categories, weekday/holiday, geo (07).
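The CPI fallback chain and the guarded real_total computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from notebook 05: the function names, the dictionary-based lookups, the month keys, the base of 100, and the reading of "global" as a single constant CPI value are all assumptions, and the CPI figures are invented.

```python
def _usable(cpi):
    # None, NaN (NaN != NaN) and non-positive values all count as missing.
    return cpi is not None and cpi == cpi and cpi > 0


def pick_cpi(state, month, state_cpi, national_cpi, global_cpi):
    """Return (cpi, source) using the state -> national -> global fallback."""
    value = state_cpi.get((state, month))
    if _usable(value):
        return value, "state"
    value = national_cpi.get(month)
    if _usable(value):
        return value, "national"
    return global_cpi, "global"


def real_total(total, cpi, base=100.0):
    """Deflate a nominal total; return None instead of raising on bad input."""
    if total is None or not _usable(cpi):
        return None
    return total * base / cpi


# Invented CPI values, for illustration only.
state_cpi = {("Johor", "2018-03"): 120.0}
national_cpi = {"2018-03": 125.0}
global_cpi = 110.0

cpi, source = pick_cpi("Johor", "2018-03", state_cpi, national_cpi, global_cpi)
print(source, real_total(60.0, cpi))  # state 50.0
```

Returning the source alongside the value makes it easy to report how many receipts fell back to the national or global level, which is the kind of figure a metrics file such as enrich_metrics.json could hold.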