- Introduction
- Project Description
- Requirements
- Installation Instructions
- Running Python Scripts Locally
- Makefile Commands
Royal Blue is an Extract, Transform, Load (ETL) data pipeline built on top of AWS using Python, Terraform, and pandas.
It was built as a graduation project for the Northcoders Data Engineering in Python bootcamp that ran from March to June 2025.
The repository implements a complete ETL (Extract, Transform, Load) pipeline in AWS (Amazon Web Services): a robust, scalable solution that extracts data from an OLTP (Online Transaction Processing) PostgreSQL database, transforms it, and loads it into an OLAP (Online Analytical Processing) database.
Follow these links for the origin and destination database Entity Relationship Diagrams.
The pipeline transforms day-to-day transactional business data into an analysis-ready format suitable for a range of business intelligence purposes.
Python is the main programming language, with Terraform for infrastructure as code. Bash and SQL scripts support the build process and integration testing, and a full-featured Makefile centralises the project's most common commands.
This project follows the specs proposed for the Northcoders Data Engineering graduation project and was developed as a group effort by @theorib, @Brxzee, @charleybolton, @JanetteSamuels, @sxnfer and @josephtheodore.
- This project uses uv to manage Python environments, dependencies, running scripts, and the build process. Make sure it is installed by following the official guide.
- You will also need the latest version of Python installed (3.13.3 at the time of this writing).
- For local development, you will need the AWS CLI installed and configured with your AWS credentials.
- To run some of the integration tests, you will need PostgreSQL v14 or higher installed locally.
Programming Language & Runtime
- Python 3.13.3+
Core Python Dependencies
- Psycopg 3 - PostgreSQL database adapter
- Boto3 - for AWS SDK integration
- pandas - for data manipulation
- PyArrow - for Parquet file handling
- Pydantic - for data validation and JSON serialisation
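To give a sense of how these libraries fit together, here is a rough sketch of an extract-and-stage step. This is illustrative only, not code from the repository: the table name, local file path, and S3 key are placeholders, while the environment variable names are the ones from `.env.example`.

```python
import os

import boto3
import pandas as pd
import psycopg

# Pull one table from the source OLTP database into a DataFrame (Psycopg 3 + pandas).
with psycopg.connect(
    host=os.environ["TOTESYS_DB_HOST"],
    dbname=os.environ["TOTESYS_DB_DATABASE"],
    user=os.environ["TOTESYS_DB_USER"],
    password=os.environ["TOTESYS_DB_PASSWORD"],
    port=os.environ["TOTESYS_DB_PORT"],
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM sales_order;")  # placeholder table name
        columns = [col.name for col in cur.description]
        df = pd.DataFrame(cur.fetchall(), columns=columns)

# Serialise to Parquet (pandas delegates to PyArrow) and stage the file in S3 via Boto3.
df.to_parquet("/tmp/sales_order.parquet", engine="pyarrow")
boto3.client("s3").upload_file(
    "/tmp/sales_order.parquet",
    os.environ["INGEST_ZONE_BUCKET_NAME"],
    "sales_order/latest.parquet",  # placeholder object key
)
```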
Development Dependencies
- pytest - and related plugins for testing and coverage
- pytest-postgresql - used for running integration tests against local PostgreSQL databases
- Ruff - for linting and formatting
- Moto - for mocking AWS services during tests
- Bandit - for vulnerability and security scanning of source code
- IPython Kernel - for VS Code Jupyter notebook support, used for testing and experimenting
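As an illustration of how Moto fits into the test suite, here is a minimal, self-contained sketch assuming Moto 5's `mock_aws` decorator; the bucket name and object key are made up for the example and are not the project's real resources.

```python
import boto3
from moto import mock_aws


@mock_aws
def test_object_lands_in_ingest_bucket():
    # All boto3 calls inside this test hit Moto's in-memory backend, not real AWS.
    s3 = boto3.client("s3", region_name="eu-west-2")
    s3.create_bucket(
        Bucket="test-ingest-zone",
        CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
    )

    s3.put_object(Bucket="test-ingest-zone", Key="example.parquet", Body=b"data")

    contents = s3.list_objects_v2(Bucket="test-ingest-zone")["Contents"]
    assert [obj["Key"] for obj in contents] == ["example.parquet"]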
Databases
- PostgreSQL (used locally for integration tests, as well as being the database used on both sides of the data pipeline).
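For the local side, pytest-postgresql spins up a throwaway database per test, so integration tests never touch the pipeline's real databases. A minimal sketch, assuming the library's default `postgresql` connection fixture; the table and data are purely illustrative.

```python
def test_staff_roundtrip(postgresql):
    # `postgresql` is a connection to a temporary database created for this test
    # and discarded afterwards.
    with postgresql.cursor() as cur:
        cur.execute("CREATE TABLE staff (staff_id INT PRIMARY KEY, name TEXT);")
        cur.execute("INSERT INTO staff VALUES (1, 'Ada');")
        cur.execute("SELECT name FROM staff WHERE staff_id = 1;")
        assert cur.fetchone()[0] == "Ada"
```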
AWS
- Lambda, S3, Step Functions, IAM, CloudWatch, SNS email alerts, etc., all accessed using boto3, deployed with Terraform, and tested using Moto.
Utilities & Tooling
- uv - for managing Python environments, dependencies, and running scripts
- Makefile - for convenience, centralising and simplifying the project's most common commands (testing, linting, formatting, deployment, etc.)
Local Testing Scripts
- Bash scripts to run SQL test files against the local PostgreSQL database and capture output for validation.
- Clone or fork this repository and download it to your local machine: `git clone https://github.com/theorib/royal-blue.git`
- Change directory into the cloned repository: `cd royal-blue`
- Create a `.env` file at the root, based on the `.env.example` provided. Ensure the essential database variables (those containing `DB_`) are set. Others, like the S3 bucket names, are only needed if you plan to run scripts locally. (A sketch of reading and validating these variables in Python appears after these steps.)

      TOTESYS_DB_USER=some_user_abc
      TOTESYS_DB_PASSWORD=some_password_xyz
      TOTESYS_DB_HOST=host.something.com
      TOTESYS_DB_DATABASE=database_name
      TOTESYS_DB_PORT=0000
      DATAWAREHOUSE_DB_USER=some_user_abc
      DATAWAREHOUSE_DB_PASSWORD=some_password_xyz
      DATAWAREHOUSE_DB_HOST=host.something.com
      DATAWAREHOUSE_DB_DATABASE=database_name
      DATAWAREHOUSE_DB_PORT=0000
      # For local integration tests only:
      INGEST_ZONE_BUCKET_NAME=some_bucket_name
      PROCESS_ZONE_BUCKET_NAME=some_bucket_name
      LAMBDA_STATE_BUCKET_NAME=another_bucket_name
- If you forked this repository and want CI/CD to work as intended, you will need to create GitHub Secrets for the environment variables above (except the ones used only for local integration tests).
- Run the setup script (this will install dependencies, run tests and checks): `make setup`
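As referenced in the `.env` step above, Pydantic (listed under Core Python Dependencies) can check that the required variables are present and well-formed before anything connects to a database. The helper below is hypothetical, not code from the repository; the variable names are the ones from `.env.example`.

```python
import os

from pydantic import BaseModel, ValidationError


class WarehouseConfig(BaseModel):
    # Mirrors the DATAWAREHOUSE_* entries in .env.example.
    user: str
    password: str
    host: str
    database: str
    port: int  # Pydantic coerces the string from the environment to an int.


def load_warehouse_config() -> WarehouseConfig:
    try:
        return WarehouseConfig(
            user=os.environ["DATAWAREHOUSE_DB_USER"],
            password=os.environ["DATAWAREHOUSE_DB_PASSWORD"],
            host=os.environ["DATAWAREHOUSE_DB_HOST"],
            database=os.environ["DATAWAREHOUSE_DB_DATABASE"],
            port=os.environ["DATAWAREHOUSE_DB_PORT"],
        )
    except (KeyError, ValidationError) as exc:
        raise SystemExit(f"Missing or invalid data warehouse configuration: {exc}")
```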
With `uv` managing the environment, running your scripts is clean and consistent. Here's how to start:
- Activate the Python virtual environment: `source .venv/bin/activate`
- Set the `PYTHONPATH` environment variable to the current directory: `export PYTHONPATH=$(pwd)`
- Point `uv` to your local `.env` file so that environment variables are available to running scripts: `export UV_ENV_FILE=.env`
- Run Python scripts using `uv run`: `uv run src/lambdas/extract_lambda.py`

  For example, to run the tests: `uv run pytest`
Use these main commands for common tasks:
| Command      | Description                                                 |
|--------------|-------------------------------------------------------------|
| `make setup` | Complete installation and validation (sync, build, checks)  |
| `make test`  | Run all tests                                               |
| `make fix`   | Run formatter and linter                                    |
| `make safe`  | Run security scans                                          |
| `make help`  | Show all available make commands                            |
For a full list of commands and their descriptions, run: `make help`