An ELT pipeline with Docker, GCP, Airflow, Terraform, and dbt using Airbnb data
Suppose you are a data engineer at Airbnb. How would you design a data pipeline that meets the needs of data analysts, data scientists, and even LLM engineers?
We'll take on this challenge by using monthly data from Inside Airbnb, an initiative that gathers Airbnb listings, reviews, calendar data, and neighborhood information from many countries. Our goal is to build a data pipeline from scratch. Given the large volume of data, we use the records from Taipei, Taiwan, covering the period between August 2022 and December 2024. We begin by extracting and loading this data into a data lake and a data warehouse, and we complete this ELT pipeline by constructing data marts, enabling data analysts and data scientists to query the necessary columns for exploration and insight generation.
📝 Read the project docs generated with dbt to learn more about this data.
- On-Premise and Cloud ELT solutions
- Custom dbt macros and generic tests for BigQuery
- Multiple data marts for different stakeholders
  - Data analyst - Summary information and metrics included in listings.csv for listings in Taipei (good for visualizations)
  - Data scientist - Features generated based on "From Data to Insights: Segmenting Airbnb's Supply" from The Airbnb Tech Blog, for supply-pattern segmentation
- Dimensional modeling for a better understanding of the dataset
- Apache Airflow for data orchestration
- Google Cloud Storage for the data lake
- Google BigQuery for the data warehouse
- dbt for data transformation, tests and docs
- Astronomer Cosmos for better integration between Apache Airflow and dbt
- Looker Studio for data visualization
- Terraform for cloud resource provisioning
- Docker for containerizing services
- PostgreSQL (with the PostGIS extension enabled) for the on-prem warehouse and the Airflow backend
Final result from Google Looker Studio. Link here.
Since everything in this project is containerized, the only prerequisite is to install Docker. You can choose either of the solutions below to build and run the pipeline.
Note
Due to end-of-line (EOL) conversion issues, you may want to run `git config --global core.autocrlf input` before you clone the repo if you are using WSL, Linux, or macOS.
- Git clone this repo.
- Run `docker compose up` under the repo directory.
- Access `localhost:8080` with your browser and log in to Airflow with `admin:admin`.
- Click the toggle of `postgres-elt` to activate the DAG.

Note
The on-prem solution is currently under maintenance.
- Git clone this repo.
- Create a project `<your-project>` and a service account. Grant your service account the following two roles: `BigQuery Admin` and `Storage Admin`.
- Download the service account key file into the directory `credentials/` and rename it to `application_default_credentials.json`.
- See here for more details about setting up a GCP project and credentials.
- In your shell, set the following environment variables and run docker compose:
$> export TF_VAR_PROJECT=<your-project> # e.g. de-project-demo
$> export TF_VAR_BUCKET_NAME=<your-bucket-name> # e.g. de-project-bucket-72116
$> export TF_VAR_DATASET_ID=<your-dataset-id> # e.g. de_project_dataset_72116
$> docker compose up
- After Airflow is ready (see the note below), access `localhost:8080` with your browser for the Airflow web UI and log in with `admin:admin`.
- Click the toggle of `gcp-elt` to activate the DAG.
Note
GCP Settings
The bucket `<your-bucket-name>` and the dataset `<your-dataset-id>` will be created as the pipeline runs, so you don't need to create them manually. However:
- The bucket name must be globally unique, so we append a random number to the bucket name.
- The dataset ID does not allow hyphens, so we use underscores instead.
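For illustration, here is a small Python sketch (the helper names are hypothetical, not part of this repo) showing how names satisfying both constraints could be generated:

```python
import random

# Hypothetical helpers (not part of this repo) for building GCP-compliant names.

def make_bucket_name(prefix: str = "de-project-bucket") -> str:
    # Bucket names must be globally unique, so append a random suffix.
    return f"{prefix}-{random.randint(10000, 99999)}"

def make_dataset_id(prefix: str = "de-project-dataset") -> str:
    # BigQuery dataset IDs may not contain hyphens, so replace them with underscores.
    return f"{prefix}-{random.randint(10000, 99999)}".replace("-", "_")

print(make_bucket_name())  # e.g. de-project-bucket-72116
print(make_dataset_id())   # e.g. de_project_dataset_72116
```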
Warning
Even though the credentials directory is listed in .gitignore to prevent accidental uploads of credential keys, do not push the cloned repo unless you are certain there are no settings that could lead to credential leakage!
Note
When is your Airflow ready?
You can open the Airflow web UI once you see messages similar to the following:
webserver-1 | [2025-04-02 08:54:28 +0000] [87] [INFO] Listening at: http://0.0.0.0:8080 (87)
webserver-1 | [2025-04-02 08:54:28 +0000] [87] [INFO] Using worker: sync
webserver-1 | [2025-04-02 08:54:28 +0000] [347] [INFO] Booting worker with pid: 347
webserver-1 | [2025-04-02 08:54:29 +0000] [348] [INFO] Booting worker with pid: 348
webserver-1 | [2025-04-02 08:54:30 +0000] [349] [INFO] Booting worker with pid: 349
webserver-1 | [2025-04-02 08:54:32 +0000] [350] [INFO] Booting worker with pid: 350
If everything works as expected, you will see your Airflow orchestration results like this:
- If you are still running the Docker containers:
  - Run the DAG `gcp-tear-down` and wait until it completes all of its tasks.
  - Press `Ctrl+C` to stop the running containers.
  - Run `docker compose down` under the repo directory to remove the containers.
- If you have already run `docker compose down`:
  - Run `terraform destroy` under the directory `terraform/` (if you have Terraform installed on your OS, the lock files and other Terraform config files can still be used to remove the cloud resources).
  - Or delete the resources manually from the Google Cloud console.
Here is some information for those who want to modify or extend this project for their own use.
The variable `test_periods` controls how many months this data pipeline processes. The default value is set to 1 to facilitate testing or replication:
test_periods = 1
with DAG(
    "gcp-elt",
    # ...
    schedule=get_schedule(-test_periods),
    # ...
):
    # ...
If you would like to obtain data for more than one month, you can, for instance, set `test_periods = 13` (to create meaningful yearly features for the DS data marts) to store the listings data of the last 13 months in the data warehouse. To get the full dataset that I visualized in my dashboard, modify the `get_schedule()` argument like this:
with DAG(
    "gcp-elt",
    # ...
    schedule=get_schedule(start=0),  # or simply: get_schedule()
    # ...
):
    # ...
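To make the month arithmetic concrete, the snippet below is a hypothetical helper (not part of this repo) that lists which monthly Inside Airbnb snapshots a given `test_periods` value covers, assuming the newest snapshot is December 2024:

```python
from datetime import date

def months_covered(test_periods: int, latest: date = date(2024, 12, 1)) -> list[str]:
    # Hypothetical helper: list the monthly snapshots covered by `test_periods`,
    # counting backwards from the latest available month.
    months = []
    year, month = latest.year, latest.month
    for _ in range(test_periods):
        months.append(f"{year}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return list(reversed(months))

print(months_covered(1))   # ['2024-12']
print(months_covered(13))  # ['2023-12', '2024-01', ..., '2024-12'] -> 13 monthly snapshots
```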
Here we summarize how the components of the data pipeline are set up and where you can find their configurations in this repo.
- The database account (username:password = `airflow:airflow`) and the default database `airflow` of PostgreSQL are set up in `docker-compose.yml`.
- Airflow-related environment variables are defined in `airflow/airflow.env`, which is passed to Docker Compose as an env-file.
- Airflow's login credentials (username:password = `admin:admin`) are set up in `entrypoint.sh`. The script `entrypoint.sh` spins up the Airflow webserver and scheduler after the Airflow user account is created.
- The `airflow/Dockerfile` installs Terraform and the Python dependencies listed in `requirements.txt` during the build stage.
- Terraform receives cloud-related environment variables (prefixed with `TF_VAR_`) from Docker Compose (`-e` options) and applies them in `/terraform/main.tf` to provision cloud resources.
- Airflow DAGs obtain these environment variables by referring to `os.environ` in Python.
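As a rough sketch of that last point (illustrative code, not copied from the DAG files in this repo), reading those variables inside a DAG module might look like:

```python
import os

# The same TF_VAR_* variables that Docker Compose passes to Terraform
# are visible to the Airflow DAGs through the process environment.
PROJECT = os.environ["TF_VAR_PROJECT"]
BUCKET_NAME = os.environ["TF_VAR_BUCKET_NAME"]
DATASET_ID = os.environ["TF_VAR_DATASET_ID"]
```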
- dbt profiles are stored separately for the cloud and on-premises setups under `dbt/gcp` and `dbt/postgres`, respectively.
- The cloud dbt profile reuses the Terraform environment variables (`TF_VAR_*`) to connect to Google Cloud Platform (GCP).
- By default, the Google Cloud client looks for its key file at `~/.config/gcloud/application_default_credentials.json` (see here). To ensure access within the Docker container, we bind-mount the `/credentials` directory to `/root/.config/gcloud/`.
- For simplicity, both Terraform (`main.tf`) and dbt (`dbt/gcp/profiles.yml`) use the same key file (`application_default_credentials.json`) to interact with Google Cloud.
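If you want to confirm that the bind-mounted key file is picked up inside the container, a quick sanity check (assuming the google-auth package is available, as the GCP client libraries depend on it) is:

```python
# Application Default Credentials should resolve to the key file mounted at
# /root/.config/gcloud/application_default_credentials.json.
import google.auth

credentials, project = google.auth.default()
print("Resolved project:", project)  # may be None for user (non-service-account) credentials
```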