feat(dbt): add database and schema pattern filtering support #14689

abdullahtariqq · 2025-09-05T14:58:06Z

Overview

This PR enhances the dbt ingestion source with fine-grained filtering capabilities by adding database_pattern and schema_pattern configuration options. Users can now filter dbt nodes not only by node names but also by their database and schema attributes.

Changes Made

🔧 Core Implementation:

- Added database_pattern and schema_pattern config fields using AllowDenyPattern
- Enhanced the _is_allowed_node() method to evaluate database and schema patterns alongside the existing node name filtering
- Maintained backward compatibility: both new patterns default to allow_all(), so existing recipes behave as before
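
A minimal sketch of how the combined check could behave is shown below. It is not the PR's actual code: `SimplePattern` is a hypothetical, simplified stand-in for DataHub's AllowDenyPattern, `Node` is a hypothetical reduction of a dbt node, and passing nodes with no database or schema through the filter is an assumption made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimplePattern:
    """Hypothetical, simplified stand-in for DataHub's AllowDenyPattern."""

    allow: List[str] = field(default_factory=lambda: [".*"])  # default: allow all
    deny: List[str] = field(default_factory=list)

    def allowed(self, value: str) -> bool:
        # A deny match always wins; otherwise some allow regex must match.
        if any(re.match(p, value, re.IGNORECASE) for p in self.deny):
            return False
        return any(re.match(p, value, re.IGNORECASE) for p in self.allow)


@dataclass
class Node:
    """Hypothetical minimal dbt node with only the attributes the filter reads."""

    dbt_name: str
    database: Optional[str]
    schema: Optional[str]


def is_allowed_node(
    node: Node,
    node_name_pattern: SimplePattern,
    schema_pattern: SimplePattern,
    database_pattern: SimplePattern,
) -> bool:
    # Existing behaviour: filter on the node name first.
    if not node_name_pattern.allowed(node.dbt_name):
        return False
    # Assumption: a node without a database or schema is not filtered on it.
    if node.database is not None and not database_pattern.allowed(node.database):
        return False
    if node.schema is not None and not schema_pattern.allowed(node.schema):
        return False
    return True


# Usage: only production schemas in one project pass the filter.
schema_pattern = SimplePattern(allow=["prod_.*", "analytics"])
database_pattern = SimplePattern(allow=["my-prod-project"])
node = Node("model.shop.orders", "my-prod-project", "prod_sales")
print(is_allowed_node(node, SimplePattern(), schema_pattern, database_pattern))
```

With all three patterns left at their defaults, every node passes, which is what keeps existing recipes unchanged in this sketch.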

📝 Configuration Examples:

Filter by schema (e.g., only production schemas)

  schema_pattern:
    allow: ["prod_.*", "analytics"]

Filter by database (e.g., specific BigQuery project)

  database_pattern:
    allow: ["my-prod-project"]

Combined filtering

  node_name_pattern:
    deny: ["staging_.*"]
  schema_pattern:
    allow: ["prod_.*"]
  database_pattern:
    allow: ["my-prod-project"]

Use Cases

- Multi-tenant environments: Filter by database/project to avoid cross-contamination
- Environment isolation: Ingest only production schemas, skip staging/dev
- Large dbt projects: Fine-tune ingestion scope to reduce noise and improve performance
- Multi-project setups: Select specific databases when multiple are present

- Add database_pattern and schema_pattern config fields with AllowDenyPattern support
- Enhance _is_allowed_node() to filter nodes by database and schema in addition to node names
- Add comprehensive integration tests for new filtering capabilities
- Support combined filtering patterns for fine-grained dbt ingestion control

codecov · 2025-09-05T15:07:39Z

Bundle Report

Changes will decrease total bundle size by 645 bytes (-0.0%) ⬇️. This is within the configured threshold ✅

Detailed changes

Bundle name	Size	Change
datahub-react-web-esm	28.56MB	-645 bytes (-0.0%) ⬇️

Affected Assets, Files, and Routes:

Assets Changed:

Asset Name	Size Change	Total Size	Change (%)
`assets/index-*.js`	-645 bytes	18.91MB	-0.0%

codecov · 2025-09-06T00:39:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py

skrydal · 2025-09-08T11:40:36Z

Hello @abdullahtariqq, thank you for sharing this contribution. I think the feature is a bit complex, due to the details of how dbt references are handled. To be specific, node.dbt_name is completely independent of node.schema, node.database, or node.name: the former is an internal dbt reference, while the others might refer to an external table.
Therefore, I would introduce a clear separation between those two checks to avoid confusion. For example, we could have a separate config field, source_pattern, which contains the trifecta "table", "schema", "database" and follows the DataHub-wide pattern matching:

database pattern is matched against database name
schema pattern is matched against f"{database}.{schema}" (fully qualified name)
table pattern is matched against f"{database}.{schema}.{table}" (fully qualified name)

There is an open question whether this matching should be applied only to nodes of type source or to all external references. What do you think about it?

The _is_node_allowed function would work as follows:

1. If dbt_name is not allowed, return False.
2. If the node is of type source, match node.name, node.schema, and node.database against the patterns, as specified above.
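
A minimal sketch of the two steps above, under stated assumptions: `SimplePattern` is a hypothetical, simplified stand-in for DataHub's AllowDenyPattern, `SourcePattern` and `Node` are hypothetical shapes for the proposed source_pattern field and a dbt node, and the check is applied only to nodes of type source (the open question above is not settled here).

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimplePattern:
    """Hypothetical, simplified stand-in for DataHub's AllowDenyPattern."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, value: str) -> bool:
        # A deny match always wins; otherwise some allow regex must match.
        if any(re.match(p, value, re.IGNORECASE) for p in self.deny):
            return False
        return any(re.match(p, value, re.IGNORECASE) for p in self.allow)


@dataclass
class SourcePattern:
    """Hypothetical shape of source_pattern: database, schema, table."""

    database: SimplePattern = field(default_factory=SimplePattern)
    schema: SimplePattern = field(default_factory=SimplePattern)
    table: SimplePattern = field(default_factory=SimplePattern)


@dataclass
class Node:
    """Hypothetical minimal dbt node."""

    dbt_name: str
    node_type: str
    name: str
    schema: str
    database: str


def is_node_allowed(
    node: Node, node_name_pattern: SimplePattern, source_pattern: SourcePattern
) -> bool:
    # Step 1: the internal dbt reference is checked on its own.
    if not node_name_pattern.allowed(node.dbt_name):
        return False
    # Step 2: only source nodes are matched against source_pattern.
    if node.node_type == "source":
        db = node.database
        db_schema = f"{db}.{node.schema}"  # fully qualified schema name
        fq_table = f"{db_schema}.{node.name}"  # fully qualified table name
        return (
            source_pattern.database.allowed(db)
            and source_pattern.schema.allowed(db_schema)
            and source_pattern.table.allowed(fq_table)
        )
    return True
```

Because the schema and table patterns see fully qualified names, a schema regex in this sketch has to account for the database prefix, for example `analytics\.prod_.*` rather than `prod_.*`.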

Is this aligned with the use-case you had in mind when creating it?

Of course, the golden files need to be updated accordingly.

github-actions bot added the ingestion and community-contribution labels Sep 5, 2025

abdullahtariqq force-pushed the feat-dbt-improve-filtering branch from b7cbd87 to d702148 Compare September 5, 2025 15:00

datahub-cyborg bot added the needs-review label Sep 5, 2025

vercel bot deployed to Preview September 5, 2025 15:45

abdullahtariqq force-pushed the feat-dbt-improve-filtering branch from 05d2222 to d702148 Compare September 6, 2025 00:03

skrydal reviewed Sep 8, 2025

datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Sep 8, 2025

fix: apply proposed changes

8f53dd9

datahub-cyborg bot added the needs-review label and removed the pending-submitter-response label Sep 8, 2025

vercel bot deployed to Preview September 8, 2025 11:09

datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Sep 8, 2025
