abdullahtariqq
Overview

This PR enhances the dbt ingestion source with fine-grained filtering capabilities by adding database_pattern and schema_pattern configuration options. Users can now filter dbt nodes not only by node names but also by their database and schema attributes.

Changes Made

🔧 Core Implementation:

  • Added database_pattern and schema_pattern config fields using AllowDenyPattern
  • Enhanced _is_allowed_node() method to evaluate database and schema patterns alongside existing node name filtering
  • Maintained backward compatibility: both new patterns default to allow_all(), so existing recipes behave as before
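
As a rough illustration of the change described in the bullets above, the sketch below shows how the three patterns could combine. It does not reproduce the PR's code. `SimplePattern` is a hypothetical stand-in for DataHub's AllowDenyPattern (assumed here to check deny rules first and to match regexes from the start of the string), and `Node`, `FilterConfig`, and `is_allowed_node` are illustrative names. Matching the name pattern against `dbt_name` is also an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimplePattern:
    """Hypothetical stand-in for AllowDenyPattern; the default allows everything."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, value: str) -> bool:
        # Assumption: deny rules win, and patterns match from the start of the string.
        if any(re.match(p, value) for p in self.deny):
            return False
        return any(re.match(p, value) for p in self.allow)


@dataclass
class Node:
    """Illustrative subset of a dbt node's attributes."""

    dbt_name: str
    database: str
    schema: str


@dataclass
class FilterConfig:
    # Every pattern defaults to allow-all, which keeps old configs working unchanged.
    node_name_pattern: SimplePattern = field(default_factory=SimplePattern)
    database_pattern: SimplePattern = field(default_factory=SimplePattern)
    schema_pattern: SimplePattern = field(default_factory=SimplePattern)


def is_allowed_node(node: Node, config: FilterConfig) -> bool:
    # A node is kept only if all three patterns accept it.
    return (
        config.node_name_pattern.allowed(node.dbt_name)
        and config.database_pattern.allowed(node.database)
        and config.schema_pattern.allowed(node.schema)
    )
```

With a default `FilterConfig()`, every node passes, which is the backward-compatible behavior the last bullet refers to.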

📝 Configuration Examples:

Filter by schema (e.g., only production schemas)

  schema_pattern:
    allow: ["prod_.*", "analytics"]

Filter by database (e.g., specific BigQuery project)

  database_pattern:
    allow: ["my-prod-project"]

Combined filtering

  node_name_pattern:
    deny: ["staging_.*"]
  schema_pattern:
    allow: ["prod_.*"]
  database_pattern:
    allow: ["my-prod-project"]
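
To make the combined example concrete, here is a small sketch that evaluates it against a few made-up nodes. The matching rules in `accepts` are assumptions (deny wins, a missing allow list allows everything, regexes are anchored at the start only), and the node names, schemas, and databases are hypothetical. The real source may match the name pattern against a different identifier than the plain string used here.

```python
import re
from typing import Dict, List

# The "Combined filtering" example, written as the dict a YAML parser would produce.
combined: Dict[str, Dict[str, List[str]]] = {
    "node_name_pattern": {"deny": ["staging_.*"]},
    "schema_pattern": {"allow": ["prod_.*"]},
    "database_pattern": {"allow": ["my-prod-project"]},
}


def accepts(pattern: Dict[str, List[str]], value: str) -> bool:
    # Assumption: deny rules are checked first; no allow list means allow everything.
    if any(re.match(p, value) for p in pattern.get("deny", [])):
        return False
    return any(re.match(p, value) for p in pattern.get("allow", [".*"]))


def keep(node_name: str, schema: str, database: str) -> bool:
    # All three checks must pass for the node to be ingested.
    return (
        accepts(combined["node_name_pattern"], node_name)
        and accepts(combined["schema_pattern"], schema)
        and accepts(combined["database_pattern"], database)
    )


print(keep("orders", "prod_sales", "my-prod-project"))          # kept
print(keep("staging_orders", "prod_sales", "my-prod-project"))  # dropped by the name deny rule
print(keep("orders", "dev_sales", "my-prod-project"))           # dropped by the schema allow rule
print(keep("orders", "prod_sales", "other-project"))            # dropped by the database allow rule
```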

Use Cases

  • Multi-tenant environments: Filter by database/project to avoid cross-contamination
  • Environment isolation: Ingest only production schemas, skip staging/dev
  • Large dbt projects: Fine-tune ingestion scope to reduce noise and improve performance
  • Multi-project setups: Select specific databases when multiple are present

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Sep 5, 2025
- Add database_pattern and schema_pattern config fields with AllowDenyPattern support
- Enhance _is_allowed_node() to filter nodes by database and schema in addition to node names
- Add comprehensive integration tests for new filtering capabilities
- Support combined filtering patterns for fine-grained dbt ingestion control
@abdullahtariqq abdullahtariqq force-pushed the feat-dbt-improve-filtering branch from b7cbd87 to d702148 Compare September 5, 2025 15:00
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Sep 5, 2025
codecov bot commented Sep 5, 2025

Bundle Report

Changes will decrease total bundle size by 645 bytes (-0.0%) ⬇️. This is within the configured threshold ✅

Detailed changes
Bundle name            Size     Change
datahub-react-web-esm  28.56MB  -645 bytes (-0.0%) ⬇️

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

Asset Name         Size Change  Total Size  Change (%)
assets/index-*.js  -645 bytes   18.91MB     -0.0%

codecov bot commented Sep 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Sep 8, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Sep 8, 2025
@skrydal
Collaborator

skrydal commented Sep 8, 2025

Hello @abdullahtariqq, thank you for sharing this contribution. I think the feature is somewhat complex because of how dbt handles references. Specifically, node.dbt_name is completely independent of node.schema, node.database, and node.name: the former is an internal dbt reference, while the others may refer to an external table.
Therefore, I would introduce a clear separation between those two checks to avoid confusion. For example, we could add a separate config field, source_pattern, containing the three patterns "table", "schema", and "database" and following DataHub-wide pattern matching:

  1. database pattern is matched against database name
  2. schema pattern is matched against f"{database}.{schema}" (fully qualified name)
  3. table pattern is matched against f"{database}.{schema}.{table}" (fully qualified name)
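
The three matching rules above could be sketched roughly as follows. The function names and the allow-list-only signature are hypothetical simplifications, and matching regexes from the start of the string is an assumption about how the patterns are applied.

```python
import re
from typing import List


def matches_any(patterns: List[str], value: str) -> bool:
    # Assumption: patterns are regexes matched from the start of the string.
    return any(re.match(p, value) for p in patterns)


def source_allowed(
    database: str,
    schema: str,
    table: str,
    database_allow: List[str],
    schema_allow: List[str],
    table_allow: List[str],
) -> bool:
    return (
        # 1. database pattern against the bare database name
        matches_any(database_allow, database)
        # 2. schema pattern against the fully qualified schema name
        and matches_any(schema_allow, f"{database}.{schema}")
        # 3. table pattern against the fully qualified table name
        and matches_any(table_allow, f"{database}.{schema}.{table}")
    )
```

One consequence under these assumptions: a schema pattern such as `prod_.*` would not match `my-prod-project.prod_sales`, because the string now starts with the database name. It would need to be written against the qualified form, for example `my-prod-project\.prod_.*`.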

An open question is whether this matching should apply only to nodes of type source or to all external references. What do you think?

The _is_allowed_node function would then work as follows:

  1. if dbt_name is not allowed, return False
  2. if node is of type source, then match node.name, node.schema, node.database against the patterns, as specified above
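
A minimal sketch of the two-step flow above, under stated assumptions: the `Node` fields and the `node_type` attribute are illustrative, the two checks are passed in as plain callables rather than real pattern objects, and letting non-source nodes through after step 1 is an assumption, since the steps do not say what happens to them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    """Illustrative subset of a dbt node's attributes."""

    dbt_name: str
    node_type: str
    name: str
    schema: str
    database: str


def is_node_allowed(
    node: Node,
    name_ok: Callable[[str], bool],
    source_ok: Callable[[str, str, str], bool],
) -> bool:
    # Step 1: the internal dbt reference must pass the name pattern.
    if not name_ok(node.dbt_name):
        return False
    # Step 2: only source nodes are checked against database/schema/table.
    if node.node_type == "source":
        return source_ok(node.database, node.schema, node.name)
    # Assumption: other node types are not subject to the source patterns.
    return True
```

Keeping the two checks as separate callables mirrors the separation suggested in the comment: the name check never sees the database or schema, and the source check never sees the dbt reference.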

Is this aligned with the use-case you had in mind when creating it?

Of course, the golden files need to be updated accordingly.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Sep 8, 2025