Skip to content

Conversation

tombaeyens
Copy link
Contributor

@tombaeyens tombaeyens commented Aug 27, 2025

(description by Paul)

This is a monster PR that incorporates several branches of work which were happening in parallel and then eventually wound up here. I believe the Athena support and the Snowflake auth addition have already been reviewed from a prior branch. The rest is new in this branch.
(update -- Athena was already merged but for some reason it's showing up in the diff as if it was new, not sure why)

Structural changes

  • Use new concept of DataSourceNamespaces to replace prefixes
  • Metadata columns query refactored, entrypoint moved from SQlDialect to DataSourceImpl, support for extracting precision & scale
  • Refactored Soda data types
  • Added contract interfaces*

New feature implementations

  • Athena support
  • Support precision and scale in numeric and datetime types
  • Added bidrectional data mappings for all data sources and listed supported/unsupported types
    • Mappings are a first pass, not final, we will review in a future PR. Don't consider mapping correctness in this review
  • Added snowflake oauth login path

Fixes

  • Fixed typo (supports_data_type_character_maximun_length) -> supports_data_type_character_maximum_length
  • fixed an issue with sql server regexes
  • Other small fixes

@tombaeyens tombaeyens marked this pull request as draft August 27, 2025 16:32
pre-commit-ci bot and others added 28 commits August 27, 2025 16:33
Copy link

@Niels-b Niels-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good to me. Just left a few suggestions.
Now that tests are passing, I think its best to get this merged as soon as possible since this PR is already very large. If we need to make additional changes, we can always open a new branch and PR.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remark in general (not to be fixed in this PR): I think the get_sql_data_type_str_with_parameters(self) function should be something data source specific (i.e. put it in the dialect) instead of being a global one. Now it requires "hacky" workarounds in some data sources to get it to work.
Not to be fixed immediately (certainly not in this PR). This doesn't seem like an easy fix.

def default_numeric_scale(self) -> Optional[int]:
return None

def build_columns_metadata_query_str(self, table_namespace: DataSourceNamespace, table_name: str) -> str:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a large chunk of code to have in the dialect file.
I think the dialect is a nice place for this function, but maybe we can move the actual implementation to another file to keep it a bit cleaner?
No hard opinion, just a suggestion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach would be valid in a couple of places: replace a single file by a directory with multiple modules, maybe even introduce packages? E.g. splitting up the current soda_cloud.py a bit as well.

No hard requirement for the time being.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI; we tried moving this, but we can only really move two functions (build_columns_metadata_query_str and build_column_metadatas_from_query_result). The other functions are overwritten in the inheriting classes, so they require different logic. Currently makes little sense to add the extra complexity of moving those two functions, but still pointing towards the dialect. That makes it so you're bouncing between files way too much.
"Nice to have" at the moment to move all of this: we might need an extra "dialect-like" class for each datasource impl just to do the metadata stuff. But then the argument can be made to just keep it in the dialect 🙃 .

return self.sql_dialect.build_column_metadatas_from_query_result(query_result)

def build_columns_metadata_query_str(self, dataset_prefixes: list[str], dataset_name: str) -> str:
database_index: int | None = self.sql_dialect.get_database_prefix_index()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generic remark: I'm still not happy about the way we deal with the concepts of "database" and "schema", we still need to work on a proper terminology and/or abstraction IMHO.

This snippet still seems pretty brittle, but that's work for another time.

def default_numeric_scale(self) -> Optional[int]:
return None

def build_columns_metadata_query_str(self, table_namespace: DataSourceNamespace, table_name: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach would be valid in a couple of places: replace a single file by a directory with multiple modules, maybe even introduce packages? E.g. splitting up the current soda_cloud.py a bit as well.

No hard requirement for the time being.

Copy link

@Niels-b Niels-b marked this pull request as ready for review September 16, 2025 09:01
@Niels-b Niels-b merged commit 626a35b into v4 Sep 16, 2025
27 checks passed
@Niels-b Niels-b deleted the dwh-mapping-unsupported-types branch September 16, 2025 09:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants