
Conversation

@ghjklw (Contributor) commented Mar 8, 2025

Soda-core-spark-df declares a dependency on pyspark>=3.4.0, but pyspark is used only for typing, so it should be declared as an optional extra to avoid unnecessary dependency conflicts.

Closes #2217
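
For context, this is the kind of typing-only usage the change relies on: a minimal sketch (the helper name and signature are hypothetical, not Soda's actual code) of importing pyspark under typing.TYPE_CHECKING so the module can be imported even when pyspark is not installed:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so importing
    # this module does not require pyspark to be installed.
    from pyspark.sql import DataFrame, SparkSession


def run_query(spark_session: SparkSession, sql: str) -> DataFrame:
    # Hypothetical helper: at runtime it only uses the session object passed
    # in by the caller; pyspark itself is never imported here.
    return spark_session.sql(sql)
```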

sonarqubecloud bot commented Mar 8, 2025

@m1n0 (Contributor) commented Mar 17, 2025

Thanks for the contrib @ghjklw. The pinned version ensures that the expected programmatic API (i.e. the classes in this case) is present. The version we are pinning is 2 years old; do you think that is still too much to ask?

@ghjklw (Contributor, Author) commented Mar 17, 2025

Hi @m1n0, I think that in this case the version requirement is perfectly reasonable, no issue with that 🙂

The specific challenge I'm facing is that the packages pyspark and databricks-connect both provide the pyspark module and are incompatible with each other, as documented in https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/install#install-the-databricks-connect-client-with-venv and https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/troubleshooting#conflicting-pyspark-installations.

To be fair, this is much more of a Databricks issue than a Soda issue, but it's also much easier to fix on this side 😉 In practice, when I override soda-core-spark-df's dependency on pyspark, it works flawlessly.

By "ensures that the expected programmatic API (i.e. the classes in this case) is present", do you mean:

  • when running unit/integration tests? In this case I believe my PR covers it as we still install PySpark to run the tests

  • at runtime? If that is the case, I think (personal opinion 😉) that the hard pin has stronger consequences than the risk it mitigates. The only PySpark methods I found used in the code are: SparkSession.sql, DataFrame.collect, DataFrame.count, DataFrame.offset, DataFrame.limit, and DataFrame.schema.fields. Of these, only offset is somewhat new (version 3.4.0).

    Do you think it might be a good enough solution to update the documentation to say to install soda-spark-df[pyspark] unless you already have a pyspark>=3.4.0 installation? This would work for 99% of users, while those who need flexibility in how they manage their PySpark environment would still be able to keep it. You could even add a specific note about databricks-connect. A sketch of what such an extra could look like follows below.
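
For illustration, a minimal sketch of how such an optional extra could be declared with setuptools (the dependency lists are simplified assumptions, not copied from Soda's actual setup.py):

```python
# Hypothetical setup.py sketch: pyspark moves from install_requires to an
# optional extra, so users relying on databricks-connect can skip it entirely.
from setuptools import setup

setup(
    name="soda-core-spark-df",
    install_requires=[
        # ...core dependencies, without pyspark...
    ],
    extras_require={
        # Installed with: pip install "soda-core-spark-df[pyspark]"
        "pyspark": ["pyspark>=3.4.0"],
    },
)
```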

@m1n0 (Contributor) commented Mar 18, 2025

The offset method is what caused the pin to be >= 3.4.0. I don't think just having it in the docs would be good-enough UX, especially stepping back from the stricter approach we have now. The pin was motivated by multiple users coming to us with .cursor() is not defined issues.

I would rather move forward and not back on this. How about an extra that removes the pyspark dependency? Doing that and adding it to the docs would not break the existing user experience and would allow users in scenarios like yours to move forward without friction.

@ghjklw (Contributor, Author) commented Mar 19, 2025

That would be a good option, but I'm not aware of any way to define an extra with "negative" dependencies.

There have been several discussions related to similar problems.

It looks like this will be coming eventually, but it will take some time.

There are only two options left that I can think of:

  • A runtime check (a minimal sketch follows this list), but this might not really address your concern about backward compatibility.
  • Defining a separate package, e.g. soda-core-spark-df vs. soda-spark-df, where the latter adds the pyspark dependency. However, this is quite a heavy solution...
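
For the first option, a rough sketch of what such a runtime check could look like (the function name and error wording are hypothetical):

```python
# Hypothetical runtime check: fail fast with an actionable message instead of
# relying on an install-time pin on pyspark.
def check_pyspark(minimum: tuple[int, int] = (3, 4)) -> None:
    try:
        import pyspark
    except ImportError as exc:
        raise ImportError(
            "soda-core-spark-df requires the pyspark module (provided by either "
            "pyspark or databricks-connect), e.g. pip install 'pyspark>=3.4.0'"
        ) from exc

    major, minor, *_ = pyspark.__version__.split(".")
    if (int(major), int(minor)) < minimum:
        raise ImportError(
            f"pyspark>={minimum[0]}.{minimum[1]} is required, "
            f"found {pyspark.__version__}"
        )
```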

@migueldoblado commented

> The offset method is what caused the pin to be >= 3.4.0. I don't think just having it in the docs would be good-enough UX, especially stepping back from the stricter approach we have now. The pin was motivated by multiple users coming to us with .cursor() is not defined issues.
>
> I would rather move forward and not back on this. How about an extra that removes the pyspark dependency? Doing that and adding it to the docs would not break the existing user experience and would allow users in scenarios like yours to move forward without friction.

Hi, just in case it adds something to the discussion: I had some issues with local tests when upgrading from soda-core-spark-df 3.3.5 to 3.3.6, because of the offset method. If I am not wrong, the offset method has been available in Scala in Apache Spark since 3.4.0 (as mentioned in the docs), but it was only added to pyspark in the 3.5.0 release, so executing it with earlier versions fails despite the pyspark>=3.4.0 requirement.
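
Given that the declared version and the actual Python API can diverge like this, a capability check on the method itself (rather than on the version string) might be a more robust guard. A hedged sketch of that idea:

```python
# Hypothetical capability check: verify the installed pyspark really exposes
# DataFrame.offset instead of trusting the package version alone.
from pyspark.sql import DataFrame

if not hasattr(DataFrame, "offset"):
    raise ImportError(
        "The installed pyspark does not provide DataFrame.offset; "
        "soda-core-spark-df needs a pyspark release that includes it."
    )
```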

Linked issue: soda-core-spark-df unnecessary dependency on PySpark