
Conversation

@ghjklw (Contributor) commented Mar 8, 2025

Soda-core-spark-df declares a dependency on pyspark>=3.4.0, but pyspark is used only for typing, so it should be declared as an optional extra to avoid unnecessary dependency conflicts.

Closes #2217
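
For context, this is the kind of typing-only usage the change relies on: a minimal sketch (the helper name and signature are hypothetical, not Soda's actual code) of importing pyspark under typing.TYPE_CHECKING so the module can be imported even when pyspark is not installed:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so importing
    # this module does not require pyspark to be installed.
    from pyspark.sql import DataFrame, SparkSession


def run_query(spark_session: SparkSession, sql: str) -> DataFrame:
    # Hypothetical helper: at runtime it only uses the session object passed
    # in by the caller; pyspark itself is never imported here.
    return spark_session.sql(sql)
```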

sonarqubecloud bot commented Mar 8, 2025

@m1n0 (Contributor) commented Mar 17, 2025

Thanks for the contrib @ghjklw. The pinned version ensures that the expected programmatic API (i.e. the classes in this case) is present. The version we are pinning is 2 years old; do you think that is still too much to ask?

@ghjklw (Contributor, Author) commented Mar 17, 2025

Hi @m1n0, I think that in this case the version requirement is perfectly reasonable, no issue with that 🙂

The specific challenge I'm facing is that the packages pyspark and databricks-connect both provide the pyspark module and are incompatible with each other, as documented in https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/install#install-the-databricks-connect-client-with-venv and https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/troubleshooting#conflicting-pyspark-installations.

To be fair, this is much more of a Databricks issue than a Soda issue, but it's also much easier to fix on this side 😉 In practice, when I override soda-core-spark-df's dependency on pyspark, it works flawlessly.

By "ensures that the expected programmatic API (i.e. the classes in this case) is present", do you mean:

  • when running unit/integration tests? In this case I believe my PR covers it as we still install PySpark to run the tests

  • at runtime? If that is the case, I think (personal opinion 😉) that the hard pin has stronger consequences than the risk it mitigates. The only PySpark methods I found used in the code are: SparkSession.sql, DataFrame.collect, DataFrame.count, DataFrame.offset, DataFrame.limit, and DataFrame.schema.fields. Of these, only offset is somewhat new (version 3.4.0).

    Do you think it might be a good enough solution to update the documentation to say to install soda-spark-df[pyspark] unless you already have a pyspark>=3.4.0 installation? This would work for 99% of users, while those who need flexibility in how they manage their PySpark environment would still be able to keep it. You could even add a specific note about databricks-connect. A sketch of what such an extra could look like follows below.
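
For illustration, a minimal sketch of how such an optional extra could be declared with setuptools (the dependency lists are simplified assumptions, not copied from Soda's actual setup.py):

```python
# Hypothetical setup.py sketch: pyspark moves from install_requires to an
# optional extra, so users relying on databricks-connect can skip it entirely.
from setuptools import setup

setup(
    name="soda-core-spark-df",
    install_requires=[
        # ...core dependencies, without pyspark...
    ],
    extras_require={
        # Installed with: pip install "soda-core-spark-df[pyspark]"
        "pyspark": ["pyspark>=3.4.0"],
    },
)
```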

@m1n0 (Contributor) commented Mar 18, 2025

The offset method is what caused the pin to be >= 3.4.0. I don't think just having it in the docs would be good-enough UX, especially stepping back from the stricter approach we have now. The pin was motivated by multiple users coming to us with .cursor() is not defined issues.

I would rather move forward and not back on this. How about an extra that removes the pyspark dependency? Doing that and adding it to the docs would not break the existing user experience and would allow users in scenarios like yours to move forward without friction.

@ghjklw (Contributor, Author) commented Mar 19, 2025

That would be a good option, but I'm not aware of any way to define an extra with "negative" dependencies.

There have been several discussions related to similar problems.

It looks like this will be coming eventually, but it will take some time.

There are only two options left that I can think of:

  • A runtime check (a minimal sketch follows this list), but this might not really address your concern about backward compatibility.
  • Defining a separate package, e.g. soda-core-spark-df vs. soda-spark-df, where the latter adds the pyspark dependency. However, this is quite a heavy solution...
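
For the first option, a rough sketch of what such a runtime check could look like (the function name and error wording are hypothetical):

```python
# Hypothetical runtime check: fail fast with an actionable message instead of
# relying on an install-time pin on pyspark.
def check_pyspark(minimum: tuple[int, int] = (3, 4)) -> None:
    try:
        import pyspark
    except ImportError as exc:
        raise ImportError(
            "soda-core-spark-df requires the pyspark module (provided by either "
            "pyspark or databricks-connect), e.g. pip install 'pyspark>=3.4.0'"
        ) from exc

    major, minor, *_ = pyspark.__version__.split(".")
    if (int(major), int(minor)) < minimum:
        raise ImportError(
            f"pyspark>={minimum[0]}.{minimum[1]} is required, "
            f"found {pyspark.__version__}"
        )
```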

@migueldoblado commented

> The offset method is what caused the pin to be >= 3.4.0. I don't think just having it in the docs would be good-enough UX, especially stepping back from the stricter approach we have now. The pin was motivated by multiple users coming to us with .cursor() is not defined issues.
>
> I would rather move forward and not back on this. How about an extra that removes the pyspark dependency? Doing that and adding it to the docs would not break the existing user experience and would allow users in scenarios like yours to move forward without friction.

Hi, just in case it adds something to the discussion: I had some issues with local tests when upgrading from soda-core-spark-df 3.3.5 to 3.3.6, because of the offset method. If I am not wrong, the offset method has been available in Scala in Apache Spark since 3.4.0 (as mentioned in the docs), but it was only added to pyspark in the 3.5.0 release, so executing it with earlier versions fails despite the pyspark>=3.4.0 requirement.
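
Given that the declared version and the actual Python API can diverge like this, a capability check on the method itself (rather than on the version string) might be a more robust guard. A hedged sketch of that idea:

```python
# Hypothetical capability check: verify the installed pyspark really exposes
# DataFrame.offset instead of trusting the package version alone.
from pyspark.sql import DataFrame

if not hasattr(DataFrame, "offset"):
    raise ImportError(
        "The installed pyspark does not provide DataFrame.offset; "
        "soda-core-spark-df needs a pyspark release that includes it."
    )
```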

Linked issue: soda-core-spark-df unnecessary dependency on PySpark