Sparkless vs PySpark: Parity Issues Reported Upstream

While comparing Sparkless (the Python drop-in replacement for PySpark at github.com/eddiethedean/sparkless) with PySpark, we found four parity issues and reported them on the Sparkless repo. This document lists those issues, explains how to run the script that found them, and summarizes the workarounds.

How to reproduce

From the repo root:

pip install sparkless
python scripts/sparkless_parity_check.py

The script runs a battery of PySpark-idiomatic operations against Sparkless and prints which checks fail.
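
The script itself is not reproduced here. As a rough illustration of the approach described above, a check runner could look like the following minimal sketch. The names `run_checks` and `report` are hypothetical and are not taken from scripts/sparkless_parity_check.py; each check is assumed to be a zero-argument callable that raises when Sparkless behaves differently from PySpark.

```python
from typing import Callable, Dict, List, Tuple


def run_checks(checks: Dict[str, Callable[[], None]]) -> List[Tuple[str, str]]:
    """Run every check and return (name, error message) for each failure."""
    failures = []
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:  # any exception is treated as a failed check
            failures.append((name, f"{type(exc).__name__}: {exc}"))
    return failures


def report(failures: List[Tuple[str, str]]) -> str:
    """Format the failures as one line per failed check."""
    lines = [f"FAIL {name}: {message}" for name, message in failures]
    return "\n".join(lines) if lines else "All checks passed"
```

In a real run, each callable would wrap one PySpark-idiomatic operation (for example, a call to orderBy with a list of names) executed against a Sparkless session.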

Reported issues (Sparkless GitHub)

| Issue | Title | Link |
| --- | --- | --- |
| #412 | API: SparkSession.builder() causes TypeError (not callable) | https://github.com/eddiethedean/sparkless/issues/412 |
| #413 | union() fails with createDataFrame(data, column_names): column order/name mismatch | https://github.com/eddiethedean/sparkless/issues/413 |
| #414 | Window functions: row_number().over(WindowSpec) fails with "over() got an unexpected keyword argument 'descending'" | https://github.com/eddiethedean/sparkless/issues/414 |
| #415 | DataFrame.orderBy(list of column names) treats list as single column | https://github.com/eddiethedean/sparkless/issues/415 |

Not reported (by design)

  • toPandas() requires pandas, installed either through pip install sparkless[pandas] or separately. PySpark also requires pandas for toPandas(), so this is an optional-dependency choice, not a behavioral parity bug.
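
Because pandas is optional in both libraries, code that calls toPandas() can guard on its availability first. The sketch below uses only the standard library; the helper names `pandas_available` and `to_pandas_or_none` are hypothetical, and returning None when pandas is missing is one possible choice, not something Sparkless or PySpark does.

```python
import importlib.util


def pandas_available() -> bool:
    """Return True if pandas can be imported in the current environment."""
    return importlib.util.find_spec("pandas") is not None


def to_pandas_or_none(df):
    """Call df.toPandas() only when pandas is installed; otherwise return None."""
    if not pandas_available():
        return None
    return df.toPandas()
```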

Summary

  • builder(): Use SparkSession.builder.appName(...).getOrCreate() (no parentheses on builder) when using Sparkless.
  • union(): Use unionByName() if you hit column order mismatch with createDataFrame(..., list_of_names).
  • Window + row_number(): Avoid until fixed upstream, or use a workaround (e.g. order in a separate step).
  • orderBy([list]): Use orderBy("a", "b") or orderBy(*["a", "b"]) instead of orderBy(["a", "b"]) in Sparkless.
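
The union() and orderBy() workarounds above can be wrapped in small helpers so call sites stay the same under PySpark and Sparkless. This is a minimal sketch: the helper names `order_by_names` and `union_by_name` are hypothetical, and the functions only assume that the objects passed in expose the orderBy and unionByName methods of the PySpark DataFrame API.

```python
from typing import Iterable


def order_by_names(df, names: Iterable[str]):
    """Sort by several columns, unpacking the names instead of passing a list."""
    return df.orderBy(*names)


def union_by_name(left, right):
    """Union two DataFrames by column name rather than by column position."""
    return left.unionByName(right)
```

For example, `order_by_names(df, ["a", "b"])` issues `df.orderBy("a", "b")`, which avoids the list-as-single-column behavior reported in #415.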