PySpark Version Notes (Python Tests in PySpark Mode)¶
When running Python tests with MOCK_SPARK_TEST_BACKEND=pyspark, we use PySpark 3.5.x and delta-spark 3.x (see tests/requirements-pyspark.txt). This file clarifies which test failures are due to PySpark version vs API differences (Sparkless extensions vs PySpark).
Sparkless internal APIs (not in PySpark; prefixed with _)¶
These are available as internal APIs (e.g. for tests or migration) but are not part of the PySpark-aligned public API:
| Feature | Sparkless (internal) | PySpark |
|---|---|---|
Pivot then .collect_list(col) etc. |
PivotedGroupedData._collect_list(), _collect_set(), _first(), _last(), _stddev(), _variance(), _count_distinct() |
❌ Use .agg(F.collect_list(col)) etc. |
ArrayType _element_type |
Property and __init__ kwarg _element_type in sql/types.py |
❌ Use elementType (camelCase) only |
| array() with mixed column types | Polars may coerce | ❌ Raises; same type required |
So the failing tests in PySpark mode are exercising Sparkless-only APIs or semantics. Sparkless implements them; PySpark does not (or behaves differently).
Current setup¶
- pyspark:
>=3.5,<3.6(3.5.x) - delta-spark:
>=3.0,<4(compatible with Spark 3.5) - Python: 3.8+ (venv may be 3.9). PySpark 4.x requires Python ≥3.10 for 4.1.x.
Failures that are not fixed by upgrading PySpark¶
These are API/behavior differences; PySpark (including 4.1.1) does not add these.
-
Pivot after
groupBy().pivot()
In PySpark,GroupedData(including after.pivot()) only has:agg,avg,count,max,min,sum.
There are no methods:.collect_list(),.collect_set(),.first(),.last(),.stddev(),.variance(),.count_distinct().
To match PySpark, use.agg(F.collect_list("col"))(and similarly for other aggs) instead of.collect_list("col").
The public API is PySpark-aligned (no such methods). Use.agg(F.collect_list("col"))etc. Internal helpers are available as._collect_list(),._first(), etc. -
array()with mixed types
Spark SQL’sarray()requires all elements to be the same type (or cast to a common type). PySpark raisesAnalysisExceptionfor mixed types in 3.5 and 4.x. This is by design, not an older-version limitation. -
ArrayTypeattribute name
PySpark useselementType(camelCase). Sparkless exposes that; internal_element_typeis also available.
Upgrading to PySpark 4.x¶
- Latest on PyPI: 4.1.1 (requires Python ≥3.10).
- delta-spark: For Spark 4 use delta-spark 4.x; for Spark 3.5 keep delta-spark 3.x.
- Upgrading to 4.x may fix or change a small number of behaviors (e.g. error messages, edge cases) but will not add the GroupedData pivot helpers or mixed-type
array()above. - To try PySpark 4.x: use a Python 3.10+ env and
pyspark>=4.0,<4.2anddelta-spark>=4.0(if you need Delta with Spark 4).
Summary¶
Most of the “missing” behavior in PySpark mode is due to tests targeting Sparkless-only APIs or semantics, not an older PySpark. Aligning tests with PySpark means using .agg(F.collect_list(...)) (and similar) for pivot, same-type array() only, and backend-agnostic type attribute names; upgrading PySpark alone will not fix those.
PySpark 4 roadmap¶
Sparkless 4.9.0 implements the opt-in PySpark 4 profile (compat=4.0). See PYSPARK_COMPAT_PROFILES.md and PYSPARK_4_PARITY_PLAN.md.
Dual-oracle setup (4.9.0+)¶
- Sparkless backend (default CI):
SPARKLESS_TEST_MODE=sparkless(or unset),SPARKLESS_PYSPARK_COMPAT=3.5(default). - PySpark 3.5 oracle:
tests/requirements-pyspark.txt,SPARKLESS_TEST_MODE=pyspark. - PySpark 4.1 oracle (nightly):
tests/requirements-pyspark4.txt, Java 17, Python 3.10+,SPARKLESS_TEST_MODE=pyspark,SPARKLESS_PYSPARK_COMPAT=4.0.
Run profile-specific tests with @pytest.mark.pyspark4_only or @pytest.mark.pyspark3_only.