Migration Guide (Python — Sparkless v4)¶
This page helps you switch to Sparkless v4 (Rust backend) from PySpark or from Sparkless 3.x (Polars Python backend).
Sparkless 3.x → v4¶
| Aspect | Sparkless 3.x | Sparkless v4 |
|---|---|---|
| Backend | Polars Python package | Rust crate (robin-sparkless) |
| Install | pip install sparkless | pip install ./python (this repo) |
| Import | from sparkless.sql import SparkSession | Same |
| API | PySpark-like | Same PySpark-like API |
| Runtime | Polars + Python | Native extension + Rust; no Polars Python |
What to do:
- Install the v4 package from the robin-sparkless repo: pip install ./python (or from a built wheel).
- Keep your existing imports (from sparkless.sql import SparkSession, functions as F) and DataFrame code.
- Keep using SparkSession.builder.app_name("...").get_or_create() if you already do; v4 supports the same builder API.
- See PySpark differences for any behavioral differences from PySpark (and thus from Sparkless 3.x where it matches PySpark).
No code changes are required for typical tests and pipelines; the main difference is the execution engine (Rust instead of Polars Python).
Targeting PySpark 4 (opt-in, 4.9.0+)¶
Sparkless 4.9.0 ships PySpark 4 semantics as an opt-in profile. The default remains PySpark 3.5-like (compat=3.5, ANSI off).
from sparkless.sql import SparkSession
spark = SparkSession.builder.app_name("app").get_or_create()
spark.conf.set("sparkless.pyspark.compat", "4.0") # enables ANSI + 4.0 map/collect rules
Or set the environment variable before session creation:
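A minimal sketch of the environment-variable route, written in Python so the ordering is explicit. The variable name SPARKLESS_PYSPARK_COMPAT is an assumption borrowed from the dual-oracle test table on this page, and make_session is a hypothetical helper; check PYSPARK_COMPAT_PROFILES.md for the authoritative key.

```python
import os

# Assumption: SPARKLESS_PYSPARK_COMPAT is the variable read at session creation.
# The name is taken from the dual-oracle test table, not from a confirmed API.
os.environ["SPARKLESS_PYSPARK_COMPAT"] = "4.0"


def make_session(app_name: str = "app"):
    """Hypothetical helper: build the session only after the variable is set."""
    # Imported lazily so the environment is configured before the package loads.
    from sparkless.sql import SparkSession

    return SparkSession.builder.app_name(app_name).get_or_create()
```

Setting the variable in the shell that launches Python has the same effect, provided it happens before the first session is created.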
See PYSPARK_COMPAT_PROFILES.md for bundled config keys and PYSPARK_4_PARITY_PLAN.md for the full roadmap.
Dual-oracle testing (PySpark 3.5 + 4.1)¶
| Oracle | Requirements | Env |
|---|---|---|
| PySpark 3.5 (default CI) | tests/requirements-pyspark.txt | SPARKLESS_TEST_MODE=pyspark |
| PySpark 4.1 (nightly) | tests/requirements-pyspark4.txt, Java 17, Python 3.10+ | SPARKLESS_TEST_MODE=pyspark + SPARKLESS_PYSPARK_COMPAT=4.0 |
Use @pytest.mark.pyspark4_only or @pytest.mark.pyspark3_only to mark profile-specific tests. The spark fixture applies sparkless.pyspark.compat automatically when running against the sparkless backend.
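To illustrate how profile markers like these can gate tests, here is a hypothetical conftest.py sketch. The marker names and the SPARKLESS_PYSPARK_COMPAT variable come from this page; skip_reason and the skip-on-mismatch behavior are assumptions, and the repo's own conftest may work differently.

```python
import os
from typing import Iterable, Optional


def skip_reason(markers: Iterable[str], compat: str) -> Optional[str]:
    """Return a skip reason when a test's markers exclude the active profile."""
    names = set(markers)
    is_v4 = compat.startswith("4")
    if "pyspark4_only" in names and not is_v4:
        return f"requires the PySpark 4 profile (compat={compat})"
    if "pyspark3_only" in names and is_v4:
        return f"requires the PySpark 3.5 profile (compat={compat})"
    return None


def pytest_collection_modifyitems(config, items):
    """pytest hook: skip tests whose profile marker does not match."""
    import pytest  # imported here so skip_reason stays usable without pytest

    # Default mirrors the documented default profile (compat=3.5).
    compat = os.environ.get("SPARKLESS_PYSPARK_COMPAT", "3.5")
    for item in items:
        reason = skip_reason((m.name for m in item.iter_markers()), compat)
        if reason:
            item.add_marker(pytest.mark.skip(reason=reason))
```

Keeping the decision in a plain function makes the rule easy to unit-test apart from pytest collection.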
PySpark → Sparkless v4¶
Quick swap¶
# Before (PySpark)
from pyspark.sql import SparkSession, functions as F
# After (Sparkless v4)
from sparkless.sql import SparkSession, functions as F
Use SparkSession.builder.app_name("MyApp").get_or_create() or SparkSession("MyApp"). Your DataFrame operations, F.col, filter, select, groupBy, join, and SQL (spark.sql, createOrReplaceTempView) work the same way.
Common patterns¶
- Session: SparkSession.builder.app_name("App").get_or_create() (same as PySpark).
- DataFrames: createDataFrame(data), createDataFrame(data, schema), read_csv, read_parquet, read_json, read_delta (when enabled).
- Expressions: F.col("x"), F.lit(...), F.when(...).otherwise(...), and built-in functions in sparkless.sql.functions.
- SQL: df.createOrReplaceTempView("name"), spark.sql("SELECT ...") (when the sql feature is enabled in the Rust build).
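The patterns above can be combined into a small backend-agnostic function. This is a sketch, not code from the repo: total_salary_by_dept and its column names are hypothetical, and it assumes groupBy, agg, F.sum, alias, orderBy, and collect behave in Sparkless v4 as they do in PySpark.

```python
def total_salary_by_dept(spark, F, rows):
    """Sum salary per department.

    `spark` is a SparkSession and `F` the functions module, both taken from
    whichever backend the caller imported (pyspark.sql or sparkless.sql).
    """
    # Schema given as a list of column names, as PySpark allows.
    df = spark.createDataFrame(rows, ["dept", "salary"])
    return (
        df.groupBy("dept")
        .agg(F.sum("salary").alias("total"))
        .orderBy("dept")
        .collect()
    )
```

Call it with a session from SparkSession.builder.app_name("App").get_or_create() and F imported from sparkless.sql. Because the backend objects are passed in, the same function can be exercised under PySpark without edits.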
What’s different or unsupported¶
See PySpark differences for the full list. Summary:
- RDD, streaming, mapInPandas: Not supported; use collect() or local patterns.
- Some SQL/DDL: Only a subset of statements is supported (SELECT, temp views, saved tables, etc.); see PYSPARK_DIFFERENCES.
- Join on expression: Use column-name joins; expression joins (e.g. df1.a == df2.b in the join API) are not supported.
- UDFs: Python callable UDFs are not yet exposed; use built-in functions. Rust UDFs can be registered in the engine.
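When the key columns have different names, one way to stay within column-name joins is to rename one side first. The helper below is a hypothetical sketch; it assumes withColumnRenamed and join(on=..., how=...) are available in Sparkless v4 with PySpark's signatures, which this page does not confirm.

```python
def join_on_renamed(left, right, left_key, right_key, how="inner"):
    """Join two DataFrames whose key columns have different names.

    Renames right_key to left_key so the join can use a plain column name
    instead of an expression such as left.a == right.b.
    """
    renamed = right.withColumnRenamed(right_key, left_key)
    return left.join(renamed, on=left_key, how=how)
```

For example, join_on_renamed(df1, df2, "a", "b") stands in for an expression join on df1.a == df2.b. If the right side already has a column named like left_key, rename or drop it first to avoid a clash.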
Testing both backends¶
You can run the same test suite against PySpark or Sparkless v4:
# Sparkless v4 (default in this repo)
pytest tests -v
# Real PySpark (requires Java + pyspark)
SPARKLESS_TEST_BACKEND=pyspark pytest tests -v
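A test helper can pick its imports from the same variable. This is a sketch under assumptions: backend_name and load_backend are hypothetical, and the "sparkless" default label is illustrative. The dual-oracle table on this page uses SPARKLESS_TEST_MODE rather than SPARKLESS_TEST_BACKEND, so check which name your checkout actually reads.

```python
import os


def backend_name(env=None) -> str:
    """Return "pyspark" or "sparkless" based on SPARKLESS_TEST_BACKEND."""
    env = os.environ if env is None else env
    # Assumption: any value other than "pyspark" means the Sparkless backend.
    value = env.get("SPARKLESS_TEST_BACKEND", "sparkless")
    return "pyspark" if value == "pyspark" else "sparkless"


def load_backend(env=None):
    """Return (SparkSession, functions) for the selected backend."""
    # Imports are deferred so only the chosen package needs to be installed.
    if backend_name(env) == "pyspark":
        from pyspark.sql import SparkSession, functions
    else:
        from sparkless.sql import SparkSession, functions
    return SparkSession, functions
```

Tests can then call SparkSession, F = load_backend() once in a fixture and stay free of backend-specific imports.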
See also¶
- Getting started (Python)
- PySpark differences
- Package README — Sparkless 3 vs 4.x, installation, API overview