Getting Started (Python — Sparkless v4)¶

This guide is for the Sparkless v4 Python package: a PySpark-like API backed by the robin-sparkless Rust engine. Sparkless 3.x uses the Polars Python package; 4.x uses the Rust crate (no Polars Python at runtime).

Installation¶

Before adopting Sparkless, read the "Before you adopt" page for UDF limits, parity gaps, and production caveats.

From PyPI (recommended)¶

pip install "sparkless>=4,<5"

Requires Python 3.8+. Prebuilt wheels: Linux (glibc and musl), macOS (arm64 and x86_64), Windows (x86_64 and arm64). See Supported platforms in the package README.
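To confirm that an interpreter meets the version floor stated above, and to see whether the package is already installed, a small preflight check can help. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not part of Sparkless.

```python
import importlib.util
import platform
import sys

MIN_PYTHON = (3, 8)  # version floor stated in this guide


def check_environment(package="sparkless"):
    """Report basic facts about this interpreter (hypothetical helper)."""
    return {
        # True when the running interpreter is at least MIN_PYTHON
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        # True when the top-level package can be found on the import path
        "installed": importlib.util.find_spec(package) is not None,
        # OS and CPU architecture, useful when checking wheel availability
        "platform": f"{platform.system()} {platform.machine()}",
    }


if __name__ == "__main__":
    print(check_environment())
```

The check only looks the package up on the import path; it does not import it, so it is safe to run before installation.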

Optional extras:

pip install "sparkless[dev]"     # pytest, pandas, hypothesis, pytest-xdist
pip install "sparkless[pyspark]" # run tests with real PySpark (requires Java)

From source (contributors)¶

Clone robin-sparkless and install from the repo:

pip install ./python
# Or editable install (rebuilds native extension on changes):
cd python && maturin develop

See CONTRIBUTING.md for the full dev workflow.

Quick Start¶

Basic Example¶

from sparkless.sql import SparkSession, functions as F

# Create session
spark = SparkSession.builder.app_name("MyApp").get_or_create()

# Create DataFrame
data = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
]
df = spark.createDataFrame(data)

# Operations work like PySpark
result = df.filter(F.col("age") > 25).select("name")
print(result.collect())  # [Row(name='Bob')]

df.show()
spark.stop()

PySpark-style import swap¶

Sparkless v4 mirrors the PySpark API for tests and local workflows; it is not a full Spark cluster replacement (see Before you adopt):

# Before (PySpark)
from pyspark.sql import SparkSession

# After (Sparkless v4)
from sparkless.sql import SparkSession

Use SparkSession.builder.app_name("...").get_or_create() or SparkSession("AppName"); the rest of your PySpark-style code can stay the same.
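If a codebase needs to run against either backend, the import swap can be kept in one place. The following is a sketch under stated assumptions, not a Sparkless API: `resolve_sql_module` and `load_spark_session` are hypothetical helpers, and the `SPARK_BACKEND` environment variable is invented for this example.

```python
import importlib
import os


def resolve_sql_module(backend=None):
    """Map a backend name to the module that provides SparkSession."""
    # SPARK_BACKEND is a hypothetical variable invented for this sketch.
    backend = (backend or os.environ.get("SPARK_BACKEND", "sparkless")).lower()
    if backend not in ("sparkless", "pyspark"):
        raise ValueError(f"unknown backend: {backend!r}")
    return f"{backend}.sql"


def load_spark_session(backend=None):
    """Import SparkSession from the chosen backend (that package must be installed)."""
    module = importlib.import_module(resolve_sql_module(backend))
    return module.SparkSession
```

Application code would then call `load_spark_session()` once and use the returned class as usual, so the choice of backend lives in a single function rather than in every import statement.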

Core Features¶

DataFrame Operations¶

from sparkless.sql import SparkSession, functions as F

spark = SparkSession.builder.app_name("Example").get_or_create()
data = [
    {"name": "Alice", "dept": "Engineering", "salary": 80000},
    {"name": "Bob", "dept": "Sales", "salary": 75000},
    {"name": "Charlie", "dept": "Engineering", "salary": 90000},
]
df = spark.createDataFrame(data)

# Filter and select
high_earners = df.filter(F.col("salary") > 75000)
names = df.select("name", "dept")

# Aggregations
dept_avg = df.groupBy("dept").avg("salary")
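
When writing assertions about an aggregation like the one above, it helps to know the expected values in advance. The sketch below recomputes the per-department average in plain Python from the same rows. It does not use Sparkless, and it makes no claim about the column names or row order that `groupBy(...).avg(...)` returns; `average_by` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"name": "Alice", "dept": "Engineering", "salary": 80000},
    {"name": "Bob", "dept": "Sales", "salary": 75000},
    {"name": "Charlie", "dept": "Engineering", "salary": 90000},
]


def average_by(rows, key, value):
    """Group rows by `key` and average the `value` field of each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: mean(values) for group, values in groups.items()}


expected = average_by(rows, "dept", "salary")
print(expected)
```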

Window Functions¶

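from sparkless.sql import Window, functions as F

# Reuses df from the DataFrame Operations example above.
# Number the rows within each department by salary, highest first.
window_spec = Window.partitionBy("dept").orderBy(F.desc("salary"))
ranked = df.withColumn("rank", F.row_number().over(window_spec))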
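
To make the semantics of `row_number()` over a partitioned, ordered window concrete, here is the same logic sketched in plain Python. It is an illustration of the technique only, not Sparkless code; `row_number_by` is a hypothetical helper, and ties are numbered in whatever order the sort leaves them.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"name": "Alice", "dept": "Engineering", "salary": 80000},
    {"name": "Bob", "dept": "Sales", "salary": 75000},
    {"name": "Charlie", "dept": "Engineering", "salary": 90000},
]


def row_number_by(rows, partition_key, order_key, descending=True):
    """Number rows 1..n inside each partition, ordered by `order_key`."""
    ranked = []
    # groupby only merges adjacent items, so sort by the partition key first.
    by_partition = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(by_partition, key=itemgetter(partition_key)):
        members = sorted(group, key=itemgetter(order_key), reverse=descending)
        for position, row in enumerate(members, start=1):
            ranked.append({**row, "rank": position})
    return ranked


ranked_rows = row_number_by(rows, "dept", "salary")
ranks = {row["name"]: row["rank"] for row in ranked_rows}
print(ranks)
```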

SQL Queries¶

# Register df as a temporary view so SQL can refer to it by name.
df.createOrReplaceTempView("employees")
result = spark.sql("SELECT name, salary FROM employees WHERE salary > 80000")
result.show()

Lazy Evaluation¶

Transformations (filter, select, join, etc.) are queued; execution happens on actions (collect(), count(), show(), write), matching PySpark’s model.
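
The queue-then-execute model can be illustrated with a toy class in plain Python. This is a conceptual sketch only: `LazyRows` is hypothetical and says nothing about how Sparkless or the Rust engine is implemented. Transformations record a step and return a new plan; only `collect()` touches the data.

```python
class LazyRows:
    """Toy model of deferred execution (hypothetical, for illustration)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: record the step, touch no data.
        return LazyRows(self._rows, self._steps + (("filter", predicate),))

    def select(self, *names):
        # Transformation: record which fields to keep.
        return LazyRows(self._rows, self._steps + (("select", names),))

    def collect(self):
        # Action: replay the queued steps over the data, in order.
        rows = list(self._rows)
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [row for row in rows if arg(row)]
            else:
                rows = [{name: row[name] for name in arg} for row in rows]
        return rows


calls = []


def is_over_25(row):
    calls.append(row["name"])  # records when the predicate actually runs
    return row["age"] > 25


people = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
plan = LazyRows(people).filter(is_over_25).select("name")
assert calls == []       # building the plan executed nothing
result = plan.collect()  # the action triggers the queued work
print(result)
```

The `calls` list stays empty until `collect()` runs, which is the observable difference between a transformation and an action.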

Testing with Sparkless v4¶

The sparkless.testing module provides a unified framework for writing tests that run against both sparkless and PySpark backends.

Quick Setup¶

Add to your conftest.py:

pytest_plugins = ["sparkless.testing"]

Unit Test Example¶

def test_data_transformation(spark, spark_imports):
    """Test DataFrame logic against sparkless or PySpark."""
    F = spark_imports.F

    data = [{"value": 10}, {"value": 20}, {"value": 30}]
    df = spark.createDataFrame(data)

    result = df.filter(F.col("value") > 15)

    assert result.count() == 2
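    # Row order from collect() is not guaranteed without an explicit sort,
    # so compare the sorted values instead of indexing by position.
    values = sorted(row["value"] for row in result.collect())
    assert values == [20, 30]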

Running Tests¶

# Fast local tests (sparkless backend)
pytest tests -v

# Validate against PySpark (requires Java)
SPARKLESS_TEST_MODE=pyspark pytest tests -v

Key Features¶

Fixtures: spark, spark_mode, spark_imports, isolated_session, table_prefix
Markers: @pytest.mark.sparkless_only, @pytest.mark.pyspark_only
Comparison utilities: assert_dataframes_equal(), compare_dataframes()

For the complete guide, see Testing Guide.

Performance¶

Sparkless v4 uses the Rust engine (Polars in Rust). There is no JVM and no Polars Python dependency at runtime.

Operation	PySpark	Sparkless v4
Session creation	30–45s	< 1s
Simple query	2–5s	< 0.1s
Full test suite	5–10 min	1–2 min
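
Timings like these depend on hardware and workload, so it is worth measuring on your own machine. Below is a minimal timing sketch using only the standard library; `time_call` is a hypothetical helper, and the commented-out usage assumes Sparkless is installed.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Example with Sparkless (uncomment if the package is installed):
# from sparkless.sql import SparkSession
# create = lambda: SparkSession.builder.app_name("Bench").get_or_create()
# print(f"session creation: {time_call(create, repeats=1):.3f}s")

# Stand-in workload so the sketch runs anywhere.
print(f"stand-in workload: {time_call(lambda: sum(range(100_000))):.6f}s")
```

Taking the best of several runs reduces noise from other processes. For session creation, a single run is the meaningful number, because `get_or_create()` may return an existing session on later calls.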

Next Steps¶

Testing Guide — Full guide to sparkless.testing (dual-mode testing, fixtures, comparison utilities)
Package README — Installation, Sparkless 3 vs 4.x, API overview, backend
PySpark differences — Known divergences and caveats
Migration (PySpark / Sparkless 3) — Switching from PySpark or Sparkless 3.x
Parity status — Coverage and fixture status

Getting Help¶

Repository: github.com/eddiethedean/robin-sparkless
Issues: github.com/eddiethedean/robin-sparkless/issues
Sparkless 3.x (Polars Python): github.com/eddiethedean/sparkless