
PySpark 4 Parity Plan (with PySpark 3 Backwards Compatibility)

Status: Living plan · Last updated: May 2026 (breaking-changes research incorporated)
Audience: Maintainers of sparkless (Python) and robin-sparkless (Rust engine)
Related: FULL_PARITY_ROADMAP.md, PARITY_STATUS.md, PYSPARK_DIFFERENCES.md, PYSPARK_VERSION_NOTES.md, DEFERRED_SCOPE.md, python_migration.md


1. Goals

| Goal | Definition of done |
| --- | --- |
| PySpark 4 parity | For in-scope APIs, Sparkless behavior matches PySpark 4.0–4.2 (latest stable 4.x) under documented configuration, with fixture and issue-test coverage. |
| PySpark 3 backwards compatibility | Code written for PySpark 3.2–3.5 continues to work on Sparkless without changes, except for documented intentional differences and opt-in PySpark 4 semantics. |
| Sustainable maintenance | One test suite, two oracle modes (3.x and 4.x), automated API/signature diffing, and explicit compatibility tiers in docs and CI. |

Non-goals (unchanged; see DEFERRED_SCOPE.md): RDD, distributed execution, full Structured Streaming, Mesos/K8s cluster semantics, full Hive metastore DDL, XML/XPath (unless promoted), sketch/HLL aggregates, UDTF.

Naming note: The Python package Sparkless v4 (pip install "sparkless>=4") refers to the Rust-backed product version, not Apache PySpark 4. This plan is about the Apache PySpark 4.x API and behavior.


2. Current baseline (May 2026)

| Dimension | Today |
| --- | --- |
| Parity oracle (fixtures) | PySpark 3.5.x (tests/gen_pyspark_cases.py, tests/requirements-pyspark.txt) |
| Parity fixtures | 212 JSON fixtures passing (PARITY_STATUS.md) |
| Dual-mode pytest | sparkless (default) vs SPARKLESS_TEST_BACKEND=pyspark |
| Version matrix (Docker) | Python 3.9–3.13 × PySpark 3.2, 3.3, 3.4, 3.5 (tests/compatibility_matrix/) |
| API gap vs PySpark repo | ~120 functions/methods still partial or missing (FULL_PARITY_ROADMAP.md, ROBIN_SPARKLESS_MISSING.md) |
| Signature alignment | Largely complete for 3.5 (SIGNATURE_GAP_ANALYSIS.md); re-baseline needed for 4.x |
| Documented divergences | PYSPARK_DIFFERENCES.md |

PySpark 4.x is not yet a CI oracle or matrix column. Closing API gaps from the 3.5-era roadmap remains prerequisite work shared by both targets.


3. Compatibility model

3.1 Two axes

  1. API surface — names, signatures, presence of methods (createDataFrame, read.jdbc, F.try_divide, etc.).
  2. Runtime semantics — null handling, overflow, cast rules, type coercion, collect() shapes, map key normalization.

PySpark 4 changes both. Sparkless must separate “works on 3.x code” from “matches 4.x semantics.”

3.2 Compatibility tiers (target policy)

| Tier | PySpark versions | Sparkless default | Use case |
| --- | --- | --- | --- |
| A — Legacy 3.x | 3.2–3.5 | Yes (initial default) | Existing tests and apps migrated from PySpark 3 / Sparkless 3.x |
| B — Transitional | 3.5 + 4.x with explicit config | Opt-in | Teams validating Spark 4 before cutover |
| C — PySpark 4 | 4.0–4.2 | Future default (major release) | New projects targeting Spark 4 |

Principle: Do not silently change Tier A behavior when implementing Tier C. Use configuration profiles, not breaking changes in patch releases.

3.3 Proposed configuration surface

Introduce a single session-level profile (names are illustrative; finalize in implementation):

# Tier A (default): PySpark 3.5-like semantics where Sparkless differs from 4.0
spark.conf.set("sparkless.pyspark.compat", "3.5")

# Tier B/C: PySpark 4.0+ semantics for ANSI, map keys, collect intervals, etc.
spark.conf.set("sparkless.pyspark.compat", "4.0")

Mirror important Spark 4 configs when they affect local engine behavior:

| Config | PySpark 3.5 typical | PySpark 4.0 default | Sparkless action |
| --- | --- | --- | --- |
| spark.sql.ansi.enabled | false | true | Honor in expression engine when compat=4.0; default false for compat=3.5 |
| spark.sql.storeAssignmentPolicy | ANSI (3.x) | ANSI | Document; align insert/cast strictness per profile |
| spark.sql.legacy.* | various | various | Map only flags that affect local SQL/expressions; document no-ops for cluster-only flags |
| spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled | N/A (4.0+) | false (true restores 3.x inference) | Implement map schema inference policy for create_map / map_from_arrays / PySpark map columns |
| PYSPARK_YM_INTERVAL_LEGACY | N/A | env 1 restores 3.x collect | Match collect() interval shapes per profile |
| spark.sql.legacy.disableMapKeyNormalization | N/A (4.0+) | false (normalize -0.0) | Map key behavior when compat=4.0 |
| spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled | 3.4+ | true restores first-element only | Array schema inference (3.4 change; still relevant) |
| SPARK_ANSI_SQL_MODE / SPARK_SQL_LEGACY_CREATE_HIVE_TABLE | env mirrors | env mirrors | Document for users mirroring cluster configs |

Environment override for CI: SPARKLESS_PYSPARK_COMPAT=3.5|4.0.

When sparkless.pyspark.compat is set, Sparkless should apply the bundle of defaults above (not only spark.sql.ansi.enabled) so one knob matches user expectations.
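
A minimal sketch of how one knob could expand into a bundle of defaults. The names `COMPAT_BUNDLES`, `resolve_compat`, and `effective_conf` are hypothetical, the bundle covers only flags listed in the table above, and two choices are assumptions rather than settled design: compat=3.5 is emulated by setting the 4.0 legacy flags, and an explicit session value wins over the environment variable.

```python
import os

# Hypothetical default bundles per profile; values are assumptions
# derived from the config table, not a finalized design.
COMPAT_BUNDLES = {
    "3.5": {
        "spark.sql.ansi.enabled": "false",
        "spark.sql.legacy.disableMapKeyNormalization": "true",
        "spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled": "true",
    },
    "4.0": {
        "spark.sql.ansi.enabled": "true",
        "spark.sql.legacy.disableMapKeyNormalization": "false",
        "spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled": "false",
    },
}


def resolve_compat(session_value=None, environ=None):
    """Pick a profile: session setting, then env override, then Tier A."""
    environ = os.environ if environ is None else environ
    profile = session_value or environ.get("SPARKLESS_PYSPARK_COMPAT") or "3.5"
    if profile not in COMPAT_BUNDLES:
        raise ValueError(f"unsupported compat profile: {profile!r}")
    return profile


def effective_conf(profile, user_conf=None):
    """Start from the profile bundle; explicit user settings override it."""
    merged = dict(COMPAT_BUNDLES[profile])
    merged.update(user_conf or {})
    return merged
```

With this shape, a user on compat=4.0 who sets spark.sql.ansi.enabled=false explicitly keeps the rest of the 4.0 bundle while opting out of ANSI only.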


4. Research sources (authoritative)

| Source | URL | Use in this plan |
| --- | --- | --- |
| Upgrading PySpark | migration_guide/pyspark_upgrade | Python API removals, pandas-on-Spark, collect/import changes (3.5→4.0, 4.0→4.1) |
| Spark SQL migration guide | sql-migration-guide.md | SQL/expression semantics, JDBC, functions (3.5→4.0, 4.0→4.1) |
| ANSI compliance | sql-ref-ansi-compliance | Default-on ANSI: arithmetic, cast, division, try_* guidance |
| Spark 4.0.0 release notes | spark-release-4-0-0 | VARIANT, SQL UDFs, new DataFrame APIs, dependency baselines |

5. Breaking changes catalog (PySpark 3 → 4)

Each row is tagged for Sparkless planning:

  • Relevance: HIGH = must implement or emulate per compat profile · MED = implement if feature exists · LOW = document / no-op · N/A = cluster/JVM only
  • Tier: which compat profile first targets the 4.x behavior (4.0 / 4.1 / both)

5.1 Spark SQL semantics (3.5 → 4.0) — HIGH priority for engine

These come from the Spark SQL 3.5 → 4.0 migration guide and ANSI compliance.

| Change | PySpark 3.5 | PySpark 4.0 | Legacy restore | Sparkless |
| --- | --- | --- | --- | --- |
| ANSI mode default | spark.sql.ansi.enabled=false | true | Set config or SPARK_ANSI_SQL_MODE=false | HIGH · Phase 4A · overflow/cast/divide throw vs null |
| Arithmetic overflow | Wraps (e.g. INT_MAX+1 → negative) | Exception (or use try_add, etc.) | ansi.enabled=false | HIGH · same |
| Invalid CAST | Often NULL | Exception (or try_cast) | ansi.enabled=false | HIGH · str→int, numeric narrowing |
| Division by zero | NULL in many paths | Exception (or try_divide) | ansi.enabled=false | HIGH |
| Store assignment | spark.sql.storeAssignmentPolicy=ANSI (3.x) | Still ANSI default | legacy / strict policies | MED · insert/cast strictness if we add INSERT |
| Map key -0.0 | No normalization | Normalize to 0.0 in create_map, map_from_arrays, map_from_entries, map_concat | spark.sql.legacy.disableMapKeyNormalization=true | HIGH · Phase 4B |
| Map schema inference | First non-null pair (PySpark) | Merge all pairs | spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled=true | HIGH · Phase 4B |
| Array schema inference (3.4+, still relevant) | First element | Merge all elements | spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled=true | MED · createDataFrame / arrays |
| encode / decode charsets | JDK charsets | Only US-ASCII, ISO-8859-1, UTF-8/16/32 variants | spark.sql.legacy.javaCharsets=true | MED · we implement UTF-8/hex; audit others |
| encode / decode errors | Mojibake replacement | MALFORMED_CHARACTER_CODING | spark.sql.legacy.codingErrorAction=true | MED |
| format_string indexes | 0$ allowed historically | 1-based only (1$, 2$, …) | spark.sql.legacy.allowZeroIndexInFormatString (deprecated) | MED · already partially aligned |
| array_insert negative index | Legacy index rules | 1-based; -1 inserts at end (3.5+) | spark.sql.legacy.negativeIndexInArrayInsert=true | LOW · verify current behavior |
| Timestamp → int cast overflow | Wrapping | NULL (non-ANSI path) | | MED · cast path |
| Time parser policy default | EXCEPTION | CORRECTED | | MED · to_timestamp / parsing |
| CTE precedence | EXCEPTION on conflict | CORRECTED (inner wins) | spark.sql.legacy.ctePrecedencePolicy=EXCEPTION | LOW · SQL parser |
| SQL ! as NOT | Bug allowed ! IN, ! BETWEEN | Syntax error | spark.sql.legacy.bangEqualsNot=true | LOW · SQL only |
| sentences() locale | Locale.US when country null | Locale(language) | | LOW · if implemented |
| Parquet/ORC defaults | ORC snappy | ORC zstd | spark.sql.orc.compression.codec=snappy | N/A · write paths |
| Datetime rebasing config names | spark.sql.legacy.parquet.* | Renamed to spark.sql.parquet.* | Use new names | LOW · document for users |
| CREATE TABLE default provider | Hive | spark.sql.sources.default | spark.sql.legacy.createHiveTableByDefault=true | LOW · SQL DDL subset |

ANSI try_* functions (many added in 3.5, behavior clarified under ANSI-on in 4.x): try_add, try_subtract, try_multiply, try_divide, try_cast, try_to_timestamp, try_to_number, try_aes_decrypt, etc. Under ANSI-on, these return NULL on failure instead of throwing. Sparkless should implement these as null-propagating variants regardless of profile, and ensure they match 4.x when compat=4.0 and ansi.enabled=true. See try_divide.
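
The three behaviors in this paragraph (legacy wrap or NULL, ANSI exception, try_* NULL) can be sketched in plain Python for 32-bit integer addition and for division. This is an illustration under stated assumptions, not the engine implementation: `ArithmeticOverflow`, `int_add`, and `divide` are hypothetical names, and the real work belongs in the Rust expression layer.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


class ArithmeticOverflow(ArithmeticError):
    """Stand-in for the ANSI-mode overflow error (hypothetical name)."""


def _wrap_int32(value):
    # Two's-complement wraparound, as in the non-ANSI path.
    return (value - INT_MIN) % 2**32 + INT_MIN


def int_add(a, b, ansi_enabled):
    """32-bit add: NULL in, NULL out; overflow wraps or raises by mode."""
    if a is None or b is None:
        return None
    result = a + b
    if INT_MIN <= result <= INT_MAX:
        return result
    if ansi_enabled:
        raise ArithmeticOverflow(f"integer overflow: {a} + {b}")
    return _wrap_int32(result)


def try_add(a, b):
    """Null-propagating variant: overflow yields NULL in every profile."""
    if a is None or b is None:
        return None
    result = a + b
    return result if INT_MIN <= result <= INT_MAX else None


def divide(a, b, ansi_enabled):
    """Division: zero divisor is NULL when ANSI is off, an error when on."""
    if a is None or b is None:
        return None
    if b == 0:
        if ansi_enabled:
            raise ZeroDivisionError("division by zero")
        return None
    return a / b


def try_divide(a, b):
    """Null-propagating variant: zero divisor yields NULL in every profile."""
    if a is None or b is None or b == 0:
        return None
    return a / b
```

Keeping try_add and try_divide independent of the ANSI flag mirrors the "regardless of profile" requirement: only the plain operators change behavior between compat=3.5 and compat=4.0.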

Function behavior changes (4.0 highlights from release notes):

| Function / area | Change | Sparkless |
| --- | --- | --- |
| mode() | New deterministic arg (4.0) | MED · signature + behavior |
| array_insert() | 1-based negative indexes | LOW |
| encode/decode | Stricter charset + errors | MED |
| to_csv | Arrays/maps/binary as pretty strings | LOW |
| parse_json / VARIANT | New in Spark 4.0 | Defer VARIANT; evaluate parse_json |
| listagg, mergeInto, groupingSets, lateral joins | New DataFrame/SQL APIs | MED · per ROBIN_SPARKLESS_MISSING |

5.2 PySpark Python API (3.5 → 4.0)

From Upgrading PySpark — 3.5 to 4.0.

Environment and dependencies

| Change | Detail | Sparkless |
| --- | --- | --- |
| Python 3.8 dropped | 4.0+ requires newer Python | LOW · document; CI 3.10+ for PySpark 4 oracle |
| Pandas ≥ 2.0 | Was ≥ 1.0.5 | LOW · toPandas tests only |
| NumPy ≥ 1.21 | Was ≥ 1.15 | LOW |
| PyArrow ≥ 11.0 | Was ≥ 4.0.0 | MED · createDataFrame(pyarrow.Table) target |
| JDK for PySpark install | Eliminated in 4.0 for pip install (cluster still needs Java) | N/A · Sparkless has no JVM |

pyspark.sql / DataFrame (in scope)

| Change | Detail | Sparkless |
| --- | --- | --- |
| from pyspark.sql.functions import * | No longer exports DataFrame, Column, StructType, etc. | LOW · we already use explicit imports in docs |
| createDataFrame(pyarrow.Table) | Supported in 4.0 | HIGH · CREATEDATAFRAME_GAPS |
| Map schema inference | Merge all dict pairs (see §5.1) | HIGH |
| collect() + YearMonthIntervalType | No longer raw integers | PYSPARK_YM_INTERVAL_LEGACY=1 |
| dropDuplicates / dropDuplicatesWithinWatermark | Accept var-args columns | MED · signature |
| DataFrame.mergeInto | New in 4.0 | MED · defer or stub |
| groupingSets | DataFrame API | MED |
| Lateral join DataFrame API | New | LOW · defer |
| parse_json column API | New | MED |
| VariantVal / VARIANT in UDFs | New | Defer |
| DataType.fromDDL | New | MED · overlaps DDL parser work |
| Parameterized spark.sql(..., args=) | 3.4+ named params | LOW · SQL subset |
| Column accepts Python Enum | 4.0 | LOW |
| Bare literals in Column & / \| | 4.0 | LOW |

Pandas API on Spark (pyspark.pandas) — out of scope

Large removal/rename surface (Koalas removal, to_koalas → pandas_api, iteritems → items, many parameter renames, assertPandasOnSparkEqual removed, etc.). Sparkless does not implement pyspark.pandas. Document that users migrating only classic pyspark.sql are unaffected; pandas-on-Spark migrations are a separate project.

Notable 4.0 pandas-on-Spark interaction: raises if Spark runs with ANSI on unless compute.fail_on_ansi_mode=False or ANSI disabled — irrelevant to Sparkless unless we add pandas API later.

5.3 PySpark / Spark SQL (4.0 → 4.1)

From Upgrading PySpark — 4.0 to 4.1 and the SQL migration guide section that follows “3.5 to 4.0” (4.0 to 4.1).

| Change | Detail | Sparkless / legacy restore |
| --- | --- | --- |
| Python 3.9 dropped | 4.1+ | LOW · matrix / CI |
| PyArrow ≥ 15.0 | Was 11.0 | MED · Arrow createDataFrame tests |
| Pandas ≥ 2.2 | Was 2.0 | LOW |
| Spark Connect: DataFrame['col'] | No longer eagerly validates name | PYSPARK_VALIDATE_COLUMN_NAME_LEGACY=1 |
| Arrow UDF: UDT support | Was fallback | spark.sql.execution.pythonUDF.arrow.legacy.fallbackOnUDT |
| Arrow/pandas conversion removed | Type coercion changes | legacy pandas conversion configs |
| BinaryType → Python bytes | Default in 4.1 (was bytearray in many paths) | spark.sql.execution.pyspark.binaryAsBytes=false |
| convertToArrowArraySafely default true | Overflow/truncation errors in Arrow | set to false to restore |
| pandas-on-Spark ANSI | compute.ansi_mode_support=True default | |
| Parquet struct nullness (SQL 4.1) | No longer assume all-null struct if fields missing | spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing=true |
| Thrift Server column ordinal (4.1) | 1-based ORDINAL fix | legacy hive thrift config |
| Log4j 1 → 2 (Spark 4.1) | Cluster logging | N/A |

5.4 JDBC type mapping (3.5 → 4.0) — MED for JDBC feature

If Sparkless JDBC is in use, reproduce per datasource with profile compat=4.0 and legacy flags:

| Datasource | Change (summary) | Legacy flag |
| --- | --- | --- |
| PostgreSQL | Read/write TIMESTAMP WITH TIME ZONE vs NTZ rules | spark.sql.legacy.postgres.datetimeMapping.enabled |
| MySQL | TIMESTAMP→TimestampType; SMALLINT→Short; FLOAT→Float; BIT(n>1)→Binary; write Short as SMALLINT; NTZ write as DATETIME | spark.sql.legacy.mysql.* |
| Oracle | Timestamp write as TIMESTAMP WITH LOCAL TIME ZONE | spark.sql.legacy.oracle.timestampMapping.enabled |
| SQL Server | TINYINT→Short; DATETIMEOFFSET→Timestamp | spark.sql.legacy.mssqlserver.* |
| DB2 | SMALLINT→Short; Boolean write as BOOLEAN | spark.sql.legacy.db2.* |

Phase: add JDBC regression pack tagged pyspark4_only + compat=4.0 (extends existing testcontainers tests).

5.5 Cluster / runtime changes — document only (N/A for engine)

| Change | Note |
| --- | --- |
| Java 17+ default (Spark 4.0) | PySpark 4 oracle CI should use Java 17 |
| Scala 2.13 only | N/A |
| Mesos removed | N/A |
| ANSI default | Emulated via config (§3.3) |
| Hive < 2.0 dropped | N/A |
| Structured Streaming Trigger.Once deprecated | Stubs remain |
| Spark Connect / pyspark-client | Separate distribution; optional future testing |

5.6 Spark 4.0 new capabilities (parity backlog, not all “breaking”)

From Spark 4.0.0 release notes — track in ROBIN_SPARKLESS_MISSING:

| Feature | Parity priority |
| --- | --- |
| VARIANT type + semi-structured SQL | Defer / stub unless Polars path exists |
| SQL user-defined functions | Partial (session UDFs exist); align SQL CREATE FUNCTION |
| Session variables, pipe syntax, collations | LOW / defer for local SQL |
| Built-in XML datasource | Defer (DEFERRED_SCOPE) |
| parse_json, to_json enhancements | MED |
| DataFrame.mergeInto, writeTo / DSv2 | MED / stub |
| Python Data Source API | N/A (cluster) |
| Python UDTF | Defer |
| PySpark Plotting API | N/A |
| applyInArrow on groupBy/cogroup | Defer |
| Time travel on df.read | LOW |

5.7 Summary matrix — what to implement first

| Priority | Items |
| --- | --- |
| P0 (blocking 4.0 semantics) | ANSI default bundle, map keys + inference, interval collect, try_* under ANSI-on |
| P1 (common API gaps) | PyArrow createDataFrame, JDBC 4.0 mappings, encode/decode strictness |
| P2 (4.0 API additions) | mergeInto, groupingSets, parse_json, mode(deterministic=), var-args dropDuplicates |
| P3 (4.1 polish) | BinaryType → bytes, Arrow safe conversion defaults |
| Defer | VARIANT, pyspark.pandas, streaming, RDD, XML datasource, Python Data Source |

6. PySpark 3 vs 4 — Sparkless impact (condensed)

Section §5 is the researched catalog; this table is the maintainer quick view.

| Area | PySpark 3.x | PySpark 4.x | Sparkless phase |
| --- | --- | --- | --- |
| ANSI SQL | Off by default | On by default | 4A |
| Map keys / inference | Legacy | Normalized keys; merged schemas | 4B |
| collect() intervals | Legacy integers | New types / env legacy | 4B |
| try_* | 3.5+ | Required for ANSI-on ergonomics | 4A + existing functions |
| JDBC types | 3.5 mappings | Per-DB breaking mappings | 4C + JDBC tests |
| PyArrow Table | | Supported (4.0+) | P1 / CREATEDATAFRAME |
| Wildcard import | Broader import * | Functions only | Docs only |
| Python | 3.8+ (3.5) | 3.10+ (4.1) | CI matrix |

API surface audit (unchanged process)

Run periodic extraction (extend scripts/extract_pyspark_tests.py / gap scripts):

Known 4.x surface still open: see §5.6 and ROBIN_SPARKLESS_MISSING.md, CREATEDATAFRAME_GAPS.md.


7. Testing strategy

7.1 Oracle dual-track

| Track | PySpark install | Purpose |
| --- | --- | --- |
| Primary (keep) | pyspark>=3.5,<3.6 | Regression guard for Tier A; existing 212 fixtures |
| Secondary (add) | pyspark>=4.0,<4.2 on Python 3.10+ | Tier B/C oracle; new fixtures for 4-only behavior |

Deliverables:

  1. tests/requirements-pyspark4.txt — PySpark 4 + delta-spark>=4.0,<5.
  2. CI job: SPARKLESS_PYSPARK_COMPAT=4.0 SPARKLESS_TEST_BACKEND=pyspark pytest tests/parity tests/dataframe -m "not jdbc" (scope TBD).
  3. Extend tests/gen_pyspark_cases.py to accept --pyspark-version 3.5|4.0 and tag fixture metadata.
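
Deliverable 3 could look roughly like the sketch below. It assumes the generator is an argparse CLI and that fixtures are JSON objects that can carry a `metadata` key; both are assumptions, since the current structure of tests/gen_pyspark_cases.py is not described here, and `parse_args` and `tag_fixture` are hypothetical names.

```python
import argparse


def parse_args(argv=None):
    """Parse the proposed --pyspark-version flag (defaults to the 3.5 oracle)."""
    parser = argparse.ArgumentParser(description="Generate parity fixtures")
    parser.add_argument(
        "--pyspark-version",
        choices=["3.5", "4.0"],
        default="3.5",
        help="oracle version the fixtures are generated against",
    )
    return parser.parse_args(argv)


def tag_fixture(fixture, pyspark_version):
    """Return a copy of the fixture with the oracle version recorded.

    The "metadata" key is a hypothetical fixture field.
    """
    tagged = dict(fixture)
    metadata = dict(tagged.get("metadata", {}))
    metadata["pyspark_version"] = pyspark_version
    tagged["metadata"] = metadata
    return tagged
```

Recording the oracle version in each fixture lets the test runner select or skip fixtures per profile without relying on file naming.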

7.2 Compatibility matrix

Extend run_matrix_tests.py:

| Python | PySpark | Status |
| --- | --- | --- |
| 3.9–3.13 | 3.2–3.5 | Keep (Tier A) |
| 3.10–3.13 | 4.0–4.2 | Add (Tier B); drop 3.9 for this row |
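
The two rows expand into concrete (Python, PySpark) cells as sketched below. How run_matrix_tests.py enumerates its matrix today is not shown in this plan, so `MATRIX_ROWS`, `minor_range`, and `matrix_cells` are hypothetical names used only for illustration.

```python
from itertools import product


def minor_range(major, first, last):
    """Versions major.first through major.last inclusive, as strings."""
    return [f"{major}.{minor}" for minor in range(first, last + 1)]


# One entry per table row: (Python versions, PySpark versions).
MATRIX_ROWS = [
    (minor_range(3, 9, 13), minor_range(3, 2, 5)),   # Tier A: keep
    (minor_range(3, 10, 13), minor_range(4, 0, 2)),  # Tier B: add, no Python 3.9
]


def matrix_cells(rows=MATRIX_ROWS):
    """Flatten the rows into (python, pyspark) pairs, one per container run."""
    return [cell for pythons, pysparks in rows for cell in product(pythons, pysparks)]
```

The Tier A row contributes 5 × 4 = 20 cells and the Tier B row 4 × 3 = 12, so the extended matrix has 32 cells in total, which is worth weighing against CI time.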

7.3 Test taxonomy (markers)

Define pytest markers (document in TESTING_GUIDE.md):

  • pyspark3_only — behavior differs on 4.x; skip when oracle is 4.0.
  • pyspark4_only — requires ANSI-on or 4.x APIs; skip when oracle is 3.5.
  • compat_profile("3.5") / compat_profile("4.0") — run with sparkless.pyspark.compat set.

Rule: Every semantic change for PySpark 4 ships with at least one fixture or issue test per profile.
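
The skip rules for the two version markers reduce to a small decision function. The sketch below keeps that logic free of pytest so it can be unit-tested on its own; `skip_reason` is a hypothetical helper name, and `oracle` is assumed to be the string "3.5" or "4.0".

```python
def skip_reason(marker_names, oracle):
    """Return why a test should be skipped for this oracle, or None to run it.

    marker_names: iterable of marker names attached to the test.
    oracle: "3.5" or "4.0" (assumed to mirror SPARKLESS_PYSPARK_COMPAT).
    """
    markers = set(marker_names)
    if "pyspark3_only" in markers and oracle == "4.0":
        return "behavior differs on PySpark 4.x"
    if "pyspark4_only" in markers and oracle == "3.5":
        return "requires ANSI-on or PySpark 4.x APIs"
    return None
```

In a conftest.py this could be called from a pytest_collection_modifyitems hook, adding pytest.mark.skip(reason=...) to each item for which it returns a reason.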

7.4 sparkless.testing dual mode

Extend sparkless.testing (see TESTING_GUIDE.md):

# Today
SPARKLESS_TEST_BACKEND=pyspark pytest ...

# Proposed
SPARKLESS_TEST_BACKEND=pyspark SPARKLESS_PYSPARK_COMPAT=4.0 pytest ...
SPARKLESS_TEST_BACKEND=sparkless SPARKLESS_PYSPARK_COMPAT=4.0 pytest ...

Comparison helpers should use profile-aware tolerances (e.g. exception vs null under ANSI).
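
One way to make comparisons profile-aware is to compare outcomes (a value or an error) instead of raw values, so that "NULL on 3.5, exception on 4.0" becomes an ordinary expectation. The helper names `run_outcome` and `outcomes_match` are hypothetical, and treating any error as matching any other error by default is an assumption, made because exception class names may differ between PySpark and Sparkless.

```python
def run_outcome(fn):
    """Run fn and capture either ("value", result) or ("error", class name)."""
    try:
        return ("value", fn())
    except Exception as exc:
        return ("error", type(exc).__name__)


def outcomes_match(expected, actual, compare_error_type=False):
    """Compare two outcomes; error class names are checked only on request."""
    if expected[0] != actual[0]:
        return False
    if expected[0] == "error":
        return expected[1] == actual[1] if compare_error_type else True
    return expected[1] == actual[1]
```

A fixture could then store one expected outcome per profile, for example a NULL value for compat=3.5 and an error for compat=4.0, and the same test body would serve both.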


8. Implementation phases

Phases P0–P3 close shared API gaps. Phases 4A–4D are PySpark 4-specific. Phases M1–M3 are maintenance.

P0 — Inventory and gates (2–3 weeks)

Exit criteria: Published delta report; compat config RFC merged; no code behavior change yet.

P1 — Finish shared API parity (ongoing; see FULL_PARITY_ROADMAP)

Continue FULL_PARITY_ROADMAP.md remaining items:

Exit criteria: Gap report “missing” count not regressed; parity fixtures ≥ 212 still green on 3.5 oracle.

P2 — Signature re-baseline for 4.x (2–4 weeks)

  • [ ] Re-run signature gap tooling against PySpark 4.1.
  • [ ] Fix PyO3 #[pyo3(signature = ...)] mismatches in python/src/lib.rs.
  • [ ] Add optional parameters introduced in 4.x without breaking 3.x call sites (defaults preserved).

Exit criteria: ≥ 95% of shared functions exact match on 4.1 introspection; documented exceptions in PYSPARK_DIFFERENCES.md.
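
For callables that expose a signature to `inspect.signature`, the exact-match check can be sketched as below. The existing signature gap tooling is not described in this plan, so `signature_diff` is a hypothetical helper, and `mode_3x` / `mode_4x` are local stand-ins, not the real PySpark functions.

```python
import inspect


def signature_diff(reference, candidate):
    """List parameter differences between a reference and a candidate callable."""
    ref = inspect.signature(reference).parameters
    cand = inspect.signature(candidate).parameters
    return {
        "missing": [name for name in ref if name not in cand],
        "extra": [name for name in cand if name not in ref],
        "default_mismatch": [
            name
            for name in ref
            if name in cand and ref[name].default != cand[name].default
        ],
    }


# Stand-ins illustrating an optional parameter added in 4.x.
def mode_3x(col):
    return col


def mode_4x(col, deterministic=False):
    return col


diff = signature_diff(mode_4x, mode_3x)
print(diff["missing"])
```

Because the new parameter has a default, adding it to the candidate closes the gap without breaking 3.x call sites that pass only `col`.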

P3 — Documentation and user contract (1 week)

Exit criteria: Users can choose Tier A vs B explicitly; no ambiguous “we match PySpark” claims.


4A — ANSI and arithmetic semantics (3–5 weeks)

Owner: robin-sparkless-polars expression + type coercion paths. Scope: §5.1 ANSI rows + try_* functions.

  • [ ] Implement spark.sql.ansi.enabled behavior for: overflow, divide by zero, cast failures, string-to-number parse (align with ANSI compliance subset used in tests).
  • [ ] Default off when compat=3.5; on when compat=4.0.
  • [ ] Port/adapt tests from Spark’s ANSI suites where feasible (Rust unit tests + Python issue tests).

Exit criteria: Curated ANSI fixture pack passes on both profiles; documented divergences stay within the agreed list.

4B — Type system and collect paths (2–4 weeks)

Scope: §5.1 map/interval rows, §5.2 PySpark collect changes.

  • [ ] Map key normalization (-0.0 / 0.0) behind compat=4.0 and spark.sql.legacy.disableMapKeyNormalization.
  • [ ] Map schema inference (first pair vs merge) for create_map / struct maps.
  • [ ] YearMonthIntervalType / DayTimeIntervalType collect shapes per PYSPARK_YM_INTERVAL_LEGACY and compat profile.
  • [ ] Evaluate VARIANT — default: stub + defer in DEFERRED_SCOPE.md unless product priority changes.

Exit criteria: Issue tests for map + interval collect; parity fixtures tagged pyspark4_only where needed.
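
The map key item can be illustrated in plain Python. This is a sketch of the normalization rule only, with hypothetical names `normalize_map_key` and `build_map`; a Python dict already treats -0.0 and 0.0 as the same key, so the example inspects the sign of the stored key and does not model the engine's map representation.

```python
import math


def normalize_map_key(key):
    """Map -0.0 to 0.0; every other key is returned unchanged."""
    if isinstance(key, float) and key == 0.0:
        return 0.0
    return key


def build_map(pairs, normalize_keys):
    """Build a map from (key, value) pairs, normalizing keys when requested.

    normalize_keys=True models compat=4.0; False models
    spark.sql.legacy.disableMapKeyNormalization=true.
    """
    result = {}
    for key, value in pairs:
        result[normalize_map_key(key) if normalize_keys else key] = value
    return result


legacy = build_map([(-0.0, "a")], normalize_keys=False)
modern = build_map([(-0.0, "a")], normalize_keys=True)
# copysign exposes the sign bit that == comparison hides.
print(math.copysign(1.0, next(iter(legacy))), math.copysign(1.0, next(iter(modern))))
```

An issue test for this item would need to observe the sign bit of the collected key, since equality comparison alone cannot tell the two profiles apart.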

4C — PySpark 4 oracle CI (2–3 weeks)

  • [ ] tests/requirements-pyspark4.txt + CI job on Ubuntu, Python 3.11, Java 17.
  • [ ] Subset of tests/parity/ and high-value tests/dataframe/test_issue_*.py green against real PySpark 4.1.
  • [ ] Matrix row: Py 3.10–3.13 × PySpark 4.0–4.2.

Exit criteria: CI badge “PySpark 4 oracle”; failure budget documented (target: 0 for parity dir).

4D — Tier C default (major release only)

  • [ ] Sparkless 5.0.0 (see §9.1): default sparkless.pyspark.compat=4.0; update PyPI bound to sparkless>=5,<6.
  • [ ] Migration guide: enable compat=3.5 for one release cycle with deprecation warning.
  • [ ] Changelog breaking section.

Exit criteria: Semver-major release with clear migration path; Tier A available via config for ≥ 12 months.


M1 — Ongoing parity hygiene

  • Weekly: run make gap-analysis-runtime (or successor) on 3.5 and 4.1.
  • Per PR: new PySpark-facing APIs require fixture + PYSPARK_DIFFERENCES.md entry if behavior differs.
  • Per release: refresh PARITY_STATUS.md counts for both oracles.

M2 — Upstream test port

M3 — Issue-driven backlog


9. Release and semver policy

| Change type | Semver | Example |
| --- | --- | --- |
| New function matching 3.5 and 4.0 | Minor | F.new_fn |
| Bug fix aligning to 3.5 oracle | Patch | Fix filter null semantics |
| New opt-in compat=4.0 behavior | Minor | ANSI throws when enabled |
| Default switch to PySpark 4 semantics | Major | Sparkless 5.0 default compat |
| Remove 3.x-only API | Major | Only after deprecation |

Sparkless package version (4.x) can continue shipping Tier A defaults until Phase 4D.
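
The policy table can be applied mechanically when preparing a release: classify each change, then take the largest required bump. The change-type keys and the `required_bump` helper below are hypothetical labels for the table rows, not an existing release script.

```python
BUMP_ORDER = ["patch", "minor", "major"]

# Hypothetical keys, one per row of the semver policy table.
CHANGE_BUMPS = {
    "new_function": "minor",
    "oracle_bug_fix": "patch",
    "opt_in_compat_behavior": "minor",
    "default_compat_switch": "major",
    "remove_3x_api": "major",
}


def required_bump(change_types):
    """Return the largest bump any change in the release requires, or None."""
    bumps = [CHANGE_BUMPS[change] for change in change_types]
    return max(bumps, key=BUMP_ORDER.index) if bumps else None
```

A release that only adds opt-in compat=4.0 behavior and bug fixes resolves to a minor bump, while any default switch forces a major one.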

9.1 Practical version sequence

Align crate (robin-sparkless, robin-sparkless-core, robin-sparkless-polars) and Python (sparkless) versions on the same number. PyPI constraint today: pip install "sparkless>=4,<5"; update to >=5,<6 only when 5.0.0 ships.

| Version | When | Default compat | PyPI / notes |
| --- | --- | --- | --- |
| 4.8.0 (current) | Baseline before plan execution | Tier A (3.5-like) | sparkless>=4,<5 |
| 4.9.0 | Plan implemented: opt-in PySpark 4 (compat=4.0), ANSI/maps/JDBC/PyArrow, parity oracles; defaults unchanged | Tier A | Still >=4,<5; market as “PySpark 4 ready (opt-in)” |
| 4.9.x | Patches: parity fixes, docs, non-breaking API additions | Tier A | Patch releases only |
| 4.10.0+ (optional) | Further minors if needed before default flip (large API tranches) | Tier A | Minor only while Tier A remains default |
| 5.0.0 | Phase 4D: default sparkless.pyspark.compat=4.0; migration guide; compat=3.5 retained ≥ 12 months with deprecation warning | Tier C | sparkless>=5,<6; major = behavioral cutover, not “plan checklist done” |

4.8.0 (today) → 4.9.0   plan complete, opt-in PySpark 4
              → 4.9.x   patches
              → 4.10+   optional extra minors (still Tier A default)
              → 5.0.0   default compat=4.0 (+ deprecation period for 3.5 default)

Rules of thumb:

  • Finishing the plan (phases P0–4C) → ship 4.9.0, not 5.0.0, unless you also flip defaults in the same release.
  • 5.0.0 is for the behavioral cutover (Tier C), which also helps disambiguate Sparkless v5 (product) from Apache PySpark 4 (compatibility target).
  • Bump robin-sparkless workspace crates and sparkless Python package in lockstep on each release (RELEASING.md).

10. Success metrics

| Metric | Target (Tier B) | Target (Tier C / 4D) |
| --- | --- | --- |
| Parity fixtures vs oracle | 212+ pass on 3.5 | 212+ pass on 4.1 (allow additional 4-only fixtures) |
| tests/parity/ PySpark 4 CI | ≥ 95% pass | 100% pass |
| Signature exact match (shared API) | ≥ 95% vs 4.1 | |
| Compatibility matrix | 3.2–3.5 green | + 4.0–4.2 green on Py 3.10+ |
| Documented intentional diffs | Stable list in PYSPARK_DIFFERENCES | Same + profile column |

11. Risks and mitigations

| Risk | Mitigation |
| --- | --- |
| ANSI semantics diverge from Polars | Implement ANSI layer in expr_ir / coercion, not only Polars defaults; Rust unit tests |
| Dual oracle doubles CI time | Sharded jobs; parity on 4.x nightly, 3.5 on every PR |
| Users expect cluster Spark 4 | Document “local engine subset”; link JVM-only items to DEFERRED_SCOPE |
| VARIANT / new types blocked on Polars | Explicit defer; stub types for parse-only SQL if needed |
| Confusion: Sparkless v4 vs PySpark 4 | Glossary in README and python_migration.md |
| Breaking-change drift in upstream Spark | Pin research to Spark 4.1.x docs; re-diff on minor releases (§4) |

12. Next steps

  1. P0 — PySpark 4.1 API extraction; cross-link each §5 row to gap tracker / GitHub issues.
  2. P0 — Publish sparkless.pyspark.compat RFC mapping §3.3 + §5.1 legacy flags.
  3. 4A — ANSI bundle (§5.1): overflow, cast, divide + try_* fixtures with ansi.enabled on/off.
  4. 4B — Map keys + inference + interval collect (§5.1–5.2).
  5. P1 — Shared API gaps + PyArrow createDataFrame (§5.2).
  6. 4C — requirements-pyspark4.txt, Java 17 CI oracle, JDBC 4.0 test pack (§5.4).

13. Reference index

| Document | Role in this plan |
| --- | --- |
| FULL_PARITY_ROADMAP.md | Shared API phases A–H (mostly 3.5-era) |
| PARITY_STATUS.md | Fixture matrix and phase manifest |
| PYSPARK_DIFFERENCES.md | Intentional divergences; add profile column |
| PYSPARK_VERSION_NOTES.md | PySpark mode test setup |
| DEFERRED_SCOPE.md | Out-of-scope boundaries |
| TESTING_GUIDE.md | Dual-backend testing |
| GAP_ANALYSIS_PYSPARK_REPO.md | Re-run for v4.1 |
| CREATEDATAFRAME_GAPS.md | PyArrow / schema gaps |
| tests/compatibility_matrix/README.md | Version matrix extension |

External (breaking-change research)


Appendix A — Glossary

| Term | Meaning |
| --- | --- |
| Oracle | Real PySpark used to generate expected outputs or run comparison tests |
| Fixture | JSON test case under tests/fixtures/ |
| Profile / compat | 3.5 or 4.0 semantic mode for Sparkless |
| Tier A/B/C | Backwards-compat policy levels (§3.2) |
| Sparkless v4 | Python package major version with Rust engine (not PySpark 4) |

Appendix B — Suggested CI layout (sketch)

# PR: fast path
- sparkless backend, all tests, compat=3.5 (default)
- rust: make check

# PR: pyspark oracle 3.5 (subset or full)
- SPARKLESS_TEST_BACKEND=pyspark, pyspark 3.5, Java 17

# Nightly
- SPARKLESS_TEST_BACKEND=pyspark, pyspark 4.1, compat=4.0, Python 3.11
- compatibility_matrix including pyspark 4.x row

Maintainers: update the Current baseline table and checkboxes when phases complete.