Parity Status (PySpark vs Robin Sparkless)

This doc is the living parity matrix for robin-sparkless and the Sparkless Python package.

Summary (4.13.x)

Metric	Value
Last updated	July 2026 (release 4.13.2)
Main pytest suite	3100+ tests passing (CI on `main`; run `pytest tests -n 12`)
Parity JSON fixtures	212+ hand-written fixtures in phases A–G (`make test-parity-phases`)
Oracle	PySpark 3.5 default; PySpark 4.1 opt-in (compat profiles)
CI	GitHub Actions — format, clippy, audit, Rust + Python tests

Quick links: PySpark differences · Deferred scope · Before you adopt

Harness details

Oracle: PySpark (fixtures generated by tests/gen_pyspark_cases.py). Since 4.9.0, an opt-in PySpark 4.1 oracle is available via tests/requirements-pyspark4.txt and the nightly workflow .github/workflows/pyspark4-oracle.yml
Compat profiles: Default oracle remains PySpark 3.5 / compat=3.5; PySpark 4 tests use SPARKLESS_PYSPARK_COMPAT=4.0 — see PYSPARK_COMPAT_PROFILES.md
Harness: pytest tests/parity/ (and issue-specific tests under tests/dataframe/, tests/sql/). The legacy Rust integration harness tests/parity.rs was removed; use make test-parity-phases.
Fixtures: tests/fixtures/*.json (operations format); tests/fixtures/plans/*.json (plan format, see LOGICAL_PLAN_FORMAT.md); tests/fixtures/phase_manifest.json (phase-to-fixture mapping)
Sparkless integration: Robin-sparkless powers Sparkless v4 (PyO3). Historical Sparkless 3.x expected_outputs can be converted to robin-sparkless fixtures. See SPARKLESS_INTEGRATION_ANALYSIS.md §4.
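
The profile selection described above can be sketched in a few lines. This is only an illustration: the environment variable name and the 3.5 default come from this page, while `resolve_compat` and `uses_pyspark4_oracle` are hypothetical helpers, and how Sparkless itself reads the variable is not shown here.

```python
import os

# Default oracle profile as stated on this page.
DEFAULT_COMPAT = "3.5"


def resolve_compat(environ=None):
    """Return the compat profile string, falling back to the 3.5 default.

    Hypothetical helper: not part of the Sparkless API.
    """
    env = os.environ if environ is None else environ
    value = env.get("SPARKLESS_PYSPARK_COMPAT", "").strip()
    return value or DEFAULT_COMPAT


def uses_pyspark4_oracle(environ=None):
    """True when the selected profile targets a PySpark 4.x oracle."""
    return resolve_compat(environ).split(".")[0] == "4"
```

A test module could use such a helper to skip PySpark 4-only cases unless `SPARKLESS_PYSPARK_COMPAT=4.0` is exported before running pytest.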

Historical phase completion notes (maintainer reference)

Phases A–G and 12–25 completed through **4.13.x**: DataFrameReader/Writer, SparkSession/Catalog stubs, string/array/map/datetime/window functions, plan interpreter, crypto/JVM stubs, signature alignment, and gap closure (bitmap, intervals, cube/rollup, etc.). Fixture `with_rand_seed.json` is marked `"skip": true` (non-deterministic seed parity). Remaining maintainer work: ongoing parity fixes and PySpark 4 profile expansion — see [PYSPARK_4_PARITY_PLAN.md](PYSPARK_4_PARITY_PLAN.md).

Phase test coverage

Parity fixtures are grouped into phases (A–G) defined in tests/fixtures/phase_manifest.json. Run phase-specific tests:

make test-parity-phases    # pytest tests/parity/
# Full suite (includes delta/integration markers): see docs/TESTING_GUIDE.md
pytest tests -n 10

Python phase smoke tests: test_phase_a_signature_alignment, test_phase_b_functions, test_phase_c_reader_writer, test_phase_d_dataframe_methods, test_phase_e_spark_session_catalog, test_phase_f_behavioral. When adding new fixtures, add the fixture name to the appropriate phase in phase_manifest.json. See TEST_CREATION_GUIDE.md for phase testing details.
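
One way to catch a fixture that was added without a manifest entry is a small consistency check. The sketch below assumes that phase_manifest.json maps each phase name to a list of fixture names without the .json extension. That layout is an assumption, not something this page confirms, and both function names are hypothetical.

```python
import json
from pathlib import Path


def unassigned_fixtures(fixture_names, manifest):
    """Return fixture names that no phase in the manifest lists.

    Assumes `manifest` looks like {"<phase>": ["fixture_a", ...], ...}.
    """
    assigned = {name for names in manifest.values() for name in names}
    return sorted(set(fixture_names) - assigned)


def check_fixture_dir(fixtures_dir="tests/fixtures"):
    """Compare *.json files in the fixture directory with the manifest."""
    root = Path(fixtures_dir)
    manifest = json.loads((root / "phase_manifest.json").read_text())
    names = [
        p.stem for p in root.glob("*.json") if p.name != "phase_manifest.json"
    ]
    return unassigned_fixtures(names, manifest)
```

Calling `check_fixture_dir()` from the repository root would then return the fixtures that still need a phase; an empty list means the manifest is complete under the assumed layout.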

Legend

✅ Covered: Covered by one or more fixtures (listed)
🚧 Not yet covered: Supported/partially supported but missing fixture coverage
❌ Not implemented: Not implemented in the Rust API yet
⚠️ Diverges: Implemented but intentionally differs from PySpark (must be documented)

Coverage Matrix (high level)

Area	Capability	Status	Fixtures
Data creation	`SparkSession::create_dataframe` (simple rows)	✅ Covered	`filter_age_gt_30`, `groupby_count`, `groupby_with_nulls` (and most others)
Data creation	`SparkSession::create_dataframe_from_rows` (arbitrary schema)	✅ Covered	Used by plan interpreter; plan fixtures
Plan execution	`execute_plan` (serialized logical plan)	✅ Covered	`tests/fixtures/plans/filter_select_limit`, `join_simple`, `with_column_functions` (plan_parity_fixtures)
IO	`read_csv`	✅ Covered	`read_csv`
IO	`read_parquet`	✅ Covered	`read_parquet`
IO	`read_json`	✅ Covered	`read_json`
IO	`spark.read().option/options().csv` (reader options)	✅ Covered	`read_csv_with_options`
IO	`spark.read().table(name)` (temp view)	✅ Covered	`read_table`
DataFrame	`select`	✅ Covered	many (e.g. `filter_age_gt_30`)
DataFrame	`filter` basic comparisons	✅ Covered	`filter_age_gt_30`
DataFrame	`filter` nested boolean logic	✅ Covered	`filter_and_or`, `filter_nested`, `filter_not`
DataFrame	`orderBy`	✅ Covered	many (e.g. `filter_age_gt_30`, `groupby_count`)
GroupBy	`groupBy(...).count()`	✅ Covered	`groupby_count`, `groupby_with_nulls`
GroupBy	`groupBy(...).sum()`	✅ Covered	`groupby_sum`
GroupBy	`groupBy(...).avg()`	✅ Covered	`groupby_avg`
GroupBy	`groupBy(...).min()`	✅ Covered	`groupby_min`
GroupBy	`groupBy(...).max()`	✅ Covered	`groupby_max`
GroupBy	groupBy with NULL keys	✅ Covered	`groupby_null_keys`
GroupBy	groupBy single-row groups / single group	✅ Covered	`groupby_single_row_groups`, `groupby_single_group`
GroupBy	multi-agg `agg([..])`	✅ Covered	`groupby_multi_agg`
GroupBy	stddev, variance, count_distinct in agg	✅ Covered	`groupby_stddev_count_distinct`
DataFrame	`withColumn` (arithmetic)	✅ Covered	`type_coercion_mixed`
DataFrame	`withColumn` (logical/boolean)	✅ Covered	`with_logical_column`
DataFrame	`withColumn` (mixed arithmetic + comparison)	✅ Covered	`with_arithmetic_logical_mix`
Functions	`when().then().otherwise()`	✅ Covered	`when_otherwise`, `when_then_otherwise`
Functions	`coalesce()`	✅ Covered	`coalesce`
Null semantics	NULL equality/inequality	✅ Covered	`null_comparison_equality`
Null semantics	NULL ordering comparisons	✅ Covered	`null_comparison_ordering`
Null semantics	`eqNullSafe`	✅ Covered	`null_safe_equality`
Null semantics	NULLs inside filter predicates	✅ Covered	`null_in_filter`
Type coercion	numeric comparison coercion (int vs double)	✅ Covered	`type_coercion_numeric`
Type coercion	numeric arithmetic coercion (int + double)	✅ Covered	`type_coercion_mixed`
Joins	inner/left/right/outer joins	✅ Covered	`inner_join`, `left_join`, `right_join`, `outer_join`
Joins	join with NULL keys (inner: nulls excluded)	✅ Covered	`join_null_keys`
Joins	join with duplicate keys (cartesian match)	✅ Covered	`join_duplicate_keys`
Windows	row_number, rank, dense_rank, lag, lead	✅ Covered	`row_number_window`, `rank_window`, `lag_lead_window`
Strings	upper, lower, substring, concat, concat_ws	✅ Covered	`string_upper_lower`, `string_substring`, `string_concat`
Strings	length, trim, ltrim, rtrim, regexp_extract, regexp_replace, split, initcap	✅ Covered	`string_length_trim`
Config	`spark.sql.caseSensitive` (case-insensitive column resolution)	✅ Covered	`case_insensitive_columns`
DataFrame	`union` / `unionAll`	✅ Covered	`union_all`
DataFrame	`unionByName`	✅ Covered	`union_by_name`
DataFrame	`distinct` / `dropDuplicates`	✅ Covered	`distinct`
DataFrame	`drop` (columns)	✅ Covered	`drop_columns`
DataFrame	`dropna`	✅ Covered	`dropna`
DataFrame	`fillna` (single value)	✅ Covered	`fillna`
DataFrame	`limit`	✅ Covered	`limit`
DataFrame	`withColumnRenamed`	✅ Covered	`with_column_renamed`
Array/List	array, array_contains, element_at, size/array_size, array_join, array_sort, array_slice, explode; array_position, array_remove, posexplode (implemented)	✅ Covered	`array_contains`, `element_at`, `array_size`
Windows	first_value, last_value, percent_rank	✅ Covered	`first_value_window`, `last_value_window`, `percent_rank_window`
Windows	cume_dist, ntile, nth_value	✅ Covered	`cume_dist_window`, `ntile_window`, `nth_value_window` (multi-step workaround in harness)
Strings	regexp_extract_all, regexp_like	✅ Covered	`regexp_extract_all`, `regexp_like`
Strings	repeat, reverse, instr, lpad, rpad	✅ Covered	`string_repeat_reverse`, `string_lpad_rpad`
Strings	mask, translate, substring_index; soundex, levenshtein, crc32, xxhash64 (Phase 8)	✅ Covered	`string_mask`, `string_translate`, `string_substring_index`, `string_soundex`, `string_levenshtein`, `string_crc32`, `string_xxhash64`
Strings (Phase 13)	ascii, format_number, overlay, position, char, chr, base64, unbase64, sha1, sha2, md5	✅ Implemented	`string_ascii`, `string_format_number`
Strings (Phase 16)	regexp_count, regexp_instr, regexp_substr, split_part, find_in_set, format_string, printf	✅ Covered	`regexp_count`, `regexp_substr`, `regexp_instr`, `split_part`, `find_in_set`, `format_string`
Datetime (Phase 17)	unix_timestamp, from_unixtime, make_date, timestamp_seconds/millis/micros, unix_date, date_from_unix_date	✅ Covered	`unix_timestamp`, `from_unixtime`, `make_date`, `timestamp_seconds`, `timestamp_millis`, `timestamp_micros`, `unix_date`, `date_from_unix_date`
Math (Phase 17)	pmod, factorial	✅ Covered	`pmod`, `factorial`
Array	array_sum, array_exists, forall, filter, transform; array_flatten, array_repeat (Phase 8); array_compact (Phase 13)	✅ Implemented	`array_sum`
Map	create_map, map_keys, map_values, map_entries, map_from_arrays (Phase 8)	🚧 Not yet covered	No fixture yet
JSON	get_json_object, from_json, to_json (Phase 10)	✅ get_json_object covered	`json_get_json_object`
Math	sqrt, pow, exp, log	✅ Covered	`math_sqrt_pow`
GroupBy	first, last, approx_count_distinct in agg	✅ Covered	`groupby_first_last`
GroupBy (Phase 19)	any_value, bool_and, bool_or, product, collect_list, collect_set, count_if, percentile, max_by, min_by	✅ Covered	`groupby_any_value`, `groupby_product`
Misc (Phase 19)	try_divide, try_add, try_subtract, try_multiply, width_bucket, elt, bit_length, typeof	✅ Covered	`try_divide`, `width_bucket`
DataFrame	replace, crossJoin, describe, subtract, intersect	✅ Covered	`replace`, `cross_join`, `describe`, `subtract`, `intersect`
SQL	`SparkSession::sql()` (optional `sql` feature)	✅ Implemented	No fixture (SQL translated to DataFrame ops; parity via DataFrame fixtures)
Datetime	year, month, day, to_date, date_format; current_date, date_add, hour, etc.	✅ Covered	`date_add_sub`, `datediff`, `datetime_hour_minute`
DataFrame (Phase 12)	first, head, offset, sample, to_json, summary, stat, select_expr, freq_items, crosstab, melt, etc. (Rust + PyO3)	✅ first/head/offset/summary covered	`first_row`, `head_n`, `offset_n`, `summary`; additional Phase 12 ops implemented, fixtures TBD
DataFrame (Phase D)	createOrReplaceTempView, corr(col1,col2), cov(col1,col2), toDF/toJSON/toPandas, columns, cache, hint, repartitionByRange, sortWithinPartitions, sameSemantics, semanticHash, isLocal, inputFiles, writeTo (stub)	✅ Implemented	Python: `test_phase_d_dataframe_methods`; table read via `read_table` fixture
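
The null-semantics and join rows in the matrix rest on SQL three-valued logic. The following is a pure-Python model of that logic, with `None` standing in for NULL. It is a sketch for intuition, not the Rust implementation or the fixture contents, and all function names are illustrative.

```python
def sql_eq(a, b):
    """SQL `=`: the result is unknown (None) when either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def eq_null_safe(a, b):
    """Null-safe equality (eqNullSafe): always returns True or False."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def sql_filter(rows, predicate):
    """Keep rows whose predicate is exactly True; NULL and False both drop."""
    return [r for r in rows if predicate(r) is True]


def inner_join(left_keys, right_keys):
    """Inner join on plain `=`: NULL keys never match, duplicates multiply."""
    return [
        (l, r) for l in left_keys for r in right_keys if sql_eq(l, r) is True
    ]
```

In this model, `inner_join([1, None, 2, 2], [2, None, 2])` yields four `(2, 2)` pairs: the NULL keys are excluded and each duplicate key on the left matches each duplicate on the right.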

Fixture Index

Fixture	What it covers
`filter_age_gt_30`	Filter + select + orderBy (baseline)
`filter_and_or`	AND/OR precedence + parentheses
`filter_nested`	Nested boolean logic
`filter_not`	NOT / negation
`groupby_count`	groupBy + count + orderBy
`groupby_with_nulls`	groupBy with NULLs
`groupby_sum`	groupBy + sum
`groupby_avg`	groupBy + avg
`groupby_min`	groupBy + min
`groupby_max`	groupBy + max
`groupby_null_keys`	groupBy with NULL keys
`groupby_single_row_groups`	groupBy with single-row groups (each key once)
`groupby_single_group`	groupBy with single group (all same key)
`join_null_keys`	inner join with NULL join keys (nulls excluded)
`join_duplicate_keys`	inner join with duplicate keys (multiple matches)
`case_insensitive_columns`	case-insensitive column resolution (filter/select/orderBy with mixed-case names)
`read_csv`	CSV read path + operations
`read_parquet`	Parquet read path + operations
`read_json`	JSON read path + operations
`read_csv_with_options`	spark.read.option("header","true").csv(path) with reader_options
`read_table`	spark.read.table("name") via table_source (temp view)
`with_logical_column`	Logical columns/expressions in withColumn
`with_arithmetic_logical_mix`	Mixed arithmetic + comparison in withColumn
`when_otherwise`	when/then/otherwise
`when_then_otherwise`	chained when
`coalesce`	coalesce null handling
`null_comparison_equality`	NULL equality/inequality semantics
`null_comparison_ordering`	NULL ordering semantics
`null_safe_equality`	eqNullSafe semantics
`null_in_filter`	NULLs in filter predicates
`type_coercion_numeric`	int/double comparison coercion
`type_coercion_mixed`	int+double arithmetic coercion
`inner_join`	inner join on dept_id
`left_join`	left join + orderBy
`right_join`	right join + orderBy
`outer_join`	outer join + orderBy
`groupby_multi_agg`	groupBy + multiple aggregations in one agg()
`groupby_stddev_count_distinct`	groupBy + stddev and count_distinct in agg
`row_number_window`	row_number() over partition by dept order by salary desc
`rank_window`	rank() over partition with ties
`lag_lead_window`	lag and lead over partition
`string_upper_lower`	upper(), lower()
`string_substring`	substring() 1-based
`string_concat`	concat(), concat_ws()
`string_length_trim`	length(), trim() in withColumn
`union_all`	union (vertical stack, same schema)
`union_by_name`	unionByName (align columns by name)
`distinct`	distinct (drop duplicate rows)
`drop_columns`	drop(columns)
`dropna`	dropna (drop rows with nulls)
`fillna`	fillna (fill nulls with value)
`limit`	limit(n)
`with_column_renamed`	withColumnRenamed(old, new)
`array_contains`	split + array_contains(col, lit)
`element_at`	split + element_at(col, 1-based index)
`array_size`	split + size(col)
`first_value_window`	first_value over partition
`last_value_window`	last_value over partition
`percent_rank_window`	percent_rank over partition
`cume_dist_window`	cume_dist over partition
`ntile_window`	ntile(n) over partition
`nth_value_window`	nth_value over partition
`regexp_like`	regexp_like(col, pattern) boolean match
`regexp_extract_all`	regexp_extract_all(col, pattern) list of matches
`string_repeat_reverse`	repeat(col, n), reverse(col)
`string_lpad_rpad`	lpad(col, len, pad), rpad(col, len, pad)
`math_sqrt_pow`	sqrt(col), pow(col, exp)
`groupby_first_last`	groupBy + first(name), last(name)
`groupby_any_value`	groupBy + any_value(column)
`groupby_product`	groupBy + product(column)
`try_divide`	try_divide(col, col) — null on divide-by-zero
`width_bucket`	width_bucket(value, min, max, num_bucket)
`cross_join`	crossJoin (cartesian product)
`describe`	describe() summary statistics
`summary`	summary() (same as describe)
`replace`	replace(column, old_value, new_value)
`subtract`	subtract (set difference)
`intersect`	intersect (set intersection)
`first_row`	first() – first row as one-row DataFrame
`head_n`	head(n) – first n rows
`offset_n`	offset(n) – skip first n rows
`string_mask`	mask(col) – replace upper/lower/digit with X/x/n
`string_translate`	translate(col, from_str, to_str)
`string_substring_index`	substring_index(col, delim, count) before/after nth delim
`array_sum`	array(cols) + array_sum(col)
`json_get_json_object`	get_json_object(col, '$.path')
`date_add_sub`	date_add(col('d'), 7), date_sub(col('d'), 3)
`datediff`	datediff(col('end'), col('start'))
`datetime_hour_minute`	hour(col('ts')), minute(col('ts')) with timestamp input
`string_soundex`	soundex(col('name'))
`string_levenshtein`	levenshtein(col('a'), col('b'))
`string_crc32`	crc32(col('s'))
`string_xxhash64`	xxhash64(col('s'))
`string_ascii`	ascii(col('name')) → first-char code point
`string_format_number`	format_number(col('value'), 2) → fixed-decimal string
`phase15_aliases_nvl_isnull`	nvl, nvl2, isnull, isnotnull (Phase 15)
`string_left_right_replace`	left, right, replace, startswith, endswith, contains, like, ilike, rlike
`math_cosh_cbrt`	cosh, sinh, tanh, acosh, asinh, atanh, cbrt, expm1, log1p, log10, log2, rint, hypot
`array_distinct`	array_distinct(col) — JSON fixture may be skipped; Python tests in `tests/dataframe/test_issue_415_array_distinct.py` and `test_issue_439_` run in main suite
`regexp_count`	regexp_count(col, pattern) – count non-overlapping matches
`regexp_substr`	regexp_substr(col, pattern) – first match substring
`regexp_instr`	regexp_instr(col, pattern) – 1-based position of first match
`split_part`	split_part(col, delim, part_num) – 1-based part of split
`find_in_set`	find_in_set(col('str'), col('set')) – 1-based index in comma-delimited list
`format_string`	format_string('%d %s', col('a'), col('b')) – printf-style formatting
`unix_timestamp`	unix_timestamp(col), unix_timestamp(col, format) – string to seconds
`from_unixtime`	from_unixtime(col), from_unixtime(col, format) – seconds to formatted string
`make_date`	make_date(year, month, day) – build date from parts
`timestamp_seconds`	timestamp_seconds(col) – seconds epoch to timestamp
`timestamp_millis`	timestamp_millis(col) – millis epoch to timestamp
`timestamp_micros`	timestamp_micros(col) – micros epoch to timestamp
`unix_date`	unix_date(col) – date to days since epoch
`date_from_unix_date`	date_from_unix_date(col) – days to date
`pmod`	pmod(a, b) – positive modulus
`factorial`	factorial(n) – n! for n in 0..20
`with_bit_ops`	bit operations (bit_and, bit_or, bit_xor, bit_count, bit_get) via withColumn
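
Several string fixtures above rely on 1-based indexing, which is easy to get wrong when porting from Python. The sketch below models `substring_index`, `split_part`, and `find_in_set` in plain Python, following PySpark's documented behaviour as I understand it. Edge cases here are assumptions and have not been checked against the fixtures.

```python
def substring_index(s, delim, count):
    """Text before the count-th delimiter (count > 0) or after it (count < 0)."""
    if count == 0 or delim == "":
        return ""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])


def split_part(s, delim, part_num):
    """1-based part of the split; negative part_num counts from the end."""
    if part_num == 0:
        raise ValueError("part_num must not be 0")
    parts = s.split(delim) if delim else [s]
    idx = part_num - 1 if part_num > 0 else len(parts) + part_num
    return parts[idx] if 0 <= idx < len(parts) else ""


def find_in_set(item, csv):
    """1-based position of item in a comma-delimited list, 0 when absent."""
    if "," in item:
        return 0
    parts = csv.split(",")
    return parts.index(item) + 1 if item in parts else 0
```

For example, `substring_index("a.b.c", ".", 2)` gives `"a.b"`, `split_part("a,b,c", ",", 2)` gives `"b"`, and `find_in_set("b", "a,b,c")` gives `2`.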

Next additions to the matrix (recommended)

Add more join edge-case fixtures (e.g. left/outer with null keys) if needed.
ROADMAP Phases 16–27: Phases 18–19 completed. Phases 20–24 (full parity in 5 parts), Phase 25 (readiness for post-refactor merge), Phase 26 (publish crate on crates.io), Phase 27 (Sparkless integration, 200+ tests). See ROADMAP.md, GAP_ANALYSIS_SPARKLESS_3.28.md.

Sparkless Test Conversion

Sparkless (github.com/eddiethedean/sparkless) has 270+ JSON expected outputs in tests/expected_outputs/. These can drive robin-sparkless parity tests via a fixture converter that maps the Sparkless JSON format to the robin-sparkless fixture format. See SPARKLESS_INTEGRATION_ANALYSIS.md §4 for:

- Fixture format comparison (input_data vs input/rows; expected_output vs expected)
- Conversion steps per test
- Priority order: parity/dataframe, parity/functions, then parity/sql
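
A converter of this kind might look roughly like the sketch below. Only the key names `input_data`, `expected_output`, `input`, `rows`, and `expected` come from the comparison above. The nesting of `rows` under `input`, the `name` field, and the function itself are assumptions made for illustration; the authoritative layout is the one in SPARKLESS_INTEGRATION_ANALYSIS.md §4.

```python
def convert_sparkless_case(name, sparkless_case):
    """Map one Sparkless expected-output record to a robin-sparkless fixture.

    Hypothetical sketch: the target shape {"input": {"rows": ...}} is assumed.
    """
    required = ("input_data", "expected_output")
    missing = [key for key in required if key not in sparkless_case]
    if missing:
        raise KeyError(f"{name}: missing keys {missing}")
    return {
        "name": name,
        "input": {"rows": sparkless_case["input_data"]},
        "expected": sparkless_case["expected_output"],
    }
```

Failing loudly on missing keys keeps a bulk conversion from silently producing empty fixtures that would then pass trivially.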