Parity Status (PySpark vs Robin Sparkless)

This doc is the living parity matrix for robin-sparkless.

  • Oracle: PySpark (fixtures generated by tests/gen_pyspark_cases.py); 4.9.0+ adds opt-in PySpark 4.1 oracle via tests/requirements-pyspark4.txt and nightly workflow .github/workflows/pyspark4-oracle.yml
  • Compat profiles: Default oracle remains PySpark 3.5 / compat=3.5; PySpark 4 tests use SPARKLESS_PYSPARK_COMPAT=4.0 — see PYSPARK_COMPAT_PROFILES.md
  • Harness: pytest tests/parity/ (and issue-specific tests under tests/dataframe/, tests/sql/). The legacy Rust integration harness tests/parity.rs was removed; use make test-parity-phases.
  • Fixtures: tests/fixtures/*.json (operations format); tests/fixtures/plans/*.json (plan format, see LOGICAL_PLAN_FORMAT.md); tests/fixtures/phase_manifest.json (phase-to-fixture mapping)
  • Sparkless integration: Robin-sparkless is designed to replace Sparkless's backend. Sparkless has 270+ expected_outputs; a fixture converter can convert those to robin-sparkless format. See SPARKLESS_INTEGRATION_ANALYSIS.md §4.

Status as of May 2026: Main pytest suite 3115 passed, 64 skipped (pytest tests -n 12). Parity JSON fixtures: 212+ hand-written fixtures in phases A–G; run via make test-parity-phases. Fixture with_rand_seed.json is marked "skip": true in tooling (non-deterministic seed parity).

  • Phase G (COMPLETED): Parity fixture expansion. 201 hand-written fixtures passing (filter_age_lt_25, filter_name_eq, select_single_column, groupby_count_desc, limit_one, orderby_desc, with_column_lit, distinct_all, fillna_simple, filter_then_select, groupby_sum_simple, filter_ge, filter_ne, filter_le, filter_or_simple, filter_eq_lit, select_reorder, and 40+ more added).
  • Phase C (COMPLETED): DataFrameReader/Writer parity. spark.read().option/options/format/load/table/csv/parquet/json; df.write().option/options/partition_by/parquet/csv/json; fixtures read_csv_with_options, read_table.
  • Phase D (COMPLETED): DataFrame method gaps. df.createOrReplaceTempView, df.corr(col1,col2), df.cov(col1,col2), toDF/toJSON/toPandas, columns, cache, hint, repartitionByRange, sortWithinPartitions, sameSemantics, semanticHash, isLocal, inputFiles, writeTo (stub).
  • Phase E (COMPLETED): SparkSession & Catalog stubs. spark.catalog(), spark.conf(), spark.range(), spark.version, spark.newSession(), spark.stop(), spark.getActiveSession(), spark.getDefaultSession(), spark.udf() (stub); Catalog 27 methods (functional: dropTempView, listTables, tableExists, etc.; stubs: cacheTable, createTable, etc.).
  • Gap closure (Feb 2026): bitmap (5), make_dt_interval, make_ym_interval, to_timestamp_ltz/ntz, sequence, shuffle, inline, inline_outer, regr_* (9); DataFrame cube, rollup, write, data, toLocalIterator, persist/unpersist and stubs (rdd, foreach, foreachPartition, mapInPandas, mapPartitions, storageLevel, isStreaming, withWatermark).
  • Signature alignment (optional params and two-arg when): fixtures position_start, assert_true_err_msg, like_escape_char, ilike_escape_char, months_between_round_off, parse_url_key, make_timestamp_timezone, to_timestamp_format, to_char_format, when_two_arg added.
  • Phase 25 (COMPLETED): Plan interpreter (execute_plan), expression interpreter, LOGICAL_PLAN_FORMAT.md, plan fixtures in tests/fixtures/plans/ (filter_select_limit, join_simple, with_column_functions), plan_parity_fixtures test; create_dataframe_from_rows (Rust + Python). Remaining: Phase 26 (crate publish), Phase 27 (Sparkless integration).
  • Phase 24 (COMPLETED): bit (bit_and, bit_or, bit_xor, bit_count, bit_get, bitwise_not/bitwiseNOT), control (assert_true, raise_error), JVM stubs (broadcast, spark_partition_id, input_file_name, monotonically_increasing_id, current_catalog, current_database, current_schema, current_user, user), random (rand, randn with per-row values when used in with_column/with_columns), crypto (aes_encrypt, aes_decrypt, try_aes_decrypt; AES-128-GCM). Fixtures with_bit_ops, with_rand_seed, with_jvm_stubs. See PYSPARK_DIFFERENCES.md for crypto semantics.
  • Phase 23 (COMPLETED): JSON/URL/misc (isin, url_decode, url_encode, json_array_length, parse_url, hash, shift_left, shift_right, version, equal_null, stack); fixtures with_isin, with_url_decode, with_url_encode, json_array_length_test, with_hash, with_shift_left.
  • Phase 22 (COMPLETED): Datetime extensions (curdate, now, localtimestamp, date_diff, dateadd, datepart, extract, date_part, unix_micros, unix_millis, unix_seconds, dayname, weekday, make_timestamp, make_timestamp_ntz, make_interval, timestampadd, timestampdiff, days, hours, minutes, months, years, from_utc_timestamp, to_utc_timestamp, convert_timezone, current_timezone, to_timestamp); fixtures with_dayname, with_weekday, with_extract, with_unix_micros, make_timestamp_test, timestampadd_test, from_utc_timestamp_test.
  • Phase 21 (COMPLETED): String (btrim, locate, conv), binary (hex, unhex, bin, getbit), type (to_char, to_varchar, to_number, try_to_number, try_to_timestamp), array (arrays_overlap, arrays_zip, explode_outer, posexplode_outer, array_agg), map (str_to_map), struct (transform_keys, transform_values).
  • Phase 20 (COMPLETED): Ordering (asc, desc, nulls_first/last), aggregates (median, mode, stddev_pop, var_pop, try_sum, try_avg), numeric (bround, negate, positive, cot, csc, sec, e, pi); fixtures groupby_median, with_bround; OrderBy supports optional nulls_first.
  • Phase 19 (COMPLETED): Aggregates (any_value, bool_and, bool_or, count_if, max_by, min_by, percentile, product, collect_list, collect_set), try_* (try_divide, try_add, try_subtract, try_multiply), misc (width_bucket, elt, bit_length, typeof); fixtures groupby_any_value, groupby_product, try_divide, width_bucket.
  • Phase 18 (COMPLETED): array/map/struct (map_filter, zip_with, map_zip_with).
  • Phase 17 (COMPLETED): Datetime/unix, math (pmod, factorial).
  • Phase 16 (COMPLETED): String/regex.
  • Phase 15 (COMPLETED): aliases, string, math, array_distinct. Remaining: ROADMAP Phases 25–26 (crate publish, Sparkless integration).
  • Phase 14: Math (sin, cos, tan, asin, acos, atan, atan2, degrees, radians, signum), datetime (quarter, weekofyear, dayofweek, dayofyear, add_months, months_between, next_day), type/conditional (cast, try_cast, isnan, greatest, least); parity parser extended; fixtures math_sin_cos, datetime_quarter_week.
  • Phase 13: String/binary/collection batch 1: ascii, format_number, overlay, position, char, chr, base64, unbase64, sha1, sha2, md5, array_compact implemented in Rust; parity parser and fixtures string_ascii, string_format_number.
  • Phase 12: DataFrame methods implemented in Rust and exposed in Python: sample, random_split, first, head, tail, take, is_empty, to_json, to_pandas, explain, print_schema, checkpoint, repartition, coalesce, offset, summary, to_df, select_expr, col_regex, with_columns, with_columns_renamed, stat (cov/corr), na (fill/drop), freq_items, approx_quantile, crosstab, melt, except_all, intersect_all, sample_by, and Spark no-ops. Parity fixtures for first/head/offset: first_row, head_n, offset_n.
  • Phase 11: Parity harness supports date, timestamp, and boolean in fixture input; datetime fixtures date_add_sub, datediff, datetime_hour_minute; String 6.4 fixtures string_soundex, string_levenshtein, string_crc32, string_xxhash64. Window fixtures percent_rank, cume_dist, ntile, nth_value are covered (multi-step workaround in harness).
  • Phase 6: array functions array_position, array_remove, posexplode are implemented (via Polars list.eval); array fixtures array_contains, element_at, array_size, array_sum; array extensions (exists, forall, filter, transform, array_sum, array_mean; Phase 8: array_flatten, array_repeat implemented via map UDFs).
  • Phase 8: Map (create_map, map_keys, map_values, map_entries, map_from_arrays implemented; Map as List(Struct{key, value})). JSON (get_json_object, from_json, to_json implemented).
  • CI runs format, clippy, audit, deny, and all tests (including parity).
  • Python smoke tests live in tests/python/ (run via make test or make test-python); see EMBEDDING.md.
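
The "skip": true marker mentioned above can be honoured by a small discovery helper. The following is only a sketch of how a harness might collect runnable fixtures; the function name load_runnable_fixtures is hypothetical, and it assumes each fixture file is a single JSON object. The real harness under tests/parity/ may work differently.

```python
import json
from pathlib import Path


def load_runnable_fixtures(fixture_dir):
    """Return (name, fixture) pairs for every JSON fixture not marked "skip": true.

    Hypothetical helper: assumes one JSON object per fixture file.
    """
    runnable = []
    for path in sorted(Path(fixture_dir).glob("*.json")):
        # The phase manifest lives next to the fixtures but is not a fixture.
        if path.name == "phase_manifest.json":
            continue
        with path.open(encoding="utf-8") as fh:
            fixture = json.load(fh)
        # Only a literal JSON `true` skips the fixture; a missing key means "run it".
        if isinstance(fixture, dict) and fixture.get("skip") is True:
            continue
        runnable.append((path.stem, fixture))
    return runnable
```

Calling load_runnable_fixtures("tests/fixtures") would then leave out with_rand_seed as long as that file carries the skip marker.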

Phase test coverage

Parity fixtures are grouped into phases (A–G) defined in tests/fixtures/phase_manifest.json. Run phase-specific tests:

make test-parity-phases    # pytest tests/parity/
# Full suite (includes delta/integration markers): see docs/TESTING_GUIDE.md
pytest tests -n 10

Python phase smoke tests: test_phase_a_signature_alignment, test_phase_b_functions, test_phase_c_reader_writer, test_phase_d_dataframe_methods, test_phase_e_spark_session_catalog, test_phase_f_behavioral. When adding new fixtures, add the fixture name to the appropriate phase in phase_manifest.json. See TEST_CREATION_GUIDE.md for phase testing details.
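
Registering a new fixture in the manifest could be scripted along the following lines. This is a sketch under a loud assumption: it treats phase_manifest.json as a JSON object mapping each phase letter to a list of fixture names, which this document does not confirm. Check the actual file layout (and TEST_CREATION_GUIDE.md) before using anything like it. The name register_fixture is hypothetical.

```python
import json
from pathlib import Path


def register_fixture(manifest_path, phase, fixture_name):
    """Add fixture_name to `phase` in the manifest; return True if the file changed.

    ASSUMED manifest layout (not confirmed by the docs):
        {"A": ["fixture_one", ...], "B": [...], ...}
    """
    path = Path(manifest_path)
    manifest = json.loads(path.read_text(encoding="utf-8"))
    fixtures = manifest.setdefault(phase, [])
    if fixture_name in fixtures:
        return False  # already registered, leave the file untouched
    fixtures.append(fixture_name)
    fixtures.sort()  # keep the list stable so diffs stay small
    path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return True
```

The helper is idempotent, so running it twice for the same fixture leaves the manifest unchanged.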


Legend

  • ✅ Covered: Covered by one or more fixtures (listed)
  • 🚧 Not yet covered: Supported/partially supported but missing fixture coverage
  • ❌ Not implemented: Not implemented in the Rust API yet
  • ⚠️ Diverges: Implemented but intentionally differs from PySpark (must be documented)

Coverage Matrix (high level)

| Area | Capability | Status | Fixtures |
| --- | --- | --- | --- |
| Data creation | SparkSession::create_dataframe (simple rows) | ✅ Covered | filter_age_gt_30, groupby_count, groupby_with_nulls (and most others) |
| Data creation | SparkSession::create_dataframe_from_rows (arbitrary schema) | ✅ Covered | Used by plan interpreter; plan fixtures |
| Plan execution | execute_plan (serialized logical plan) | ✅ Covered | tests/fixtures/plans/filter_select_limit, join_simple, with_column_functions (plan_parity_fixtures) |
| IO | read_csv | ✅ Covered | read_csv |
| IO | read_parquet | ✅ Covered | read_parquet |
| IO | read_json | ✅ Covered | read_json |
| IO | spark.read().option/options().csv (reader options) | ✅ Covered | read_csv_with_options |
| IO | spark.read().table(name) (temp view) | ✅ Covered | read_table |
| DataFrame | select | ✅ Covered | many (e.g. filter_age_gt_30) |
| DataFrame | filter basic comparisons | ✅ Covered | filter_age_gt_30 |
| DataFrame | filter nested boolean logic | ✅ Covered | filter_and_or, filter_nested, filter_not |
| DataFrame | orderBy | ✅ Covered | many (e.g. filter_age_gt_30, groupby_count) |
| GroupBy | groupBy(...).count() | ✅ Covered | groupby_count, groupby_with_nulls |
| GroupBy | groupBy(...).sum() | ✅ Covered | groupby_sum |
| GroupBy | groupBy(...).avg() | ✅ Covered | groupby_avg |
| GroupBy | groupBy(...).min() | ✅ Covered | groupby_min |
| GroupBy | groupBy(...).max() | ✅ Covered | groupby_max |
| GroupBy | groupBy with NULL keys | ✅ Covered | groupby_null_keys |
| GroupBy | groupBy single-row groups / single group | ✅ Covered | groupby_single_row_groups, groupby_single_group |
| GroupBy | multi-agg agg([..]) | ✅ Covered | groupby_multi_agg |
| GroupBy | stddev, variance, count_distinct in agg | ✅ Covered | groupby_stddev_count_distinct |
| DataFrame | withColumn (arithmetic) | ✅ Covered | type_coercion_mixed |
| DataFrame | withColumn (logical/boolean) | ✅ Covered | with_logical_column |
| DataFrame | withColumn (mixed arithmetic + comparison) | ✅ Covered | with_arithmetic_logical_mix |
| Functions | when().then().otherwise() | ✅ Covered | when_otherwise, when_then_otherwise |
| Functions | coalesce() | ✅ Covered | coalesce |
| Null semantics | NULL equality/inequality | ✅ Covered | null_comparison_equality |
| Null semantics | NULL ordering comparisons | ✅ Covered | null_comparison_ordering |
| Null semantics | eqNullSafe | ✅ Covered | null_safe_equality |
| Null semantics | NULLs inside filter predicates | ✅ Covered | null_in_filter |
| Type coercion | numeric comparison coercion (int vs double) | ✅ Covered | type_coercion_numeric |
| Type coercion | numeric arithmetic coercion (int + double) | ✅ Covered | type_coercion_mixed |
| Joins | inner/left/right/outer joins | ✅ Covered | inner_join, left_join, right_join, outer_join |
| Joins | join with NULL keys (inner: nulls excluded) | ✅ Covered | join_null_keys |
| Joins | join with duplicate keys (cartesian match) | ✅ Covered | join_duplicate_keys |
| Windows | row_number, rank, dense_rank, lag, lead | ✅ Covered | row_number_window, rank_window, lag_lead_window |
| Strings | upper, lower, substring, concat, concat_ws | ✅ Covered | string_upper_lower, string_substring, string_concat |
| Strings | length, trim, ltrim, rtrim, regexp_extract, regexp_replace, split, initcap | ✅ Covered | string_length_trim |
| Config | spark.sql.caseSensitive (case-insensitive column resolution) | ✅ Covered | case_insensitive_columns |
| DataFrame | union / unionAll | ✅ Covered | union_all |
| DataFrame | unionByName | ✅ Covered | union_by_name |
| DataFrame | distinct / dropDuplicates | ✅ Covered | distinct |
| DataFrame | drop (columns) | ✅ Covered | drop_columns |
| DataFrame | dropna | ✅ Covered | dropna |
| DataFrame | fillna (single value) | ✅ Covered | fillna |
| DataFrame | limit | ✅ Covered | limit |
| DataFrame | withColumnRenamed | ✅ Covered | with_column_renamed |
| Array/List | array, array_contains, element_at, size/array_size, array_join, array_sort, array_slice, explode; array_position, array_remove, posexplode (implemented) | ✅ Covered | array_contains, element_at, array_size |
| Windows | first_value, last_value, percent_rank | ✅ Covered | first_value_window, last_value_window, percent_rank_window |
| Windows | cume_dist, ntile, nth_value | ✅ Covered | cume_dist_window, ntile_window, nth_value_window (multi-step workaround in harness) |
| Strings | regexp_extract_all, regexp_like | ✅ Covered | regexp_extract_all, regexp_like |
| Strings | repeat, reverse, instr, lpad, rpad | ✅ Covered | string_repeat_reverse, string_lpad_rpad |
| Strings | mask, translate, substring_index; soundex, levenshtein, crc32, xxhash64 (Phase 8) | ✅ Covered | string_mask, string_translate, string_substring_index, string_soundex, string_levenshtein, string_crc32, string_xxhash64 |
| Strings (Phase 13) | ascii, format_number, overlay, position, char, chr, base64, unbase64, sha1, sha2, md5 | ✅ Implemented | string_ascii, string_format_number |
| Strings (Phase 16) | regexp_count, regexp_instr, regexp_substr, split_part, find_in_set, format_string, printf | ✅ Covered | regexp_count, regexp_substr, regexp_instr, split_part, find_in_set, format_string |
| Datetime (Phase 17) | unix_timestamp, from_unixtime, make_date, timestamp_seconds/millis/micros, unix_date, date_from_unix_date | ✅ Covered | unix_timestamp, from_unixtime, make_date, timestamp_seconds, timestamp_millis, timestamp_micros, unix_date, date_from_unix_date |
| Math (Phase 17) | pmod, factorial | ✅ Covered | pmod, factorial |
| Array | array_sum, array_exists, forall, filter, transform; array_flatten, array_repeat (Phase 8); array_compact (Phase 13) | ✅ Implemented | array_sum |
| Map | create_map, map_keys, map_values, map_entries, map_from_arrays (Phase 8) | ✅ Implemented | No fixture yet |
| JSON | get_json_object, from_json, to_json (Phase 10) | ✅ get_json_object covered | json_get_json_object |
| Math | sqrt, pow, exp, log | ✅ Covered | math_sqrt_pow |
| GroupBy | first, last, approx_count_distinct in agg | ✅ Covered | groupby_first_last |
| GroupBy (Phase 19) | any_value, bool_and, bool_or, product, collect_list, collect_set, count_if, percentile, max_by, min_by | ✅ Covered | groupby_any_value, groupby_product |
| Misc (Phase 19) | try_divide, try_add, try_subtract, try_multiply, width_bucket, elt, bit_length, typeof | ✅ Covered | try_divide, width_bucket |
| DataFrame | replace, crossJoin, describe, subtract, intersect | ✅ Covered | replace, cross_join, describe, subtract, intersect |
| SQL | SparkSession::sql() (optional sql feature) | ✅ Implemented | No fixture (SQL translated to DataFrame ops; parity via DataFrame fixtures) |
| Datetime | year, month, day, to_date, date_format; current_date, date_add, hour, etc. | ✅ Covered | date_add_sub, datediff, datetime_hour_minute |
| DataFrame (Phase 12) | first, head, offset, sample, to_json, summary, stat, select_expr, freq_items, crosstab, melt, etc. (Rust + PyO3) | ✅ first/head/offset/summary covered | first_row, head_n, offset_n, summary; additional Phase 12 ops implemented, fixtures TBD |
| DataFrame (Phase D) | createOrReplaceTempView, corr(col1,col2), cov(col1,col2), toDF/toJSON/toPandas, columns, cache, hint, repartitionByRange, sortWithinPartitions, sameSemantics, semanticHash, isLocal, inputFiles, writeTo (stub) | ✅ Implemented | Python: test_phase_d_dataframe_methods; table read via read_table fixture |
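
The "Null semantics" and "Joins" rows in the matrix rest on SQL three-valued logic. As an illustration only, here is a pure-Python model of that behaviour, with None standing in for NULL. It is not the robin-sparkless implementation, and the function names sql_eq, eq_null_safe, filter_rows and inner_join are made up for this sketch.

```python
def sql_eq(a, b):
    """SQL `=`: the result is NULL (None) when either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def eq_null_safe(a, b):
    """eqNullSafe: always True or False; NULL matches only NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def filter_rows(rows, predicate):
    """A filter keeps a row only when the predicate is exactly True (NULL drops it)."""
    return [row for row in rows if predicate(row) is True]


def inner_join(left, right, key):
    """Inner equi-join: NULL keys never match; duplicate keys pair up many-to-many."""
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in right
        if sql_eq(l_row[key], r_row[key]) is True
    ]


if __name__ == "__main__":
    left = [{"k": 1, "a": "x"}, {"k": None, "a": "y"}, {"k": 1, "a": "z"}]
    right = [{"k": 1, "b": "p"}, {"k": None, "b": "q"}, {"k": 1, "b": "r"}]
    # Two left rows with k=1 times two right rows with k=1; NULL keys are excluded.
    print(len(inner_join(left, right, "k")))  # prints 4
```

The same model explains why null_in_filter and join_null_keys need their own fixtures: a predicate that evaluates to NULL behaves like False for row selection, yet it is not equal to False.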

Fixture Index

| Fixture | What it covers |
| --- | --- |
| filter_age_gt_30 | Filter + select + orderBy (baseline) |
| filter_and_or | AND/OR precedence + parentheses |
| filter_nested | Nested boolean logic |
| filter_not | NOT / negation |
| groupby_count | groupBy + count + orderBy |
| groupby_with_nulls | groupBy with NULLs |
| groupby_sum | groupBy + sum |
| groupby_avg | groupBy + avg |
| groupby_min | groupBy + min |
| groupby_max | groupBy + max |
| groupby_null_keys | groupBy with NULL keys |
| groupby_single_row_groups | groupBy with single-row groups (each key once) |
| groupby_single_group | groupBy with single group (all same key) |
| join_null_keys | inner join with NULL join keys (nulls excluded) |
| join_duplicate_keys | inner join with duplicate keys (multiple matches) |
| case_insensitive_columns | case-insensitive column resolution (filter/select/orderBy with mixed-case names) |
| read_csv | CSV read path + operations |
| read_parquet | Parquet read path + operations |
| read_json | JSON read path + operations |
| read_csv_with_options | spark.read.option("header","true").csv(path) with reader_options |
| read_table | spark.read.table("name") via table_source (temp view) |
| with_logical_column | Logical columns/expressions in withColumn |
| with_arithmetic_logical_mix | Mixed arithmetic + comparison in withColumn |
| when_otherwise | when/then/otherwise |
| when_then_otherwise | chained when |
| coalesce | coalesce null handling |
| null_comparison_equality | NULL equality/inequality semantics |
| null_comparison_ordering | NULL ordering semantics |
| null_safe_equality | eqNullSafe semantics |
| null_in_filter | NULLs in filter predicates |
| type_coercion_numeric | int/double comparison coercion |
| type_coercion_mixed | int+double arithmetic coercion |
| inner_join | inner join on dept_id |
| left_join | left join + orderBy |
| right_join | right join + orderBy |
| outer_join | outer join + orderBy |
| groupby_multi_agg | groupBy + multiple aggregations in one agg() |
| groupby_stddev_count_distinct | groupBy + stddev and count_distinct in agg |
| row_number_window | row_number() over partition by dept order by salary desc |
| rank_window | rank() over partition with ties |
| lag_lead_window | lag and lead over partition |
| string_upper_lower | upper(), lower() |
| string_substring | substring() 1-based |
| string_concat | concat(), concat_ws() |
| string_length_trim | length(), trim() in withColumn |
| union_all | union (vertical stack, same schema) |
| union_by_name | unionByName (align columns by name) |
| distinct | distinct (drop duplicate rows) |
| drop_columns | drop(columns) |
| dropna | dropna (drop rows with nulls) |
| fillna | fillna (fill nulls with value) |
| limit | limit(n) |
| with_column_renamed | withColumnRenamed(old, new) |
| array_contains | split + array_contains(col, lit) |
| element_at | split + element_at(col, 1-based index) |
| array_size | split + size(col) |
| first_value_window | first_value over partition |
| last_value_window | last_value over partition |
| percent_rank_window | percent_rank over partition |
| cume_dist_window | cume_dist over partition |
| ntile_window | ntile(n) over partition |
| nth_value_window | nth_value over partition |
| regexp_like | regexp_like(col, pattern) boolean match |
| regexp_extract_all | regexp_extract_all(col, pattern) list of matches |
| string_repeat_reverse | repeat(col, n), reverse(col) |
| string_lpad_rpad | lpad(col, len, pad), rpad(col, len, pad) |
| math_sqrt_pow | sqrt(col), pow(col, exp) |
| groupby_first_last | groupBy + first(name), last(name) |
| groupby_any_value | groupBy + any_value(column) |
| groupby_product | groupBy + product(column) |
| try_divide | try_divide(col, col) – null on divide-by-zero |
| width_bucket | width_bucket(value, min, max, num_bucket) |
| cross_join | crossJoin (cartesian product) |
| describe | describe() summary statistics |
| summary | summary() (same as describe) |
| replace | replace(column, old_value, new_value) |
| subtract | subtract (set difference) |
| intersect | intersect (set intersection) |
| first_row | first() – first row as one-row DataFrame |
| head_n | head(n) – first n rows |
| offset_n | offset(n) – skip first n rows |
| string_mask | mask(col) – replace upper/lower/digit with X/x/n |
| string_translate | translate(col, from_str, to_str) |
| string_substring_index | substring_index(col, delim, count) before/after nth delim |
| array_sum | array(cols) + array_sum(col) |
| json_get_json_object | get_json_object(col, '$.path') |
| date_add_sub | date_add(col('d'), 7), date_sub(col('d'), 3) |
| datediff | datediff(col('end'), col('start')) |
| datetime_hour_minute | hour(col('ts')), minute(col('ts')) with timestamp input |
| string_soundex | soundex(col('name')) |
| string_levenshtein | levenshtein(col('a'), col('b')) |
| string_crc32 | crc32(col('s')) |
| string_xxhash64 | xxhash64(col('s')) |
| string_ascii | ascii(col('name')) → first-char code point |
| string_format_number | format_number(col('value'), 2) → fixed-decimal string |
| phase15_aliases_nvl_isnull | nvl, nvl2, isnull, isnotnull (Phase 15) |
| string_left_right_replace | left, right, replace, startswith, endswith, contains, like, ilike, rlike |
| math_cosh_cbrt | cosh, sinh, tanh, acosh, asinh, atanh, cbrt, expm1, log1p, log10, log2, rint, hypot |
| array_distinct | array_distinct(col) – JSON fixture may be skipped; Python tests in tests/dataframe/test_issue_415_array_distinct*.py and test_issue_439_* run in main suite |
| regexp_count | regexp_count(col, pattern) – count non-overlapping matches |
| regexp_substr | regexp_substr(col, pattern) – first match substring |
| regexp_instr | regexp_instr(col, pattern) – 1-based position of first match |
| split_part | split_part(col, delim, part_num) – 1-based part of split |
| find_in_set | find_in_set(col('str'), col('set')) – 1-based index in comma-delimited list |
| format_string | format_string('%d %s', col('a'), col('b')) – printf-style formatting |
| unix_timestamp | unix_timestamp(col), unix_timestamp(col, format) – string to seconds |
| from_unixtime | from_unixtime(col), from_unixtime(col, format) – seconds to formatted string |
| make_date | make_date(year, month, day) – build date from parts |
| timestamp_seconds | timestamp_seconds(col) – seconds epoch to timestamp |
| timestamp_millis | timestamp_millis(col) – millis epoch to timestamp |
| timestamp_micros | timestamp_micros(col) – micros epoch to timestamp |
| unix_date | unix_date(col) – date to days since epoch |
| date_from_unix_date | date_from_unix_date(col) – days to date |
| pmod | pmod(a, b) – positive modulus |
| factorial | factorial(n) – n! for n 0..20 |
| with_bit_ops | bit operations (bit_and, bit_or, bit_xor, bit_count, bit_get) via withColumn |

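
Several index entries encode semantics that are easy to get wrong when porting from Rust operators: pmod (positive modulus), try_divide (null on divide-by-zero), factorial (defined for 0..20) and substring (1-based). The reference model below is a pure-Python sketch for reasoning about expected fixture values, not the library's code. It makes two assumptions beyond what the index states: pmod is only modelled for a positive divisor, and factorial returns None outside 0..20. substring is modelled for positions >= 1 only.

```python
import math


def truncated_rem(a, n):
    """Remainder that takes the sign of the dividend, like `%` in Rust or Java."""
    r = abs(a) % abs(n)
    return -r if a < 0 else r


def pmod(a, n):
    """Positive modulus for a positive divisor n: the result lies in [0, n)."""
    r = truncated_rem(a, n)
    return r + n if r < 0 else r


def try_divide(a, b):
    """Division that yields None (NULL) instead of failing on a zero divisor."""
    if a is None or b is None or b == 0:
        return None
    return a / b


def factorial(n):
    """n! for 0 <= n <= 20; None outside that range (assumed NULL behaviour)."""
    if n is None or not 0 <= n <= 20:
        return None
    return math.factorial(n)


def substring(s, pos, length):
    """1-based substring for pos >= 1 (zero and negative positions not modelled)."""
    if s is None:
        return None
    return s[pos - 1 : pos - 1 + length]


if __name__ == "__main__":
    print(truncated_rem(-7, 3), pmod(-7, 3))  # prints -1 2
    print(try_divide(1, 0))                   # prints None
    print(substring("Spark", 2, 3))           # prints par
```

The upper bound of 20 matches the largest factorial that fits in a signed 64-bit integer.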
  • Add more join edge-case fixtures (e.g. left/outer with null keys) if needed.
  • ROADMAP Phases 16–27: Phases 18–19 completed. Phases 20–24 (full parity in 5 parts), Phase 25 (readiness for post-refactor merge), Phase 26 (publish crate on crates.io), Phase 27 (Sparkless integration, 200+ tests). See ROADMAP.md, GAP_ANALYSIS_SPARKLESS_3.28.md.

Sparkless Test Conversion

Sparkless (github.com/eddiethedean/sparkless) has 270+ JSON expected outputs in tests/expected_outputs/. These can drive robin-sparkless parity tests via a fixture converter that maps the Sparkless JSON format to the robin-sparkless fixture format. See SPARKLESS_INTEGRATION_ANALYSIS.md §4 for:

  • Fixture format comparison (input_data vs input/rows; expected_output vs expected)
  • Conversion steps per test
  • Priority order: parity/dataframe, parity/functions, then parity/sql
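
To make the key mapping concrete, here is a deliberately small sketch of what one conversion step might look like. Only the key names input_data, expected_output, input, rows and expected come from the comparison above. The assumption that Sparkless uses the first pair and robin-sparkless the second, the "name" field, and the exact nesting of "rows" under "input" are all guesses for illustration; SPARKLESS_INTEGRATION_ANALYSIS.md §4 is the authority on the real layout. The function name convert_sparkless_fixture is hypothetical.

```python
def convert_sparkless_fixture(name, sparkless):
    """Map a Sparkless expected-output dict to an ASSUMED robin-sparkless layout.

    Assumed source keys: "input_data", "expected_output".
    Assumed target keys: "input" -> {"rows": ...}, "expected".
    """
    missing = [k for k in ("input_data", "expected_output") if k not in sparkless]
    if missing:
        raise KeyError(f"{name}: missing Sparkless keys {missing}")
    return {
        "name": name,  # hypothetical field, used here only to label the fixture
        "input": {"rows": sparkless["input_data"]},
        "expected": sparkless["expected_output"],
    }


if __name__ == "__main__":
    source = {
        "input_data": [{"age": 31}, {"age": 20}],
        "expected_output": [{"age": 31}],
    }
    print(convert_sparkless_fixture("filter_age_gt_30", source)["input"]["rows"])
```

A real converter would also need to translate the operations themselves, which this sketch leaves out.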