Sparkless Parity Status (Phase 5)¶
This doc records pass/fail counts and failure reasons for fixtures converted from Sparkless expected_outputs (see CONVERTER_STATUS.md).
How to run¶
This will: (1) convert Sparkless expected_outputs to tests/fixtures/converted/ (with --dedupe to skip scenarios already in hand-written fixtures), (2) regenerate each converted fixture’s expected from PySpark via tests/regenerate_expected_from_pyspark.py (requires pip install pyspark), (3) run cargo test pyspark_parity_fixtures. Parity is thus Robin vs PySpark, not Sparkless.
Or run parity only (hand-written + tests/fixtures/converted/*.json):
Or run phase-specific parity (fixtures from tests/fixtures/phase_manifest.json):
PARITY_PHASE=a cargo test pyspark_parity_fixtures
make test-parity-phase-a # Phase A: signature alignment
make test-parity-phase-b # Phase B: high-value functions
make test-parity-phase-c # Phase C: Reader/Writer
make test-parity-phase-d # Phase D: DataFrame methods
make test-parity-phase-e # Phase E: SparkSession/Catalog (no fixtures)
make test-parity-phase-f # Phase F: behavioral
make test-parity-phase-g # Phase G: fixture expansion
make test-parity-phases # All phases
Keeping expectations aligned with PySpark¶
-
Hand-written fixtures: Expected values in
Usetests/fixtures/*.json(top-level, excludingconverted/andplans/) can be refreshed from PySpark using the regeneration script. Run:--dry-runto print diffs without writing. The script builds a PySpark DataFrame from each fixture’sinput.schemaandinput.rows, applies the sameoperations(filter, select, withColumn, orderBy, groupBy, window, join, etc.), then overwrites that fixture’sexpected.schemaandexpected.rows. Fixtures with unsupported expressions or ops are left unchanged. After regenerating, runcargo test pyspark_parity_fixtures; any failure indicates Robin vs PySpark divergence (fix Robin or leave fixture unregenerated). Requires PySpark and Java 17+. -
Converted fixtures: Run
python tests/regenerate_expected_from_pyspark.py tests/fixtures/converted --include-skippedto process fixtures withskip: true. On success, skip is removed. Requires PySpark and Java 17+ (e.g.export JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home). -
Python tests: Tests in
tests/dataframe/test_dataframe_parity.py(and other tests undertests/) assert behaviour that matches PySpark. Expectations are predetermined (no PySpark at test runtime). For example,test_create_dataframe_from_rows_schema_pyspark_paritycompares Robin output to a fixed expected list derived from PySpark 3.5. -
CI: A CI job can run the regenerator in
--dry-runand fail if current expected ≠ PySpark result (requires PySpark and Java 17+ in the environment).
Pass/fail summary¶
| Source | Converted | Passing | Failing | Skipped |
|---|---|---|---|---|
Hand-written (tests/fixtures/*.json) |
— | 212 | 0 | 0 |
Sparkless converted (tests/fixtures/converted/*.json) |
226 | 216+ | 0 | ~7 (converter fixes applied; re-run sparkless-parity to verify) |
PySpark extracted (tests/fixtures/pyspark_extracted/*.json) |
~22 | varies | — | Run make batch-regenerate-extracted to populate expected |
Target: 200+ tests passing (hand-written + converted). Current: 212+ passing (hand-written; 11 previously skipped now enabled: expect_error mode, JVM stubs, mock_dates, timezone UTC, octet_length, with_rand_seed, with_unix_micros). Phase G ✅ COMPLETED. Phase C ✅: DataFrameReader/Writer parity (read_csv_with_options, read_table fixtures). Phase D ✅: DataFrame methods (createOrReplaceTempView, corr/cov, toDF/toJSON/toPandas, columns, cache, stubs). Phase E ✅: SparkSession & Catalog stubs (catalog, conf, range, version, newSession, stop, getActiveSession, getDefaultSession, udf; Catalog 27 methods). Fixes applied: first_row, hex/url_encode, replace, case_insensitive_columns, describe/summary, split semantics (regex for |), timestamp format normalization; 11 hand-written fixtures remain skipped (see Skipped fixtures). Phase 25 completed: plan interpreter, expression interpreter (all scalar functions), 3 plan fixtures (tests/fixtures/plans/: filter_select_limit, join_simple, with_column_functions), create_dataframe_from_rows. Phase 22: datetime extensions. Phase 21: ordering, aggregates, numeric. Phase 19–18: aggregates, array/map/struct. Phase 17–15: datetime/unix, regexp, aliases, string, math. Phase 27 (Sparkless integration) target: 200+ Sparkless tests passing with robin backend (after Phase 26 publish Rust crate). CI runs parity on hand-written (and optionally converted) fixtures; when Sparkless repo is available, run make sparkless-parity and update this doc.
When Sparkless repo is available¶
- Set
export SPARKLESS_EXPECTED_OUTPUTS=/path/to/sparkless/tests/expected_outputs. - Run
make sparkless-parity(converts with--dedupe, regenerates expected from PySpark, then runscargo test pyspark_parity_fixtures). - Update the table above with converted count and passing/failing/skipped for
tests/fixtures/converted/*.json. - For any failing fixture, add a row under "Failure reasons" and use
skip: true+skip_reasonin the fixture if it is a known unsupported or semantic difference. Because expected is PySpark-derived, failures indicate Robin vs PySpark divergence.
Window parity (Sparkless issues #22–#35)¶
Window functions are implemented and covered by hand-written parity fixtures: row_number_window, rank_window, lag_lead_window, first_value_window, last_value_window, percent_rank_window, cume_dist_window, ntile_window, nth_value_window. The harness in tests/parity.rs supports row_number, rank, dense_rank, lag, lead, first_value, last_value, percent_rank, cume_dist, ntile. Sum-over-window and approx_count_distinct window are supported via the same harness. When running Sparkless tests with robin backend, ensure the adapter uses these APIs.
Array parity (Sparkless issues #36–#49)¶
Array functions are implemented and covered by hand-written parity fixtures: array_contains, array_distinct (first-occurrence order), array_join, array_position, array_remove, array_sort, array_union, element_at, explode, size/array_size, plus array_append, array_insert, array_intersect, array_except, array_prepend, array_sum, arrays_overlap, arrays_zip, zip_with. Converted fixtures in tests/fixtures/converted/ for array_* are currently skipped until expected is regenerated from PySpark. When running Sparkless tests with robin backend, ensure the adapter uses these APIs.
Datetime / type parity (Sparkless issues #50–#57)¶
Datetime and cast/alias behaviour are implemented and covered by hand-written parity fixtures: date_add_sub, datediff, datetime_hour_minute, datetime_quarter_week, with_extract, with_weekday, with_dayname, make_date, unix_timestamp, unix_date, timestampadd_test, months_between_round_off, make_timestamp_test, type_coercion_*, and cast/alias in expression tests. Functions include date_add, date_sub, date_format, dayofmonth, dayofweek, month, to_date, and cast/alias in select and filter. When running Sparkless tests with robin backend, ensure the adapter uses these APIs.
String / binary parity batch 1 (Sparkless issues #86–#99)¶
String and binary functions in this batch are implemented and covered by hand-written parity fixtures: ascii, base64/unbase64, concat_ws (string_concat), crc32 (string_crc32), hex, initcap, levenshtein (string_levenshtein), repeat, reverse, soundex, and related tests. Fixtures include string_crc32, string_levenshtein, string_concat, string_upper_lower, string_left_right_replace, string_length_trim, string_substring_index, string_lpad_rpad, string_translate, string_xxhash64, etc. When running Sparkless tests with robin backend, ensure the adapter uses these APIs.
String / binary parity batch 2 (Sparkless issues #100–#116)¶
String and binary functions in this batch are implemented and covered by hand-written parity fixtures: string_length/length, like/rlike (with escape), lower/upper, lpad/rpad, regexp_extract, rtrim/trim, substring/substr, translate, substring_index, xxhash64. Fixtures include string_length_trim, string_upper_lower, string_lpad_rpad, string_substring, string_substring_index, string_translate, string_xxhash64, like_escape_char, ilike_escape_char, etc. When running Sparkless tests with robin backend, ensure the adapter uses these APIs.
SQL / session parity (Sparkless issues #122–#140)¶
SQL (via SparkSession::sql() with optional sql feature) and session behaviour are implemented: the SQL translator maps to DataFrame ops; cache_table/uncache_table, CASE WHEN, IN, JOINs, LIKE, ORDER BY, subquery, UNION, create_table_from_dataframe, insert_from_select, basic/filtered select, and group_by are supported via the translator and session catalog. Parity is exercised via DataFrame fixtures (filter, join, orderBy, groupBy, etc.) and plan fixtures. When running Sparkless tests with robin backend, ensure the adapter uses the SQL API and session catalog as documented.
Features: createDataFrame and pivot (Sparkless issues #151, #156)¶
createDataFrame (#151, #372): Implemented. Python: spark.createDataFrame(data, schema=None) accepts list of dicts (infer), list of tuples with column names, or explicit schema (list of (name, dtype_str) or StructType). Rust: create_dataframe(data, column_names) for 3-tuples; create_dataframe_from_rows(rows, schema) for arbitrary schema. Supported dtypes include bigint, double, string, boolean, date, timestamp.
pivot (#156): Implemented on GroupedData (groupBy(...).pivot(...).sum/avg/...). DataFrame.pivot(pivot_col, values) without groupBy may still differ from PySpark; use groupBy().pivot() or crosstab(col1, col2).
Failure reasons (converted fixtures)¶
When a converted fixture fails, classify and document here:
| Fixture name | Reason | Notes |
|---|---|---|
| aggregation | converter: select avg_salary before groupBy+agg | fix convert_sparkless_fixtures.py |
| group_by | converter: select count before groupBy+agg | fix convert_sparkless_fixtures.py |
| filter_select_groupby_agg | converter: filter col(name)>0 invalid; select avg_salary before agg | name is string; op order wrong |
| filter_operations | converter: filter col(name)>0 invalid | name is string |
| filter_with_boolean | converter: filter col(name)>0 invalid | name is string |
| select_expr_groupby_agg_orderby | converter: select level,count,avg_salary before groupBy+agg | op order wrong |
| select_with_alias | converter: select user_id,full_name but schema has id,name | needs withColumn alias |
| withcolumn_filter_orderby | converter: filter col(name)>0 invalid; expected bonus vs ops | name string; expr mismatch |
| group_by | converter: expected schema inconsistent with groupBy id agg count | |
| semi_join | converter: expected row count inconsistent with left_semi semantics | |
| select_expr_groupby_agg_orderby | coalesce(when(...), lit(...)) requires coalesce to accept when() |
Reason categories:
- converter: Wrong op/schema/rows from conversion; fix in
convert_sparkless_fixtures.pyor Sparkless metadata. - unsupported: Missing function or expression in robin-sparkless (Phase 6).
- semantic: Implemented but result differs (e.g. null handling); document and optionally add
skip: true+skip_reasonin the fixture.
Skipped fixtures¶
Fixtures with "skip": true in JSON are not run. As of June 2026, one hand-written fixture remains skipped: with_rand_seed.json (RNG algorithm differs from Spark XORShiftRandom). Other fixtures that may diverge from PySpark without being skipped include window frame edge cases, hash functions, set ops column order, and non-deterministic timestamp helpers.
Closed-issue test coverage¶
| Issue range | Coverage (parity fixtures and/or Python tests) |
|---|---|
| #1–#21 (core) | Python: test_sparkless_parity_*, test_filter_with_and_or_operators; parity: filter, select, orderBy, groupBy, join, limit, first, replace, case_insensitive_columns, describe, summary |
| #22–#35 (window) | Parity: row_number_window, rank_window, lag_lead_window, cume_dist_window, first_value_window, percent_rank_window; nth_value_window, last_value_window, ntile_window (skipped: frame/ordering) |
| #36–#49 (array) | Parity: array_size, array_contains, element_at, array_union, array_except, array_intersect, array_distinct, array_append, array_prepend, array_insert, zip_with; arrays_zip (skipped) |
| #50–#57 (datetime) | Parity: date_add_sub, datediff, datetime_, make_date, type_coercion_, to_timestamp_format, make_timestamp_test; timestamp_seconds/millis/micros (UTC expected) |
| #86–#99, #100–#116 (string) | Parity: with_hex, with_url_encode, string_*, replace, like_escape_char; with_hash/string_xxhash64 (skipped) |
| #122–#140 (SQL/session) | Python: test_sql_select_where_returns_rows; parity: filter, join, groupBy plan fixtures |
| #151 (createDataFrame) | Python: test_create_dataframe_from_rows_schema_pyspark_parity, test_create_dataframe_and_collect |
| #156 (pivot) | Python: test_pivot_raises_not_implemented |
Related¶
- CONVERTER_STATUS.md – converter usage and operation mapping
- FULL_BACKEND_ROADMAP.md – Phase 5 goals