Skip to content

Roadmap: PySpark Semantics in a Rust Crate (No JVM)

Core Principle: PySpark Parity on a Polars Backend

Primary Goal: Implement a Rust crate whose behavior closely emulates PySpark's SparkSession / DataFrame / Column semantics, but runs entirely in Rust with Polars as the execution engine (no JVM, no Python runtime).

Sparkless Integration Goal: Robin-sparkless is designed to replace the backend logic of Sparkless (the Python PySpark drop-in). Sparkless would call robin-sparkless via FFI for DataFrame execution. See SPARKLESS_INTEGRATION_ANALYSIS.md for architecture mapping, structural learnings, and test conversion strategy.

Constraints: - Use Polars as the underlying dataframe/expressions engine. - Match PySpark behavior where practical (null handling, grouping, joins, expression semantics). - Stay honest about differences (and document them) when perfect parity is impossible.

Short-Term Objectives (0–1 month) ✅ COMPLETED

  1. Clarify API Surface
  2. ✅ Decided on core PySpark API surface: SparkSession.builder, createDataFrame, core DataFrame transforms/actions
  3. ✅ Implemented Rust equivalents with Rust types and error handling

  4. Minimal Parity Slice

  5. ✅ End-to-end support for PySpark-style pipelines in Rust:
    • ✅ Session creation (SparkSession::builder().get_or_create())
    • createDataFrame from simple rows (Vec<(i64, i64, String)> tuples)
    • select, filter, groupBy(...).count(), orderBy
    • show, collect, count
  6. ✅ Behavior-checked these operations against PySpark on fixtures (36 scenarios passing)

  7. Behavioral Tests

  8. ✅ Test harness implemented (tests/parity.rs):
    • ✅ Runs pipelines in PySpark via tests/gen_pyspark_cases.py
    • ✅ Runs logical equivalent through Robin Sparkless
    • ✅ Compares schemas and results with proper null/type handling
  9. ✅ JSON fixtures generated and versioned (tests/fixtures/*.json)
  10. ✅ All parity tests passing for initial slice

Medium-Term Objectives (1–3 months) ✅ COMPLETED

  1. Data Source ReadersCOMPLETED
  2. ✅ Implement CSV/Parquet/JSON readers using Polars IO
  3. ✅ Basic PySpark-like schema inference behavior (header detection, infer_schema_length)
  4. ✅ Parity tests for file reading operations (3 new fixtures: read_csv, read_parquet, read_json)

  5. Expression SemanticsCOMPLETE

  6. ✅ Basic Column and functions (col, lit, basic aggregates)
  7. ✅ String literal support in filter expressions
  8. ✅ Expand functions: when, coalesce implemented
  9. when().then().otherwise() conditional expressions
  10. coalesce() for null handling
  11. withColumn() support for adding computed columns
  12. ✅ PySpark-style null comparison semantics, including eqNullSafe and comparisons against NULL columns
  13. ✅ Basic numeric type coercion for int/double comparisons and arithmetic (via Polars expressions)
  14. ✅ Complex filter expressions with logical operators (AND, OR, NOT, &&, ||, !) and nested conditions
  15. ✅ Arithmetic expressions in withColumn (+, -, *, /) with proper operator precedence
  16. ✅ Mixed arithmetic and logical expressions (e.g., (col('a') + col('b')) > col('c'))

  17. Grouping and JoinsCOMPLETE

  18. ✅ Basic groupBy + count() working with parity tests
  19. ✅ Additional aggregates: sum, avg, min, max on GroupedData
  20. ✅ Generic agg() method for multiple aggregations
  21. ✅ Column reordering after groupBy to match PySpark order (grouping columns first)
  22. ✅ Ensure groupBy + aggregates behave like PySpark (verified via groupby_null_keys, groupby_single_group, groupby_single_row_groups; see PYSPARK_DIFFERENCES.md)
  23. ✅ Implement common join types (inner, left, right, outer) with parity fixtures
  24. ✅ Multi-agg support in GroupedData::agg() with parity fixture
  25. ✅ Window functions: row_number, rank, dense_rank, lag, lead with .over(partition_by) parity fixtures
  26. ✅ String functions: upper, lower, substring, concat, concat_ws with parity fixtures

Longer-Term Objectives (3+ months) – Full Sparkless Backend

The path to full backend replacement is planned in FULL_BACKEND_ROADMAP.md. Summary:

Phase Goal Est. Effort
1. Foundation Structural alignment, case sensitivity, fixture converter 2–3 weeks
2. High-Value Functions String (length, trim, regexp_*), datetime (to_date, date_add), math (stddev, variance) 4–6 weeks
3. DataFrame Methods union, distinct, drop, fillna, limit, withColumnRenamed 3–4 weeks
4. (Historical) PyO3 Bridge Python bindings so Sparkless can call robin-sparkless (since removed from this repo; future bindings expected out-of-tree) 4–6 weeks
5. Test Conversion Convert 50+ Sparkless tests, CI integration 2–3 weeks
6. Broad Function Parity Array (array_position, array_remove, posexplode ✅; array_repeat, array_flatten ✅ Phase 8), Map/JSON/string 6.4/window ✅ 8–12 weeks
7. SQL & Advanced SQL executor, Delta Lake, performance COMPLETED (optional features)
8. Remaining Parity COMPLETED (Feb 2026): array_repeat, array_flatten, Map (create_map, map_keys, map_values, map_entries, map_from_arrays), String 6.4 (soundex, levenshtein, crc32, xxhash64); window fixtures covered; documentation of differences
  1. Broader API Coverage (Phases 2 & 6)
  2. String: length, trim, regexp_extract, regexp_replace, split, initcap (string basics ✅ done).
  3. Datetime: to_date, date_add, date_sub, date_format, year, month, day, etc.
  4. Math: abs, ceil, floor, sqrt, stddev, variance, count_distinct.
  5. Array/Map/JSON: array_, map_, get_json_object, from_json, to_json.
  6. Function parity with Sparkless (403+); use PYSPARK_FUNCTION_MATRIX as checklist.
  7. UDF story: pure-Rust UDFs and Python UDFs (scalar and column-wise vectorized) with a session-scoped registry; minimal grouped vectorized aggregation support via pandas_udf(..., function_type="grouped_agg"). Other pandas_udf variants and UDTFs remain out of scope.

  8. DataFrame Methods (Phase 3)

  9. union, unionByName, distinct, drop, dropna, fillna, limit, withColumnRenamed.
  10. crossJoin, replace, describe, cache/persist.

  11. Historical PyO3 Bridge (Phase 4) ✅ COMPLETED

  12. Earlier phases included an optional pyo3 feature and robin_sparkless Python module with SparkSession, DataFrame, Column, and GroupedData.
  13. That historical module is gone. The current supported Python integration for Sparkless v4 lives in this repository under python/ (native extension crate sparkless-native + Python wrapper package sparkless). Other bindings can still live out-of-tree and call the Rust crate via FFI.

  14. Performance & Robustness (Phase 7) ✅ COMPLETED

    • Benchmarks: cargo bench compares robin-sparkless vs plain Polars (filter → select → groupBy).
    • Target: within ~2x of Polars for supported ops.
    • Error handling: clearer messages (column names, hints); Troubleshooting in QUICKSTART.md.
  15. Remaining Parity – Phase 10 & Phase 8COMPLETED

    • String 6.4: mask, translate, substring_index, soundex, levenshtein, crc32, xxhash64 (all implemented; Phase 8 UDFs via strsim, crc32fast, twox-hash, soundex).
    • Array extensions: array_exists, array_forall, array_filter, array_transform, array_sum, array_mean, array_flatten, array_repeat (all implemented; Phase 8 map UDFs).
    • Map (6b): create_map, map_keys, map_values, map_entries, map_from_arrays (all implemented; Map as List(Struct{key, value}); Phase 8).
    • JSON (6c): get_json_object, from_json, to_json implemented (extract_jsonpath, dtype-struct).
    • Window fixtures: percent_rank, cume_dist, ntile, nth_value covered (multi-step workaround in harness).
    • Documentation of differences: See PYSPARK_DIFFERENCES.md.
  16. Gap closure (bitmap, datetime/interval, misc, DataFrame)COMPLETED (Feb 2026)

    • Bitmap (5): bitmap_bit_position, bitmap_bucket_number, bitmap_construct_agg, bitmap_count, bitmap_or_agg.
    • Datetime/interval: make_dt_interval, make_ym_interval, to_timestamp_ltz, to_timestamp_ntz.
    • Misc: sequence, shuffle, inline, inline_outer; regr_* (9 regression aggregates); stubs for call_function, UDF/UDTF, window/session_window, HLL/sketch (see PYSPARK_DIFFERENCES.md).
    • DataFrame: cube(), rollup() with .agg(); write().mode().format().save() (parquet/csv/json); data, toLocalIterator (same as collect); persist/unpersist (no-op); stubs for rdd, foreach, foreachPartition, mapInPandas, mapPartitions, storageLevel, isStreaming, withWatermark.
    • Deferred: XML (from_xml, to_xml, schema_of_xml), XPath, sentences — documented in ROBIN_SPARKLESS_MISSING.md.

Sparkless Integration Phases (see SPARKLESS_INTEGRATION_ANALYSIS.md, FULL_BACKEND_ROADMAP.md)

  • Phase 1 – Foundation: Structural alignment (split dataframe.rs), case sensitivity, fixture converter. Prereqs done: joins, windows, strings.
  • Phase 2 – High-Value Functions: String (length, trim, regexp_*), datetime (to_date, date_add), math (stddev, variance)
  • Phase 3 – DataFrame Methods: union, unionByName, distinct, drop, dropna, fillna, limit, withColumnRenamed ✅ COMPLETED
  • Phase 4 – Historical PyO3 Bridge: Python bindings for Sparkless to call robin-sparkless ✅ COMPLETED (bindings have since been removed from this repo; future bindings are expected to live out-of-tree)
  • Phase 5 – Test Conversion: Converter extended (join, window, withColumn, union, distinct, drop, dropna, fillna, limit, withColumnRenamed); parity discovers tests/fixtures/ + tests/fixtures/converted/; make sparkless-parity (set SPARKLESS_EXPECTED_OUTPUTS); SPARKLESS_PARITY_STATUS.md for pass/fail; 159 passing (50+ target met; +10 signature-alignment fixtures) ✅ COMPLETED
  • Phase 6 – Broad Parity: Array (6a ✅; array_position, array_remove, posexplode via list.eval; array_repeat, array_flatten ✅ Phase 8), Map (6b ✅ Phase 8), JSON (6c ✅), additional string (6e ✅; 6.4 soundex/levenshtein/crc32/xxhash64 ✅ Phase 8), window extensions (6d ✅; percent_rank/cume_dist/ntile/nth_value covered).
  • Phase 7 – SQL & AdvancedCOMPLETED: Optional SQL (sql feature: spark.sql(), temp views); optional Delta (delta feature: read_delta, read_delta_with_version, write_delta); benchmarks and error-message improvements. See FULL_BACKEND_ROADMAP.md §7.
  • Phase 8 – Remaining ParityCOMPLETED (Feb 2026): array_repeat, array_flatten; Map (create_map, map_keys, map_values, map_entries, map_from_arrays); String 6.4 (soundex, levenshtein, crc32, xxhash64); window fixtures covered; documentation of differences.

Success Metrics

We know we're on track if:

  • Behavioral parity: For core operations (filter, select, orderBy, groupBy+count/sum/avg/min/max/agg, when/coalesce, basic type coercion, null semantics, joins, window functions, array and string functions, math, datetime, type/conditional), DataFrame methods (union, distinct, drop, dropna, fillna, limit, withColumnRenamed), and file readers (CSV/Parquet/JSON), PySpark and Robin Sparkless produce the same schema and data on test fixtures. Status: PASSING (159 fixtures)
  • Documentation of differences: Any divergence from PySpark semantics is called out in PYSPARK_DIFFERENCES.md (window, SQL, Delta, Phase 8).
  • Performance envelope: For supported operations, we stay within ~2x of doing the same thing directly in Polars. Status: BENCHMARKED (cargo bench; see QUICKSTART.md § Benchmarks)

Full backend targets (see FULL_BACKEND_ROADMAP.md):

Metric Current After Phase 24 (full parity) Full Backend (Phase 27)
Parity fixtures 159 180+ 180+
Functions ~283 ~280 (Sparkless 3.28) ~280
DataFrame methods ~55+ ~55+ 85
Sparkless tests passing (robin backend) 0 200+
Language bindings Out-of-tree (Sparkless or host repo) Out-of-tree Out-of-tree

Current Status (February 2026)

Completed Core Parity Slice: - ✅ SparkSession::create_dataframe for simple tuples - ✅ DataFrame::filter with simple expressions (col('x') > N, string comparisons) - ✅ DataFrame::select - ✅ DataFrame::order_by / sort - ✅ DataFrame::group_byGroupedData::count() - ✅ DataFrame::with_column for adding computed columns - ✅ DataFrame::join() for inner, left, right, outer joins - ✅ File readers: read_csv(), read_parquet(), read_json() with schema inference - ✅ Expression functions: when().then().otherwise(), coalesce() - ✅ GroupedData aggregates: sum(), avg(), min(), max(), and generic agg() - ✅ Null comparison semantics: equality/inequality vs NULL, ordering comparisons vs NULL, and eqNullSafe - ✅ Numeric type coercion for int/double comparisons and simple arithmetic - ✅ Window functions: Column::rank(), row_number(), dense_rank(), lag(), lead(), first_value, last_value, percent_rank() with .over(partition_by) - ✅ Array functions: array_size/size, array_contains, element_at, explode, array_sort, array_join, array_slice - ✅ String functions: upper(), lower(), substring(), concat(), concat_ws(), length, trim, regexp_extract, regexp_replace, regexp_extract_all, regexp_like, split - ✅ Datetime: year(), month(), day(), to_date(), date_format(format) (chrono strftime) - ✅ DataFrame methods: union, union_by_name, distinct, drop, dropna, fillna, limit, with_column_renamed - ✅ Rust API for bindings: Full surface for Sparkless or other hosts: SparkSession, create_dataframe_from_rows, create_dataframe_from_single_column (single-column schema, verify_schema), execute_plan, DataFrame::collect_as_json_rows, etc. Python bindings are maintained out-of-tree (e.g. in Sparkless). See EMBEDDING.md. - ✅ Phase 9 (high-value functions & DataFrame methods): Datetime (current_date, current_timestamp, date_add, date_sub, hour, minute, second, datediff, last_day, trunc); string (repeat, reverse, instr, lpad, rpad); math (sqrt, pow, exp, log); conditional (nvl/ifnull, nullif, nanvl); GroupedData (first, last, approx_count_distinct); DataFrame (replace, cross_join, describe, cache/persist/unpersist, subtract, intersect). - ✅ Parity test harness with 159 passing fixtures: - filter_age_gt_30: filter + select + orderBy - filter_and_or: nested boolean logic with AND/OR and parentheses - filter_nested: nested boolean logic - filter_not: NOT / negation semantics - groupby_count: groupBy + count + orderBy - groupby_with_nulls: groupBy with null values + count - groupby_sum: groupBy + sum aggregation - groupby_avg: groupBy + avg aggregation - read_csv: CSV file reading + operations - read_parquet: Parquet file reading + operations - read_json: JSON file reading + operations - with_logical_column: boolean columns / logical expressions in withColumn - with_arithmetic_logical_mix: mixed arithmetic + logical comparison in withColumn - when_otherwise: when().then().otherwise() conditional expressions - when_then_otherwise: chained when expressions - coalesce: null handling with coalesce - null_comparison_equality: null equality/inequality semantics - null_comparison_ordering: ordering comparisons vs NULL - null_safe_equality: null-safe equality (eqNullSafe) - null_in_filter: null handling in filter predicates - type_coercion_numeric: int vs double comparison coercion - type_coercion_mixed: int + double arithmetic coercion - inner_join, left_join, right_join, outer_join: join operations - groupby_multi_agg: multiple aggregations in one agg() call - row_number_window, rank_window, lag_lead_window: window functions - string_upper_lower, string_substring, string_concat, string_length_trim: string functions - union_all, union_by_name, distinct, drop_columns, dropna, fillna, limit, with_column_renamed: DataFrame methods - array_contains, element_at, array_size: array functions (split + array_contains/element_at/size) - regexp_like, regexp_extract_all: string regex fixtures - string_repeat_reverse, string_lpad_rpad: repeat, reverse, lpad, rpad - math_sqrt_pow: sqrt, pow - groupby_first_last: first, last aggregates - cross_join, describe, replace, subtract, intersect: DataFrame methods - Phase 12: first_row, head_n, offset_n: first/head/offset DataFrame methods - Phase 20: groupby_median, with_bround: median, bround; OrderBy with nulls_first - Phase 21: with_btrim, with_hex, with_conv, with_str_to_map, arrays_overlap, arrays_zip - Phase 22: with_dayname, with_weekday, with_extract, with_unix_micros, make_timestamp_test, timestampadd_test, from_utc_timestamp_test - Phase 23: with_isin, with_url_decode, with_url_encode, json_array_length_test, with_hash, with_shift_left - ✅ Phase 25 (readiness for post-refactor merge): Plan interpreter (execute_plan(session, data, schema, plan) in Rust; Python robin_sparkless.execute_plan(data, schema, plan_json)); expression interpreter (all scalar functions; serialized expr trees → Polars Expr in src/plan/expr.rs); logical plan schema (LOGICAL_PLAN_FORMAT.md); plan fixtures (tests/fixtures/plans/filter_select_limit.json, join_simple.json, with_column_functions.json) with plan_parity_fixtures test; create_dataframe_from_rows (Rust + Python) for arbitrary schema and list of dicts/rows.

Next Steps to Full Sparkless Parity

To reach full Sparkless parity (robin-sparkless as a complete backend replacement), the remaining work is organized into phases 12–27 below (phases 9–25 complete). Phases 26–27 are crate publish and Sparkless integration. Reference: FULL_BACKEND_ROADMAP.md, PYSPARK_FUNCTION_MATRIX, GAP_ANALYSIS_SPARKLESS_3.28.md.

Phase overview

Phase Goal Est. effort
9 High-value functions + DataFrame methods COMPLETED
10 Complex types (Map, JSON, array_repeat, string 6.4) + window fixture simplification COMPLETED
11 Parity scale (93 fixtures), harness date/datetime, converter + CI COMPLETED
12 DataFrame methods parity (~55+ methods; freq_items, approx_quantile, crosstab, melt, sample_by, no-ops; PyO3 stat/na/to_pandas) COMPLETED
13 Functions batch 1: string, binary, collection (~80 new → ~200 total) COMPLETED
14 Functions batch 2: math, datetime, type/conditional (~100 new → ~300 total) COMPLETED
15 Functions batch 3: remaining functions + fixture growth (88 → 150+ fixtures, 403 functions) COMPLETED
16 Remaining gaps 1: string/regex (regexp_count, regexp_instr, regexp_substr, split_part, find_in_set, format_string, printf) COMPLETED
17 Remaining gaps 2: datetime/unix (unix_timestamp, from_unixtime, make_date, timestamp_*, pmod, factorial) COMPLETED
18 Remaining gaps 3: array/map/struct (array_append, array_prepend, array_insert, array_except/intersect/union, map_concat, map_from_entries, map_contains_key, get, named_struct, struct, map_filter, map_zip_with, zip_with) COMPLETED
19 Remaining gaps 4: aggregates and try_* (any_value, bool_and, bool_or, count_if, max_by, min_by, percentile, product, try_add/divide/subtract/multiply/sum/avg, try_element_at, width_bucket, elt, bit_length, typeof) COMPLETED
20 Full parity 1: ordering, aggregates, numeric COMPLETED
21 Full parity 2: string, binary, type, array/map/struct COMPLETED
22 Full parity 3: datetime extensions COMPLETED
23 Full parity 4: JSON, CSV, URL, misc COMPLETED
24 Full parity 5: bit, control, JVM stubs, random, crypto COMPLETED
25 Readiness for post-refactor merge (plan interpreter, expression interpreter for all scalar functions, plan schema, 3 plan fixtures, create_dataframe_from_rows) COMPLETED
26 Prepare and publish robin-sparkless as a Rust crate (crates.io, API stability, docs, release) 2–3 weeks
27 Sparkless integration (BackendFactory "robin", 200+ tests), FFI/bindings surface 4–6 weeks

Phase 9 – High-value functions & DataFrame methods (4–6 weeks) ✅ COMPLETED

Goal: Complete remaining high-use functions and DataFrame methods so most Sparkless pipelines can run without falling back.

  • Datetime: current_date, current_timestamp, date_add, date_sub, hour, minute, second, datediff, last_day, trunc ✅. (to_date, date_format, year, month, day ✅)
  • String: repeat, reverse, instr/locate, lpad/rpad ✅. (length, trim, regexp_*, split, initcap ✅)
  • Math: abs, ceil, floor, round, sqrt, pow, exp, log ✅; aggregates first, last, approx_count_distinct ✅. (stddev, variance, count_distinct ✅)
  • Conditional/null: ifnull/nvl, nullif, nanvl ✅.
  • DataFrame methods: replace, crossJoin, describe, cache/persist/unpersist, subtract, intersect ✅.

Outcome: Functions ~37+ → ~85+; DataFrame methods ~25 → ~35+. Parity harness extended with parse_with_column_expr for new functions and new Operation variants (Replace, CrossJoin, Describe, Subtract, Intersect). New fixtures: string_repeat_reverse, string_lpad_rpad, math_sqrt_pow, groupby_first_last, cross_join, describe, replace, subtract, intersect.


Phase 10 – Complex types & window parity ✅ COMPLETED

Goal: Implement Map, JSON, remaining array/string functions, and enable full window parity fixtures.

  • Array: array_repeat, array_flatten (Phase 8, Expr::map UDFs); array_exists, array_forall, array_filter, array_transform, array_sum, array_mean ✅.
  • Map: create_map, map_keys, map_values, map_entries, map_from_arrays (Phase 8; Map as List(Struct{key, value})) ✅.
  • JSON: get_json_object, from_json, to_json ✅; optional base64, unbase64 (deferred).
  • String 6.4: mask, translate, substring_index (Phase 10); soundex, levenshtein, crc32, xxhash64 (Phase 8 UDFs) ✅.
  • Window: percent_rank, cume_dist, ntile, nth_value parity fixtures covered via multi-step workaround; documented in PYSPARK_DIFFERENCES.md.

Optional (document only): SQL extensions (subqueries, CTEs, HAVING, DDL); Delta schema evolution, MERGE. See PYSPARK_DIFFERENCES.md.

Outcome: Functions ~85+ → ~120+; Phase 8 and Phase 10 “remaining parity” items implemented.


Phase 11 – Parity scale & test conversion ✅ COMPLETED

Goal: Grow parity coverage and integrate Sparkless test conversion into CI.

  • Parity harness: Date/datetime and boolean column support in fixture input (tests/parity.rs); dtype_to_string and collect_to_simple_format for Date/Datetime/Int8; types_compatible for date/timestamp.
  • Fixture growth: 73 → 80 (Phase 11) → 82 (Phase 12–13) → 88 (Phase 14–15) → 93 (Phase 16) fixtures (Phase 16: regexp_count, regexp_substr, regexp_instr, split_part, find_in_set, format_string; array_distinct skipped).
  • Converter: Date/timestamp type mapping added in tests/convert_sparkless_fixtures.py.
  • CI: .github/workflows/ci.yml runs format, clippy, audit, deny, and Rust tests (including pyspark_parity_fixtures). Locally: make check-full (Rust-only). Python tests and bindings live out-of-tree.
  • Docs: TEST_CREATION_GUIDE.md documents date/timestamp fixture format; SPARKLESS_PARITY_STATUS.md updated with CI note.

Outcome: 93 parity fixtures passing; CI runs parity; SPARKLESS_PARITY_STATUS kept current.


Phase 12 – DataFrame methods parity (4–6 weeks) ✅ COMPLETED

Goal: Implement the remaining ~50–60 DataFrame methods so robin-sparkless reaches the 85-method target before integration.

  • Implemented: sample, random_split, first, head, take, tail, is_empty, to_df, stat() (cov, corr), summary (alias describe), to_json, explain, print_schema, checkpoint, local_checkpoint, repartition, coalesce, select_expr, col_regex, with_columns, with_columns_renamed, na() (fill, drop), to_pandas, offset, transform, freq_items, approx_quantile, crosstab, melt, except_all, intersect_all, sample_by (stratified sampling); Spark no-ops: hint, is_local, input_files, same_semantics, semantic_hash, observe, with_watermark.
  • Parity: Fixtures first_row.json, head_n.json, offset_n.json for first/head/offset.
  • PyO3: All of the above exposed on Python DataFrame where applicable; random_split, summary, to_df, select_expr, col_regex, with_columns, with_columns_renamed, stat() (returns DataFrameStat with cov/corr), na() (returns DataFrameNa with fill/drop), to_pandas (same as collect; for use with pandas.DataFrame.from_records).
  • Outcome: DataFrame methods ~35 → ~55+; ready for function batches (Phase 13).

Phase 13 – Functions batch 1: string, binary, collection (4–6 weeks) ✅ COMPLETED

Goal: Add ~80 functions in string, binary, and collection categories toward ~200 total functions.

  • String ✅ (partial): ascii, format_number, overlay, position, char, chr implemented; base64, unbase64 (base64 crate). Remaining: format_string, encode/decode, etc.
  • Binary ✅ (partial): sha1, sha2(bit_length), md5 (string in → hex string out; sha1, sha2, md5 crates). AES_* deferred.
  • Collection ✅ (partial): array_compact (drop nulls from list). Remaining: array_distinct, map extensions, etc.
  • Parity: Parser branches for all new functions; fixtures string_ascii, string_format_number (82 fixtures at Phase 13 completion; 88 after Phase 15).
  • PyO3: Module and Column methods for ascii, format_number, overlay, position, char, chr, base64, unbase64, sha1, sha2, md5, array_compact.
  • Outcome: Functions ~120 → ~130+; 82 parity fixtures at Phase 13. Phase 14 (84 fixtures), Phase 15 (88 fixtures, ~175+ functions).

Phase 14 – Functions batch 2: math, datetime, type/conditional ✅ COMPLETED

Goal: Add ~100 functions in math, datetime, type casting, and conditional logic toward ~300 total.

  • Math ✅: sin, cos, tan, asin, acos, atan, atan2, degrees, radians, signum (UDFs).
  • Datetime ✅: quarter, weekofyear, dayofweek, dayofyear, add_months, months_between, next_day (Polars dt + chrono UDFs).
  • Type/conditional ✅: cast, try_cast (PySpark-like type names), isnan, greatest, least.
  • Parity ✅: Parser branches for all; fixtures math_sin_cos, datetime_quarter_week (84 fixtures).
  • PyO3 ✅: Module and Column methods for all Phase 14 functions. Docs updated.

Phase 15 – Functions batch 3: remaining + fixture growth (6–8 weeks) ✅ COMPLETED

Goal: Reach 403-function parity and grow parity fixtures from 88 to 150+.

  • Functions: Batch 1 aliases (nvl, nvl2, substr, power, ln, ceiling, lcase, ucase, dayofmonth, to_degrees, to_radians, isnull, isnotnull), Batch 2 string (left, right, replace, startswith, endswith, contains, like, ilike, rlike), Batch 3 math (cosh, sinh, tanh, acosh, asinh, atanh, cbrt, expm1, log1p, log10, log2, rint, hypot), Batch 4 array_distinct — all implemented. See PHASE15_GAP_LIST.md, GAP_ANALYSIS_SPARKLESS_3.28.md.
  • Fixtures: 82 → 88 hand-written; target 150+ with convert_sparkless_fixtures.py when Sparkless expected_outputs available.
  • Outcome: Phase 15 scope done; remaining gaps covered in Phases 16–19.

Phase 16 – Remaining gaps 1: string and regex (2–3 weeks) ✅ COMPLETED

Goal: Implement remaining string/regex functions from PHASE15_GAP_LIST.md and GAP_ANALYSIS_SPARKLESS_3.28.md.

  • String/regex: regexp_count, regexp_instr, regexp_substr, split_part, find_in_set, format_string, printf — all implemented. (btrim, locate, conv → Phase 21.)
  • Parity: Parser branches and fixtures (regexp_count, regexp_substr, regexp_instr, split_part, find_in_set, format_string).
  • PyO3: Exposed on Column and module.
  • Outcome: String/regex gap closed; ready for Phase 17.

Phase 17 – Remaining gaps 2: datetime and unix (2–3 weeks) ✅ COMPLETED

Goal: Implement datetime/unix and remaining math from the gap list.

  • Datetime/unix: unix_timestamp, to_unix_timestamp, from_unixtime, make_date, timestamp_seconds, timestamp_millis, timestamp_micros, unix_date, date_from_unix_date — all implemented; optional convert_timezone, current_timezone, now, curdate, localtimestamp deferred.
  • Math: pmod, factorial — implemented.
  • Parity: Parser branches and 10 fixtures.
  • PyO3: Exposed on Column and module.
  • Outcome: Datetime/unix gap closed; ready for Phase 18.

Phase 18 – Remaining gaps 3: array, map, struct ✅ COMPLETED

Goal: Implement remaining array/map/struct functions.

  • Array: array_append, array_prepend, array_insert, array_except, array_intersect, array_union, zip_with (UDF + list.eval). (arrays_overlap, arrays_zip, explode_outer, posexplode_outer, array_agg → Phase 21.)
  • Map: map_concat, map_filter, map_zip_with (list.eval / UDF + list.eval), map_from_entries, map_contains_key, get (map element). (str_to_map → Phase 21.)
  • Struct: named_struct, struct. (transform_keys, transform_values → Phase 21.)
  • Parity: Parser for struct-field expressions; fixtures map_filter, zip_with, map_zip_with (121 → 124).
  • PyO3: map_filter_value_gt, zip_with_coalesce, map_zip_with_coalesce convenience helpers.
  • Outcome: Array/map/struct gap closed including deferred functions; ready for Phase 19.

Phase 19 – Remaining gaps 4: aggregates and try_COMPLETED*

Goal: Implement remaining aggregates and try_* / misc functions.

  • Aggregates: any_value, bool_and, bool_or, every/some, count_if, max_by, min_by, percentile, product, collect_list, collect_set — all implemented. percentile_approx deferred.
  • Try_*: try_divide, try_add, try_subtract, try_multiply, try_element_at — all implemented.
  • Misc: width_bucket, elt, bit_length, typeof.
  • Parity: groupby_any_value, groupby_product, try_divide, width_bucket (124 → 128).
  • PyO3: GroupedData methods; try_*, width_bucket, elt, bit_length, typeof.
  • Outcome: Aggregates and try_* gap closed; ready for Phase 20 (full parity part 1).

Phase 20 – Full parity 1: ordering, aggregates, numeric (1.5–2 weeks) ✅ COMPLETED

Goal: High-value ordering and aggregate functions. Reference: PARITY_CHECK_SPARKLESS_3.28.md.

  • Ordering ✅: asc, asc_nulls_first, asc_nulls_last, desc, desc_nulls_first, desc_nulls_last (return SortOrder for use in orderBy); DataFrame::order_by_exprs(sort_orders).
  • Aggregates ✅: median, mode; stddev_pop, stddev_samp, var_pop, var_samp; try_sum, try_avg. Deferred: covar_pop, covar_samp, corr (as groupBy agg), kurtosis, skewness, percentile_approx.
  • Numeric ✅: bround; negate, negative, positive; cot, csc, sec; e, pi (constants).

Parity: Fixtures groupby_median, with_bround; OrderBy supports optional nulls_first. Outcome: ~25 new functions; ready for Phase 21.


Phase 21 – Full parity 2: string, binary, type, array/map/struct (2 weeks) ✅ COMPLETED

Goal: String/binary/type and collection extensions.

  • String: btrim; locate; conv (base conversion). ✅
  • Binary: hex, unhex; bin; getbit. Defer: decode, encode; to_binary, try_to_binary. ✅
  • Type/cast: to_char, to_number, to_varchar; try_to_number, try_to_timestamp. ✅
  • Array: array_agg; arrays_overlap, arrays_zip; explode_outer, posexplode_outer. Defer: aggregate. ✅
  • Map: str_to_map. ✅
  • Struct: transform_keys, transform_values. ✅

Parity: Fixtures with_btrim, with_hex, with_conv, with_str_to_map, arrays_overlap, arrays_zip. Outcome: ~20 new functions; ready for Phase 22.


Phase 22 – Full parity 3: datetime extensions ✅ COMPLETED

Goal: Complete datetime function set.

  • convert_timezone, current_timezone; curdate; date_diff, date_part; dateadd, datepart; dayname, weekday; days, hours, months, years; extract; localtimestamp; now.
  • make_timestamp, make_timestamp_ntz, make_interval; timestampadd, timestampdiff; to_timestamp; from_utc_timestamp, to_utc_timestamp; unix_micros, unix_millis, unix_seconds.

Parity: Fixtures with_dayname, with_weekday, with_extract, with_unix_micros, make_timestamp_test, timestampadd_test, from_utc_timestamp_test; with_curdate_now skipped (non-deterministic). Outcome: ~25 new functions; ready for Phase 23.


Phase 23 – Full parity 4: JSON, CSV, URL, misc ✅ COMPLETED

Goal: JSON/CSV/URL and misc helpers.

  • JSON: json_array_length, parse_url. Defer: json_object_keys, json_tuple.
  • Schema/I/O: Defer from_csv, to_csv, schema_of_csv, schema_of_json.
  • URL: url_decode, url_encode.
  • Misc: isin, isin_i64, isin_str; equal_null; hash; shiftLeft, shiftRight, shiftRightUnsigned; stack; version. Defer: inline, inline_outer, sentences, sequence, shuffle, call_function.

Parity: Fixtures with_isin, with_url_decode, with_url_encode, json_array_length_test, with_hash, with_shift_left. Outcome: ~18 new functions; ready for Phase 24.


Phase 24 – Full parity 5: bit, control, JVM stubs, random, crypto (1.5–2 weeks) ✅ COMPLETED

Goal: Bit operations, control flow, JVM compatibility stubs, random, crypto.

  • Bit ✅: bit_and, bit_or, bit_xor, bit_count, bit_get; bitwiseNOT, bitwise_not. (bitmap_* deferred if in Sparkless 3.28.)
  • Control ✅: assert_true, raise_error.
  • JVM stubs ✅: broadcast, spark_partition_id, input_file_name, monotonically_increasing_id, current_catalog, current_database, current_schema, current_user, user — no-ops or placeholders for API compatibility.
  • Random ✅: rand(seed), randn(seed) — real RNG with optional seed; one value per row when used in with_column/with_columns (PySpark-like). udf, pandas_udf — stub or minimal support.
  • Crypto ✅: aes_encrypt, aes_decrypt, try_aes_decrypt (AES-128-GCM). See PYSPARK_DIFFERENCES.md.

Defer: regr_* (regression), XML (from_xml, to_xml, xpath_*), ML/HLL — document as out of scope.

Parity: Fixture with_bit_ops; with_rand_seed, with_jvm_stubs. PyO3: All new functions exposed. Outcome: Full parity 5 done; ready for Phase 25 (readiness for post-refactor merge).


Phase 25 – Readiness for post-refactor merge (3–4 weeks) ✅ COMPLETED

Goal: Prepare robin-sparkless so that when Sparkless completes its refactor plan (serializable logical plan and materialize_from_plan), integration is a thin adapter. See READINESS_FOR_SPARKLESS_PLAN.md for full detail.

  • Plan interpreter ✅: execute_plan(session, data, schema, plan) in Rust (src/plan/); exposed in Python as robin_sparkless.execute_plan(data, schema, plan_json) returning a DataFrame (call .collect() for list of dicts).
  • Logical plan schema ✅: docs/LOGICAL_PLAN_FORMAT.md defines op names, payload shapes, and expression tree format.
  • Expression interpreter ✅: src/plan/expr.rs turns serialized expression trees into Polars Expr. Supports col, lit, comparison/logical ops, eq_null_safe, and all scalar functions (string, math, datetime, type/conditional, bit, array, map, misc; two-arg when). See LOGICAL_PLAN_FORMAT.md.
  • Plan-based fixtures and tests ✅: tests/fixtures/plans/ (filter_select_limit, join_simple); plan_parity_fixtures test in tests/parity.rs.
  • Flexible DataFrame creation ✅: Rust create_dataframe_from_rows(rows, schema) in session; Python createDataFrame(data, schema=None) on SparkSession (#372).

Outcome: When Sparkless emits a logical plan, we can execute it via execute_plan; Sparkless robin backend becomes a thin wrapper. Ready for Phase 26 (crate publish).


Phase 26 – Prepare and publish robin-sparkless as a Rust crate (2–3 weeks)

Goal: Make the library ready for public use as a Rust dependency before Sparkless integration.

  • Crate metadata: Finalize Cargo.toml (description, license, repository, keywords, categories); ensure semver and version are release-ready.
  • API surface: Review public API for stability; document breaking-change policy (e.g. semver for 0.x); consider #[deprecated] or feature flags for experimental APIs.
  • Documentation: cargo doc builds cleanly; add or expand crate-level and module docs; link from README to docs.rs (or hosted docs).
  • Release workflow: Tag releases; publish to crates.io (cargo publish).
  • CI: Ensure CI runs full check (format, clippy, audit, deny, tests, benchmarks) on release branches/tags; document how to cut a release in CONTRIBUTING or README.

Outcome: robin-sparkless is published on crates.io; consumers can add it as a dependency; clear release and versioning process.


Phase 27 – Sparkless integration (4–6 weeks)

Goal: Make robin-sparkless a runnable backend for Sparkless while keeping the Rust API stable.

  • Sparkless repo: Add "robin" backend option to BackendFactory; when selected, delegate DataFrame execution to robin-sparkless via FFI using the published Rust crate. After Sparkless refactor, wire materialize_from_plan to our execute_plan.
  • Fallback: When an operation is not supported, raise a clear error or fall back to Sparkless’s existing backend; document behavior.
  • Target: 200+ Sparkless tests passing with robin backend (current: 0).

Outcome: Sparkless can run against robin-sparkless; 200+ tests passing; backend behavior matches the documented function set.


Ongoing / optional (lower priority)

  • Trait-based backend: QueryExecutor, DataMaterializer for pluggability (FULL_BACKEND_ROADMAP Phase 1 Future).
  • Expression model doc: Document Column/Expr mapping to Sparkless ColumnOperation trees.
  • UDFs: Pure-Rust UDFs, scalar and column-wise vectorized Python UDFs via spark.udf().register, and grouped vectorized aggregation UDFs via pandas_udf(..., function_type="grouped_agg"). Full pandas_udf decorator surface and UDTFs remain out of scope.
  • Memory / scale: Optional memory profiling and large-dataset handling.

Summary metrics (full parity targets)

Metric Current After Phase 22 After Phase 24 (full parity) After Phase 25 (readiness) After Phase 26 (crate) Full Backend (Phase 27)
Parity fixtures 159 159 180+ 180+ 180+ 180+
Functions ~283 ~283 ~280 ~280 ~280 ~280
DataFrame methods ~55+ ~55+ ~55+ ~55+ ~55+ 85
Plan interpreter / execute_plan Yes Yes Yes Yes
Crate on crates.io No Yes Yes
Sparkless tests passing (robin backend) 0 200+

Testing Strategy

To enforce the roadmap above, we will:

  • Use PySpark as the oracle
  • ✅ Maintain a small Python tool (tests/gen_pyspark_cases.py) that:
    • ✅ Runs pipelines in PySpark.
    • ✅ Emits JSON fixtures describing inputs, operations, and expected outputs.
  • ✅ Fixtures live under tests/fixtures/ and are versioned with the repo.
  • Sparkless test conversion: Build a fixture converter to reuse Sparkless's 270+ expected_outputs; see SPARKLESS_INTEGRATION_ANALYSIS.md §4.

  • Drive Rust tests from fixtures

  • tests/parity.rs:
    • ✅ Reconstructs a DataFrame from each fixture's input.
    • ✅ Applies the listed operations (filter, select, groupBy+agg, orderBy, etc.) via the Rust API.
    • ✅ Collects results and compares schema + rows against expected, with well-defined tolerances (e.g. for floats, or order-insensitive comparisons where PySpark doesn't guarantee ordering).
  • Plan fixtures (Phase 25): plan_parity_fixtures loads tests/fixtures/plans/*.json, runs plan::execute_plan, and asserts schema and rows match expected. Fixtures use the serialized plan format in LOGICAL_PLAN_FORMAT.md.

  • Track parity coverage

  • ✅ Initial parity test infrastructure in place
  • ✅ Maintain a parity matrix (operations × data types × edge cases) in PARITY_STATUS.md (see PARITY_STATUS.md)
    • Each cell indicates whether it's covered (and by which fixture), not yet covered, or intentionally diverges.

This testing strategy makes PySpark behavior the reference and gives us a clear, automated way to detect regressions as the Rust API and Polars versions evolve.