Roadmap: PySpark Semantics in a Rust Crate (No JVM)¶
Core Principle: PySpark Parity on a Polars Backend¶
Primary Goal: Implement a Rust crate whose behavior closely emulates PySpark's SparkSession / DataFrame / Column semantics, but runs entirely in Rust with Polars as the execution engine (no JVM, no Python runtime).
Sparkless Integration Goal: Robin-sparkless is designed to replace the backend logic of Sparkless (the Python PySpark drop-in). Sparkless would call robin-sparkless via FFI for DataFrame execution. See SPARKLESS_INTEGRATION_ANALYSIS.md for architecture mapping, structural learnings, and test conversion strategy.
Constraints: - Use Polars as the underlying dataframe/expressions engine. - Match PySpark behavior where practical (null handling, grouping, joins, expression semantics). - Stay honest about differences (and document them) when perfect parity is impossible.
Short-Term Objectives (0–1 month) ✅ COMPLETED¶
- Clarify API Surface ✅
- ✅ Decided on core PySpark API surface:
SparkSession.builder,createDataFrame, coreDataFrametransforms/actions -
✅ Implemented Rust equivalents with Rust types and error handling
-
Minimal Parity Slice ✅
- ✅ End-to-end support for PySpark-style pipelines in Rust:
- ✅ Session creation (
SparkSession::builder().get_or_create()) - ✅
createDataFramefrom simple rows (Vec<(i64, i64, String)>tuples) - ✅
select,filter,groupBy(...).count(),orderBy - ✅
show,collect,count
- ✅ Session creation (
-
✅ Behavior-checked these operations against PySpark on fixtures (36 scenarios passing)
-
Behavioral Tests ✅
- ✅ Test harness implemented (
tests/parity.rs):- ✅ Runs pipelines in PySpark via
tests/gen_pyspark_cases.py - ✅ Runs logical equivalent through Robin Sparkless
- ✅ Compares schemas and results with proper null/type handling
- ✅ Runs pipelines in PySpark via
- ✅ JSON fixtures generated and versioned (
tests/fixtures/*.json) - ✅ All parity tests passing for initial slice
Medium-Term Objectives (1–3 months) ✅ COMPLETED¶
- Data Source Readers ✅ COMPLETED
- ✅ Implement CSV/Parquet/JSON readers using Polars IO
- ✅ Basic PySpark-like schema inference behavior (header detection, infer_schema_length)
-
✅ Parity tests for file reading operations (3 new fixtures: read_csv, read_parquet, read_json)
-
Expression Semantics ✅ COMPLETE
- ✅ Basic
Columnand functions (col,lit, basic aggregates) - ✅ String literal support in filter expressions
- ✅ Expand functions:
when,coalesceimplemented - ✅
when().then().otherwise()conditional expressions - ✅
coalesce()for null handling - ✅
withColumn()support for adding computed columns - ✅ PySpark-style null comparison semantics, including
eqNullSafeand comparisons against NULL columns - ✅ Basic numeric type coercion for int/double comparisons and arithmetic (via Polars expressions)
- ✅ Complex filter expressions with logical operators (AND, OR, NOT, &&, ||, !) and nested conditions
- ✅ Arithmetic expressions in withColumn (+, -, *, /) with proper operator precedence
-
✅ Mixed arithmetic and logical expressions (e.g.,
(col('a') + col('b')) > col('c')) -
Grouping and Joins ✅ COMPLETE
- ✅ Basic
groupBy+count()working with parity tests - ✅ Additional aggregates:
sum,avg,min,maxon GroupedData - ✅ Generic
agg()method for multiple aggregations - ✅ Column reordering after groupBy to match PySpark order (grouping columns first)
- ✅ Ensure
groupBy+ aggregates behave like PySpark (verified via groupby_null_keys, groupby_single_group, groupby_single_row_groups; see PYSPARK_DIFFERENCES.md) - ✅ Implement common join types (inner, left, right, outer) with parity fixtures
- ✅ Multi-agg support in
GroupedData::agg()with parity fixture - ✅ Window functions: row_number, rank, dense_rank, lag, lead with
.over(partition_by)parity fixtures - ✅ String functions: upper, lower, substring, concat, concat_ws with parity fixtures
Longer-Term Objectives (3+ months) – Full Sparkless Backend¶
The path to full backend replacement is planned in FULL_BACKEND_ROADMAP.md. Summary:
| Phase | Goal | Est. Effort |
|---|---|---|
| 1. Foundation | Structural alignment, case sensitivity, fixture converter | 2–3 weeks |
| 2. High-Value Functions | String (length, trim, regexp_*), datetime (to_date, date_add), math (stddev, variance) | 4–6 weeks |
| 3. DataFrame Methods | union, distinct, drop, fillna, limit, withColumnRenamed | 3–4 weeks |
| 4. (Historical) PyO3 Bridge | Python bindings so Sparkless can call robin-sparkless (since removed from this repo; future bindings expected out-of-tree) | 4–6 weeks |
| 5. Test Conversion | Convert 50+ Sparkless tests, CI integration | 2–3 weeks |
| 6. Broad Function Parity | Array (array_position, array_remove, posexplode ✅; array_repeat, array_flatten ✅ Phase 8), Map/JSON/string 6.4/window ✅ | 8–12 weeks |
| 7. SQL & Advanced | SQL executor, Delta Lake, performance | ✅ COMPLETED (optional features) |
| 8. Remaining Parity | ✅ COMPLETED (Feb 2026): array_repeat, array_flatten, Map (create_map, map_keys, map_values, map_entries, map_from_arrays), String 6.4 (soundex, levenshtein, crc32, xxhash64); window fixtures covered; documentation of differences |
- Broader API Coverage (Phases 2 & 6)
- String: length, trim, regexp_extract, regexp_replace, split, initcap (string basics ✅ done).
- Datetime: to_date, date_add, date_sub, date_format, year, month, day, etc.
- Math: abs, ceil, floor, sqrt, stddev, variance, count_distinct.
- Array/Map/JSON: array_, map_, get_json_object, from_json, to_json.
- Function parity with Sparkless (403+); use PYSPARK_FUNCTION_MATRIX as checklist.
-
UDF story: pure-Rust UDFs and Python UDFs (scalar and column-wise vectorized) with a session-scoped registry; minimal grouped vectorized aggregation support via
pandas_udf(..., function_type="grouped_agg"). Other pandas_udf variants and UDTFs remain out of scope. -
DataFrame Methods (Phase 3)
- union, unionByName, distinct, drop, dropna, fillna, limit, withColumnRenamed.
-
crossJoin, replace, describe, cache/persist.
-
Historical PyO3 Bridge (Phase 4) ✅ COMPLETED
- Earlier phases included an optional
pyo3feature androbin_sparklessPython module with SparkSession, DataFrame, Column, and GroupedData. -
That historical module is gone. The current supported Python integration for Sparkless v4 lives in this repository under
python/(native extension cratesparkless-native+ Python wrapper packagesparkless). Other bindings can still live out-of-tree and call the Rust crate via FFI. -
Performance & Robustness (Phase 7) ✅ COMPLETED
- Benchmarks:
cargo benchcompares robin-sparkless vs plain Polars (filter → select → groupBy). - Target: within ~2x of Polars for supported ops.
- Error handling: clearer messages (column names, hints); Troubleshooting in QUICKSTART.md.
- Benchmarks:
-
Remaining Parity – Phase 10 & Phase 8 ✅ COMPLETED
- String 6.4: mask, translate, substring_index, soundex, levenshtein, crc32, xxhash64 (all implemented; Phase 8 UDFs via strsim, crc32fast, twox-hash, soundex).
- Array extensions: array_exists, array_forall, array_filter, array_transform, array_sum, array_mean, array_flatten, array_repeat (all implemented; Phase 8 map UDFs).
- Map (6b): create_map, map_keys, map_values, map_entries, map_from_arrays (all implemented; Map as List(Struct{key, value}); Phase 8).
- JSON (6c): get_json_object, from_json, to_json implemented (extract_jsonpath, dtype-struct).
- Window fixtures: percent_rank, cume_dist, ntile, nth_value covered (multi-step workaround in harness).
- Documentation of differences: See PYSPARK_DIFFERENCES.md.
-
Gap closure (bitmap, datetime/interval, misc, DataFrame) ✅ COMPLETED (Feb 2026)
- Bitmap (5): bitmap_bit_position, bitmap_bucket_number, bitmap_construct_agg, bitmap_count, bitmap_or_agg.
- Datetime/interval: make_dt_interval, make_ym_interval, to_timestamp_ltz, to_timestamp_ntz.
- Misc: sequence, shuffle, inline, inline_outer; regr_* (9 regression aggregates); stubs for call_function, UDF/UDTF, window/session_window, HLL/sketch (see PYSPARK_DIFFERENCES.md).
- DataFrame: cube(), rollup() with .agg(); write().mode().format().save() (parquet/csv/json); data, toLocalIterator (same as collect); persist/unpersist (no-op); stubs for rdd, foreach, foreachPartition, mapInPandas, mapPartitions, storageLevel, isStreaming, withWatermark.
- Deferred: XML (from_xml, to_xml, schema_of_xml), XPath, sentences — documented in ROBIN_SPARKLESS_MISSING.md.
Sparkless Integration Phases (see SPARKLESS_INTEGRATION_ANALYSIS.md, FULL_BACKEND_ROADMAP.md)¶
- Phase 1 – Foundation: Structural alignment (split dataframe.rs), case sensitivity, fixture converter. Prereqs done: joins, windows, strings.
- Phase 2 – High-Value Functions: String (length, trim, regexp_*), datetime (to_date, date_add), math (stddev, variance)
- Phase 3 – DataFrame Methods: union, unionByName, distinct, drop, dropna, fillna, limit, withColumnRenamed ✅ COMPLETED
- Phase 4 – Historical PyO3 Bridge: Python bindings for Sparkless to call robin-sparkless ✅ COMPLETED (bindings have since been removed from this repo; future bindings are expected to live out-of-tree)
- Phase 5 – Test Conversion: Converter extended (join, window, withColumn, union, distinct, drop, dropna, fillna, limit, withColumnRenamed); parity discovers
tests/fixtures/+tests/fixtures/converted/;make sparkless-parity(set SPARKLESS_EXPECTED_OUTPUTS); SPARKLESS_PARITY_STATUS.md for pass/fail; 159 passing (50+ target met; +10 signature-alignment fixtures) ✅ COMPLETED - Phase 6 – Broad Parity: Array (6a ✅; array_position, array_remove, posexplode via list.eval; array_repeat, array_flatten ✅ Phase 8), Map (6b ✅ Phase 8), JSON (6c ✅), additional string (6e ✅; 6.4 soundex/levenshtein/crc32/xxhash64 ✅ Phase 8), window extensions (6d ✅; percent_rank/cume_dist/ntile/nth_value covered).
- Phase 7 – SQL & Advanced ✅ COMPLETED: Optional SQL (
sqlfeature:spark.sql(), temp views); optional Delta (deltafeature:read_delta,read_delta_with_version,write_delta); benchmarks and error-message improvements. See FULL_BACKEND_ROADMAP.md §7. - Phase 8 – Remaining Parity ✅ COMPLETED (Feb 2026): array_repeat, array_flatten; Map (create_map, map_keys, map_values, map_entries, map_from_arrays); String 6.4 (soundex, levenshtein, crc32, xxhash64); window fixtures covered; documentation of differences.
Success Metrics¶
We know we're on track if:
- ✅ Behavioral parity: For core operations (filter, select, orderBy, groupBy+count/sum/avg/min/max/agg, when/coalesce, basic type coercion, null semantics, joins, window functions, array and string functions, math, datetime, type/conditional), DataFrame methods (union, distinct, drop, dropna, fillna, limit, withColumnRenamed), and file readers (CSV/Parquet/JSON), PySpark and Robin Sparkless produce the same schema and data on test fixtures. Status: PASSING (159 fixtures)
- ✅ Documentation of differences: Any divergence from PySpark semantics is called out in PYSPARK_DIFFERENCES.md (window, SQL, Delta, Phase 8).
- ✅ Performance envelope: For supported operations, we stay within ~2x of doing the same thing directly in Polars. Status: BENCHMARKED (
cargo bench; see QUICKSTART.md § Benchmarks)
Full backend targets (see FULL_BACKEND_ROADMAP.md):
| Metric | Current | After Phase 24 (full parity) | Full Backend (Phase 27) |
|---|---|---|---|
| Parity fixtures | 159 | 180+ | 180+ |
| Functions | ~283 | ~280 (Sparkless 3.28) | ~280 |
| DataFrame methods | ~55+ | ~55+ | 85 |
| Sparkless tests passing (robin backend) | 0 | — | 200+ |
| Language bindings | Out-of-tree (Sparkless or host repo) | Out-of-tree | Out-of-tree |
Current Status (February 2026)¶
Completed Core Parity Slice:
- ✅ SparkSession::create_dataframe for simple tuples
- ✅ DataFrame::filter with simple expressions (col('x') > N, string comparisons)
- ✅ DataFrame::select
- ✅ DataFrame::order_by / sort
- ✅ DataFrame::group_by → GroupedData::count()
- ✅ DataFrame::with_column for adding computed columns
- ✅ DataFrame::join() for inner, left, right, outer joins
- ✅ File readers: read_csv(), read_parquet(), read_json() with schema inference
- ✅ Expression functions: when().then().otherwise(), coalesce()
- ✅ GroupedData aggregates: sum(), avg(), min(), max(), and generic agg()
- ✅ Null comparison semantics: equality/inequality vs NULL, ordering comparisons vs NULL, and eqNullSafe
- ✅ Numeric type coercion for int/double comparisons and simple arithmetic
- ✅ Window functions: Column::rank(), row_number(), dense_rank(), lag(), lead(), first_value, last_value, percent_rank() with .over(partition_by)
- ✅ Array functions: array_size/size, array_contains, element_at, explode, array_sort, array_join, array_slice
- ✅ String functions: upper(), lower(), substring(), concat(), concat_ws(), length, trim, regexp_extract, regexp_replace, regexp_extract_all, regexp_like, split
- ✅ Datetime: year(), month(), day(), to_date(), date_format(format) (chrono strftime)
- ✅ DataFrame methods: union, union_by_name, distinct, drop, dropna, fillna, limit, with_column_renamed
- ✅ Rust API for bindings: Full surface for Sparkless or other hosts: SparkSession, create_dataframe_from_rows, create_dataframe_from_single_column (single-column schema, verify_schema), execute_plan, DataFrame::collect_as_json_rows, etc. Python bindings are maintained out-of-tree (e.g. in Sparkless). See EMBEDDING.md.
- ✅ Phase 9 (high-value functions & DataFrame methods): Datetime (current_date, current_timestamp, date_add, date_sub, hour, minute, second, datediff, last_day, trunc); string (repeat, reverse, instr, lpad, rpad); math (sqrt, pow, exp, log); conditional (nvl/ifnull, nullif, nanvl); GroupedData (first, last, approx_count_distinct); DataFrame (replace, cross_join, describe, cache/persist/unpersist, subtract, intersect).
- ✅ Parity test harness with 159 passing fixtures:
- filter_age_gt_30: filter + select + orderBy
- filter_and_or: nested boolean logic with AND/OR and parentheses
- filter_nested: nested boolean logic
- filter_not: NOT / negation semantics
- groupby_count: groupBy + count + orderBy
- groupby_with_nulls: groupBy with null values + count
- groupby_sum: groupBy + sum aggregation
- groupby_avg: groupBy + avg aggregation
- read_csv: CSV file reading + operations
- read_parquet: Parquet file reading + operations
- read_json: JSON file reading + operations
- with_logical_column: boolean columns / logical expressions in withColumn
- with_arithmetic_logical_mix: mixed arithmetic + logical comparison in withColumn
- when_otherwise: when().then().otherwise() conditional expressions
- when_then_otherwise: chained when expressions
- coalesce: null handling with coalesce
- null_comparison_equality: null equality/inequality semantics
- null_comparison_ordering: ordering comparisons vs NULL
- null_safe_equality: null-safe equality (eqNullSafe)
- null_in_filter: null handling in filter predicates
- type_coercion_numeric: int vs double comparison coercion
- type_coercion_mixed: int + double arithmetic coercion
- inner_join, left_join, right_join, outer_join: join operations
- groupby_multi_agg: multiple aggregations in one agg() call
- row_number_window, rank_window, lag_lead_window: window functions
- string_upper_lower, string_substring, string_concat, string_length_trim: string functions
- union_all, union_by_name, distinct, drop_columns, dropna, fillna, limit, with_column_renamed: DataFrame methods
- array_contains, element_at, array_size: array functions (split + array_contains/element_at/size)
- regexp_like, regexp_extract_all: string regex fixtures
- string_repeat_reverse, string_lpad_rpad: repeat, reverse, lpad, rpad
- math_sqrt_pow: sqrt, pow
- groupby_first_last: first, last aggregates
- cross_join, describe, replace, subtract, intersect: DataFrame methods
- Phase 12: first_row, head_n, offset_n: first/head/offset DataFrame methods
- Phase 20: groupby_median, with_bround: median, bround; OrderBy with nulls_first
- Phase 21: with_btrim, with_hex, with_conv, with_str_to_map, arrays_overlap, arrays_zip
- Phase 22: with_dayname, with_weekday, with_extract, with_unix_micros, make_timestamp_test, timestampadd_test, from_utc_timestamp_test
- Phase 23: with_isin, with_url_decode, with_url_encode, json_array_length_test, with_hash, with_shift_left
- ✅ Phase 25 (readiness for post-refactor merge): Plan interpreter (execute_plan(session, data, schema, plan) in Rust; Python robin_sparkless.execute_plan(data, schema, plan_json)); expression interpreter (all scalar functions; serialized expr trees → Polars Expr in src/plan/expr.rs); logical plan schema (LOGICAL_PLAN_FORMAT.md); plan fixtures (tests/fixtures/plans/filter_select_limit.json, join_simple.json, with_column_functions.json) with plan_parity_fixtures test; create_dataframe_from_rows (Rust + Python) for arbitrary schema and list of dicts/rows.
Next Steps to Full Sparkless Parity¶
To reach full Sparkless parity (robin-sparkless as a complete backend replacement), the remaining work is organized into phases 12–27 below (phases 9–25 complete). Phases 26–27 are crate publish and Sparkless integration. Reference: FULL_BACKEND_ROADMAP.md, PYSPARK_FUNCTION_MATRIX, GAP_ANALYSIS_SPARKLESS_3.28.md.
Phase overview¶
| Phase | Goal | Est. effort |
|---|---|---|
| 9 | High-value functions + DataFrame methods | ✅ COMPLETED |
| 10 | Complex types (Map, JSON, array_repeat, string 6.4) + window fixture simplification | ✅ COMPLETED |
| 11 | Parity scale (93 fixtures), harness date/datetime, converter + CI | ✅ COMPLETED |
| 12 | DataFrame methods parity (~55+ methods; freq_items, approx_quantile, crosstab, melt, sample_by, no-ops; PyO3 stat/na/to_pandas) | ✅ COMPLETED |
| 13 | Functions batch 1: string, binary, collection (~80 new → ~200 total) | ✅ COMPLETED |
| 14 | Functions batch 2: math, datetime, type/conditional (~100 new → ~300 total) | ✅ COMPLETED |
| 15 | Functions batch 3: remaining functions + fixture growth (88 → 150+ fixtures, 403 functions) | ✅ COMPLETED |
| 16 | Remaining gaps 1: string/regex (regexp_count, regexp_instr, regexp_substr, split_part, find_in_set, format_string, printf) | ✅ COMPLETED |
| 17 | Remaining gaps 2: datetime/unix (unix_timestamp, from_unixtime, make_date, timestamp_*, pmod, factorial) | ✅ COMPLETED |
| 18 | Remaining gaps 3: array/map/struct (array_append, array_prepend, array_insert, array_except/intersect/union, map_concat, map_from_entries, map_contains_key, get, named_struct, struct, map_filter, map_zip_with, zip_with) | ✅ COMPLETED |
| 19 | Remaining gaps 4: aggregates and try_* (any_value, bool_and, bool_or, count_if, max_by, min_by, percentile, product, try_add/divide/subtract/multiply/sum/avg, try_element_at, width_bucket, elt, bit_length, typeof) | ✅ COMPLETED |
| 20 | Full parity 1: ordering, aggregates, numeric | ✅ COMPLETED |
| 21 | Full parity 2: string, binary, type, array/map/struct | ✅ COMPLETED |
| 22 | Full parity 3: datetime extensions | ✅ COMPLETED |
| 23 | Full parity 4: JSON, CSV, URL, misc | ✅ COMPLETED |
| 24 | Full parity 5: bit, control, JVM stubs, random, crypto | ✅ COMPLETED |
| 25 | Readiness for post-refactor merge (plan interpreter, expression interpreter for all scalar functions, plan schema, 3 plan fixtures, create_dataframe_from_rows) | ✅ COMPLETED |
| 26 | Prepare and publish robin-sparkless as a Rust crate (crates.io, API stability, docs, release) | 2–3 weeks |
| 27 | Sparkless integration (BackendFactory "robin", 200+ tests), FFI/bindings surface | 4–6 weeks |
Phase 9 – High-value functions & DataFrame methods (4–6 weeks) ✅ COMPLETED¶
Goal: Complete remaining high-use functions and DataFrame methods so most Sparkless pipelines can run without falling back.
- Datetime:
current_date,current_timestamp,date_add,date_sub,hour,minute,second,datediff,last_day,trunc✅. (to_date, date_format, year, month, day ✅) - String:
repeat,reverse,instr/locate,lpad/rpad✅. (length, trim, regexp_*, split, initcap ✅) - Math:
abs,ceil,floor,round,sqrt,pow,exp,log✅; aggregatesfirst,last,approx_count_distinct✅. (stddev, variance, count_distinct ✅) - Conditional/null:
ifnull/nvl,nullif,nanvl✅. - DataFrame methods:
replace,crossJoin,describe,cache/persist/unpersist,subtract,intersect✅.
Outcome: Functions ~37+ → ~85+; DataFrame methods ~25 → ~35+. Parity harness extended with parse_with_column_expr for new functions and new Operation variants (Replace, CrossJoin, Describe, Subtract, Intersect). New fixtures: string_repeat_reverse, string_lpad_rpad, math_sqrt_pow, groupby_first_last, cross_join, describe, replace, subtract, intersect.
Phase 10 – Complex types & window parity ✅ COMPLETED¶
Goal: Implement Map, JSON, remaining array/string functions, and enable full window parity fixtures.
- Array:
array_repeat,array_flatten(Phase 8, Expr::map UDFs);array_exists,array_forall,array_filter,array_transform,array_sum,array_mean✅. - Map:
create_map,map_keys,map_values,map_entries,map_from_arrays(Phase 8; Map as List(Struct{key, value})) ✅. - JSON:
get_json_object,from_json,to_json✅; optionalbase64,unbase64(deferred). - String 6.4:
mask,translate,substring_index(Phase 10);soundex,levenshtein,crc32,xxhash64(Phase 8 UDFs) ✅. - Window: percent_rank, cume_dist, ntile, nth_value parity fixtures covered via multi-step workaround; documented in PYSPARK_DIFFERENCES.md.
Optional (document only): SQL extensions (subqueries, CTEs, HAVING, DDL); Delta schema evolution, MERGE. See PYSPARK_DIFFERENCES.md.
Outcome: Functions ~85+ → ~120+; Phase 8 and Phase 10 “remaining parity” items implemented.
Phase 11 – Parity scale & test conversion ✅ COMPLETED¶
Goal: Grow parity coverage and integrate Sparkless test conversion into CI.
- Parity harness: Date/datetime and boolean column support in fixture input (tests/parity.rs);
dtype_to_stringandcollect_to_simple_formatfor Date/Datetime/Int8;types_compatiblefor date/timestamp. - Fixture growth: 73 → 80 (Phase 11) → 82 (Phase 12–13) → 88 (Phase 14–15) → 93 (Phase 16) fixtures (Phase 16: regexp_count, regexp_substr, regexp_instr, split_part, find_in_set, format_string; array_distinct skipped).
- Converter: Date/timestamp type mapping added in tests/convert_sparkless_fixtures.py.
- CI: .github/workflows/ci.yml runs format, clippy, audit, deny, and Rust tests (including
pyspark_parity_fixtures). Locally:make check-full(Rust-only). Python tests and bindings live out-of-tree. - Docs: TEST_CREATION_GUIDE.md documents date/timestamp fixture format; SPARKLESS_PARITY_STATUS.md updated with CI note.
Outcome: 93 parity fixtures passing; CI runs parity; SPARKLESS_PARITY_STATUS kept current.
Phase 12 – DataFrame methods parity (4–6 weeks) ✅ COMPLETED¶
Goal: Implement the remaining ~50–60 DataFrame methods so robin-sparkless reaches the 85-method target before integration.
- Implemented:
sample,random_split,first,head,take,tail,is_empty,to_df,stat()(cov, corr),summary(alias describe),to_json,explain,print_schema,checkpoint,local_checkpoint,repartition,coalesce,select_expr,col_regex,with_columns,with_columns_renamed,na()(fill, drop),to_pandas,offset,transform,freq_items,approx_quantile,crosstab,melt,except_all,intersect_all,sample_by(stratified sampling); Spark no-ops:hint,is_local,input_files,same_semantics,semantic_hash,observe,with_watermark. - Parity: Fixtures
first_row.json,head_n.json,offset_n.jsonfor first/head/offset. - PyO3: All of the above exposed on Python DataFrame where applicable;
random_split,summary,to_df,select_expr,col_regex,with_columns,with_columns_renamed,stat()(returnsDataFrameStatwith cov/corr),na()(returnsDataFrameNawith fill/drop),to_pandas(same as collect; for use withpandas.DataFrame.from_records). - Outcome: DataFrame methods ~35 → ~55+; ready for function batches (Phase 13).
Phase 13 – Functions batch 1: string, binary, collection (4–6 weeks) ✅ COMPLETED¶
Goal: Add ~80 functions in string, binary, and collection categories toward ~200 total functions.
- String ✅ (partial):
ascii,format_number,overlay,position,char,chrimplemented;base64,unbase64(base64 crate). Remaining:format_string,encode/decode, etc. - Binary ✅ (partial):
sha1,sha2(bit_length),md5(string in → hex string out; sha1, sha2, md5 crates). AES_* deferred. - Collection ✅ (partial):
array_compact(drop nulls from list). Remaining: array_distinct, map extensions, etc. - Parity: Parser branches for all new functions; fixtures
string_ascii,string_format_number(82 fixtures at Phase 13 completion; 88 after Phase 15). - PyO3: Module and Column methods for ascii, format_number, overlay, position, char, chr, base64, unbase64, sha1, sha2, md5, array_compact.
- Outcome: Functions ~120 → ~130+; 82 parity fixtures at Phase 13. Phase 14 (84 fixtures), Phase 15 (88 fixtures, ~175+ functions).
Phase 14 – Functions batch 2: math, datetime, type/conditional ✅ COMPLETED¶
Goal: Add ~100 functions in math, datetime, type casting, and conditional logic toward ~300 total.
- Math ✅: sin, cos, tan, asin, acos, atan, atan2, degrees, radians, signum (UDFs).
- Datetime ✅: quarter, weekofyear, dayofweek, dayofyear, add_months, months_between, next_day (Polars dt + chrono UDFs).
- Type/conditional ✅: cast, try_cast (PySpark-like type names), isnan, greatest, least.
- Parity ✅: Parser branches for all; fixtures
math_sin_cos,datetime_quarter_week(84 fixtures). - PyO3 ✅: Module and Column methods for all Phase 14 functions. Docs updated.
Phase 15 – Functions batch 3: remaining + fixture growth (6–8 weeks) ✅ COMPLETED¶
Goal: Reach 403-function parity and grow parity fixtures from 88 to 150+.
- Functions: Batch 1 aliases (nvl, nvl2, substr, power, ln, ceiling, lcase, ucase, dayofmonth, to_degrees, to_radians, isnull, isnotnull), Batch 2 string (left, right, replace, startswith, endswith, contains, like, ilike, rlike), Batch 3 math (cosh, sinh, tanh, acosh, asinh, atanh, cbrt, expm1, log1p, log10, log2, rint, hypot), Batch 4 array_distinct — all implemented. See PHASE15_GAP_LIST.md, GAP_ANALYSIS_SPARKLESS_3.28.md.
- Fixtures: 82 → 88 hand-written; target 150+ with
convert_sparkless_fixtures.pywhen Sparkless expected_outputs available. - Outcome: Phase 15 scope done; remaining gaps covered in Phases 16–19.
Phase 16 – Remaining gaps 1: string and regex (2–3 weeks) ✅ COMPLETED¶
Goal: Implement remaining string/regex functions from PHASE15_GAP_LIST.md and GAP_ANALYSIS_SPARKLESS_3.28.md.
- String/regex:
regexp_count,regexp_instr,regexp_substr,split_part,find_in_set,format_string,printf— all implemented. (btrim,locate,conv→ Phase 21.) - Parity: Parser branches and fixtures (
regexp_count,regexp_substr,regexp_instr,split_part,find_in_set,format_string). - PyO3: Exposed on Column and module.
- Outcome: String/regex gap closed; ready for Phase 17.
Phase 17 – Remaining gaps 2: datetime and unix (2–3 weeks) ✅ COMPLETED¶
Goal: Implement datetime/unix and remaining math from the gap list.
- Datetime/unix:
unix_timestamp,to_unix_timestamp,from_unixtime,make_date,timestamp_seconds,timestamp_millis,timestamp_micros,unix_date,date_from_unix_date— all implemented; optionalconvert_timezone,current_timezone,now,curdate,localtimestampdeferred. - Math:
pmod,factorial— implemented. - Parity: Parser branches and 10 fixtures.
- PyO3: Exposed on Column and module.
- Outcome: Datetime/unix gap closed; ready for Phase 18.
Phase 18 – Remaining gaps 3: array, map, struct ✅ COMPLETED¶
Goal: Implement remaining array/map/struct functions.
- Array:
array_append,array_prepend,array_insert,array_except,array_intersect,array_union,zip_with(UDF + list.eval). (arrays_overlap,arrays_zip,explode_outer,posexplode_outer,array_agg→ Phase 21.) - Map:
map_concat,map_filter,map_zip_with(list.eval / UDF + list.eval),map_from_entries,map_contains_key,get(map element). (str_to_map→ Phase 21.) - Struct:
named_struct,struct. (transform_keys,transform_values→ Phase 21.) - Parity: Parser for struct-field expressions; fixtures
map_filter,zip_with,map_zip_with(121 → 124). - PyO3:
map_filter_value_gt,zip_with_coalesce,map_zip_with_coalesceconvenience helpers. - Outcome: Array/map/struct gap closed including deferred functions; ready for Phase 19.
Phase 19 – Remaining gaps 4: aggregates and try_ ✅ COMPLETED*¶
Goal: Implement remaining aggregates and try_* / misc functions.
- Aggregates:
any_value,bool_and,bool_or,every/some,count_if,max_by,min_by,percentile,product,collect_list,collect_set— all implemented.percentile_approxdeferred. - Try_*:
try_divide,try_add,try_subtract,try_multiply,try_element_at— all implemented. - Misc:
width_bucket,elt,bit_length,typeof. - Parity:
groupby_any_value,groupby_product,try_divide,width_bucket(124 → 128). - PyO3: GroupedData methods; try_*, width_bucket, elt, bit_length, typeof.
- Outcome: Aggregates and try_* gap closed; ready for Phase 20 (full parity part 1).
Phase 20 – Full parity 1: ordering, aggregates, numeric (1.5–2 weeks) ✅ COMPLETED¶
Goal: High-value ordering and aggregate functions. Reference: PARITY_CHECK_SPARKLESS_3.28.md.
- Ordering ✅:
asc,asc_nulls_first,asc_nulls_last,desc,desc_nulls_first,desc_nulls_last(return SortOrder for use in orderBy);DataFrame::order_by_exprs(sort_orders). - Aggregates ✅:
median,mode;stddev_pop,stddev_samp,var_pop,var_samp;try_sum,try_avg. Deferred:covar_pop,covar_samp,corr(as groupBy agg),kurtosis,skewness,percentile_approx. - Numeric ✅:
bround;negate,negative,positive;cot,csc,sec;e,pi(constants).
Parity: Fixtures groupby_median, with_bround; OrderBy supports optional nulls_first. Outcome: ~25 new functions; ready for Phase 21.
Phase 21 – Full parity 2: string, binary, type, array/map/struct (2 weeks) ✅ COMPLETED¶
Goal: String/binary/type and collection extensions.
- String:
btrim;locate;conv(base conversion). ✅ - Binary:
hex,unhex;bin;getbit. Defer:decode,encode;to_binary,try_to_binary. ✅ - Type/cast:
to_char,to_number,to_varchar;try_to_number,try_to_timestamp. ✅ - Array:
array_agg;arrays_overlap,arrays_zip;explode_outer,posexplode_outer. Defer:aggregate. ✅ - Map:
str_to_map. ✅ - Struct:
transform_keys,transform_values. ✅
Parity: Fixtures with_btrim, with_hex, with_conv, with_str_to_map, arrays_overlap, arrays_zip. Outcome: ~20 new functions; ready for Phase 22.
Phase 22 – Full parity 3: datetime extensions ✅ COMPLETED¶
Goal: Complete datetime function set.
convert_timezone,current_timezone;curdate;date_diff,date_part;dateadd,datepart;dayname,weekday;days,hours,months,years;extract;localtimestamp;now.make_timestamp,make_timestamp_ntz,make_interval;timestampadd,timestampdiff;to_timestamp;from_utc_timestamp,to_utc_timestamp;unix_micros,unix_millis,unix_seconds.
Parity: Fixtures with_dayname, with_weekday, with_extract, with_unix_micros, make_timestamp_test, timestampadd_test, from_utc_timestamp_test; with_curdate_now skipped (non-deterministic). Outcome: ~25 new functions; ready for Phase 23.
Phase 23 – Full parity 4: JSON, CSV, URL, misc ✅ COMPLETED¶
Goal: JSON/CSV/URL and misc helpers.
- JSON:
json_array_length,parse_url. Defer:json_object_keys,json_tuple. - Schema/I/O: Defer
from_csv,to_csv,schema_of_csv,schema_of_json. - URL:
url_decode,url_encode. - Misc:
isin,isin_i64,isin_str;equal_null;hash;shiftLeft,shiftRight,shiftRightUnsigned;stack;version. Defer:inline,inline_outer,sentences,sequence,shuffle,call_function.
Parity: Fixtures with_isin, with_url_decode, with_url_encode, json_array_length_test, with_hash, with_shift_left. Outcome: ~18 new functions; ready for Phase 24.
Phase 24 – Full parity 5: bit, control, JVM stubs, random, crypto (1.5–2 weeks) ✅ COMPLETED¶
Goal: Bit operations, control flow, JVM compatibility stubs, random, crypto.
- Bit ✅:
bit_and,bit_or,bit_xor,bit_count,bit_get;bitwiseNOT,bitwise_not. (bitmap_*deferred if in Sparkless 3.28.) - Control ✅:
assert_true,raise_error. - JVM stubs ✅:
broadcast,spark_partition_id,input_file_name,monotonically_increasing_id,current_catalog,current_database,current_schema,current_user,user— no-ops or placeholders for API compatibility. - Random ✅:
rand(seed),randn(seed)— real RNG with optional seed; one value per row when used inwith_column/with_columns(PySpark-like).udf,pandas_udf— stub or minimal support. - Crypto ✅:
aes_encrypt,aes_decrypt,try_aes_decrypt(AES-128-GCM). See PYSPARK_DIFFERENCES.md.
Defer: regr_* (regression), XML (from_xml, to_xml, xpath_*), ML/HLL — document as out of scope.
Parity: Fixture with_bit_ops; with_rand_seed, with_jvm_stubs. PyO3: All new functions exposed. Outcome: Full parity 5 done; ready for Phase 25 (readiness for post-refactor merge).
Phase 25 – Readiness for post-refactor merge (3–4 weeks) ✅ COMPLETED¶
Goal: Prepare robin-sparkless so that when Sparkless completes its refactor plan (serializable logical plan and materialize_from_plan), integration is a thin adapter. See READINESS_FOR_SPARKLESS_PLAN.md for full detail.
- Plan interpreter ✅:
execute_plan(session, data, schema, plan)in Rust (src/plan/); exposed in Python asrobin_sparkless.execute_plan(data, schema, plan_json)returning a DataFrame (call.collect()for list of dicts). - Logical plan schema ✅: docs/LOGICAL_PLAN_FORMAT.md defines op names, payload shapes, and expression tree format.
- Expression interpreter ✅:
src/plan/expr.rsturns serialized expression trees into PolarsExpr. Supports col, lit, comparison/logical ops, eq_null_safe, and all scalar functions (string, math, datetime, type/conditional, bit, array, map, misc; two-arg when). See LOGICAL_PLAN_FORMAT.md. - Plan-based fixtures and tests ✅:
tests/fixtures/plans/(filter_select_limit, join_simple);plan_parity_fixturestest intests/parity.rs. - Flexible DataFrame creation ✅: Rust
create_dataframe_from_rows(rows, schema)in session; PythoncreateDataFrame(data, schema=None)on SparkSession (#372).
Outcome: When Sparkless emits a logical plan, we can execute it via execute_plan; Sparkless robin backend becomes a thin wrapper. Ready for Phase 26 (crate publish).
Phase 26 – Prepare and publish robin-sparkless as a Rust crate (2–3 weeks)¶
Goal: Make the library ready for public use as a Rust dependency before Sparkless integration.
- Crate metadata: Finalize
Cargo.toml(description, license, repository, keywords, categories); ensure semver and version are release-ready. - API surface: Review public API for stability; document breaking-change policy (e.g. semver for 0.x); consider
#[deprecated]or feature flags for experimental APIs. - Documentation:
cargo docbuilds cleanly; add or expand crate-level and module docs; link from README to docs.rs (or hosted docs). - Release workflow: Tag releases; publish to crates.io (
cargo publish). - CI: Ensure CI runs full check (format, clippy, audit, deny, tests, benchmarks) on release branches/tags; document how to cut a release in CONTRIBUTING or README.
Outcome: robin-sparkless is published on crates.io; consumers can add it as a dependency; clear release and versioning process.
Phase 27 – Sparkless integration (4–6 weeks)¶
Goal: Make robin-sparkless a runnable backend for Sparkless while keeping the Rust API stable.
- Sparkless repo: Add "robin" backend option to BackendFactory; when selected, delegate DataFrame execution to robin-sparkless via FFI using the published Rust crate. After Sparkless refactor, wire
materialize_from_planto ourexecute_plan. - Fallback: When an operation is not supported, raise a clear error or fall back to Sparkless’s existing backend; document behavior.
- Target: 200+ Sparkless tests passing with robin backend (current: 0).
Outcome: Sparkless can run against robin-sparkless; 200+ tests passing; backend behavior matches the documented function set.
Ongoing / optional (lower priority)¶
- Trait-based backend:
QueryExecutor,DataMaterializerfor pluggability (FULL_BACKEND_ROADMAP Phase 1 Future). - Expression model doc: Document Column/Expr mapping to Sparkless
ColumnOperationtrees. - UDFs: Pure-Rust UDFs, scalar and column-wise vectorized Python UDFs via
spark.udf().register, and grouped vectorized aggregation UDFs viapandas_udf(..., function_type="grouped_agg"). Full pandas_udf decorator surface and UDTFs remain out of scope. - Memory / scale: Optional memory profiling and large-dataset handling.
Summary metrics (full parity targets)¶
| Metric | Current | After Phase 22 | After Phase 24 (full parity) | After Phase 25 (readiness) | After Phase 26 (crate) | Full Backend (Phase 27) |
|---|---|---|---|---|---|---|
| Parity fixtures | 159 | 159 | 180+ | 180+ | 180+ | 180+ |
| Functions | ~283 | ~283 | ~280 | ~280 | ~280 | ~280 |
| DataFrame methods | ~55+ | ~55+ | ~55+ | ~55+ | ~55+ | 85 |
| Plan interpreter / execute_plan | Yes | — | — | Yes | Yes | Yes |
| Crate on crates.io | No | — | — | — | Yes | Yes |
| Sparkless tests passing (robin backend) | 0 | — | — | — | — | 200+ |
Testing Strategy¶
To enforce the roadmap above, we will:
- Use PySpark as the oracle
- ✅ Maintain a small Python tool (
tests/gen_pyspark_cases.py) that:- ✅ Runs pipelines in PySpark.
- ✅ Emits JSON fixtures describing inputs, operations, and expected outputs.
- ✅ Fixtures live under
tests/fixtures/and are versioned with the repo. -
Sparkless test conversion: Build a fixture converter to reuse Sparkless's 270+ expected_outputs; see SPARKLESS_INTEGRATION_ANALYSIS.md §4.
-
Drive Rust tests from fixtures
- ✅
tests/parity.rs:- ✅ Reconstructs a
DataFramefrom each fixture'sinput. - ✅ Applies the listed operations (
filter,select,groupBy+agg,orderBy, etc.) via the Rust API. - ✅ Collects results and compares schema + rows against
expected, with well-defined tolerances (e.g. for floats, or order-insensitive comparisons where PySpark doesn't guarantee ordering).
- ✅ Reconstructs a
-
✅ Plan fixtures (Phase 25):
plan_parity_fixturesloadstests/fixtures/plans/*.json, runsplan::execute_plan, and asserts schema and rows match expected. Fixtures use the serialized plan format in LOGICAL_PLAN_FORMAT.md. -
Track parity coverage
- ✅ Initial parity test infrastructure in place
- ✅ Maintain a parity matrix (operations × data types × edge cases) in
PARITY_STATUS.md(seePARITY_STATUS.md)- Each cell indicates whether it's covered (and by which fixture), not yet covered, or intentionally diverges.
This testing strategy makes PySpark behavior the reference and gives us a clear, automated way to detect regressions as the Rust API and Polars versions evolve.