Implementation Status: Polars Migration

Strategic Direction: Sparkless Backend Replacement

Robin-sparkless is designed to replace the backend logic of Sparkless, the Python drop-in replacement for PySpark. Sparkless would call robin-sparkless via FFI for DataFrame execution. See SPARKLESS_INTEGRATION_ANALYSIS.md for architecture mapping, structural learnings, and test conversion strategy.

Build & Test Status

  • cargo check passes for the Rust-only, Polars-backed implementation.
  • There are no outstanding Rust compiler errors.
  • cargo test passes (unit/integration/doc tests).
  • make test runs Rust tests (wrapper for cargo test).
  • make check runs Rust formatting, clippy, audit, deny, and tests.
  • make sparkless-parity runs parity over hand-written and (if present) converted fixtures; set SPARKLESS_EXPECTED_OUTPUTS to convert from Sparkless first.

✅ Completed

1. Rust Core (default build)

  • Default build is pure Rust. Library exposes a Rust API.
  • Earlier phases included an optional robin_sparkless Python module. That historical module is gone, but this repo now ships the Sparkless v4 Python package under python/ (PyO3-based native extension crate sparkless-native + Python wrapper package sparkless). Other language bindings may live out-of-tree and call the Rust crate via FFI.

2. Polars Integration

  • DataFrame uses Polars LazyFrame internally (#438): transformations extend the lazy plan; only actions (collect, show, count, write, etc.) trigger materialization. Data sources (read_csv, read_parquet, read_json) return lazy DataFrames.
  • Column is a thin wrapper around Polars Expr.
  • Basic helpers implemented in functions.rs for literals and aggregates.
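
The lazy model described above can be illustrated without Polars. The following is a minimal, std-only sketch: `LazyRows`, `planned_steps`, and the other method names are hypothetical and are not the robin-sparkless or Polars API. Transformations only append a step to the plan; an action such as `collect` or `count` runs the recorded steps.

```rust
// Hypothetical std-only model of lazy evaluation; not the robin-sparkless or Polars API.
type Step = Box<dyn Fn(Vec<i64>) -> Vec<i64>>;

struct LazyRows {
    source: Vec<i64>,
    plan: Vec<Step>, // recorded transformations, not yet executed
}

impl LazyRows {
    fn new(source: Vec<i64>) -> Self {
        LazyRows { source, plan: Vec::new() }
    }

    // Transformation: extends the plan and returns without touching the data.
    fn filter(mut self, pred: fn(&i64) -> bool) -> Self {
        self.plan.push(Box::new(move |rows: Vec<i64>| -> Vec<i64> {
            rows.into_iter().filter(pred).collect()
        }));
        self
    }

    // Transformation: also only recorded.
    fn map(mut self, f: fn(i64) -> i64) -> Self {
        self.plan.push(Box::new(move |rows: Vec<i64>| -> Vec<i64> {
            rows.into_iter().map(f).collect()
        }));
        self
    }

    // Number of steps waiting in the plan.
    fn planned_steps(&self) -> usize {
        self.plan.len()
    }

    // Action: materializes the result by running every recorded step in order.
    fn collect(self) -> Vec<i64> {
        let mut rows = self.source;
        for step in &self.plan {
            rows = step(rows);
        }
        rows
    }

    // Action: counting also forces materialization in this sketch.
    fn count(self) -> usize {
        self.collect().len()
    }
}
```

For example, `LazyRows::new(vec![1, 2, 3, 4]).filter(|x| *x % 2 == 0).map(|x| x * 10)` holds two planned steps and does no work until `collect()` is called. A real engine would also optimize the plan before running it, which this sketch does not attempt.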

3. Session API + IO

  • SparkSession and SparkSessionBuilder are the Rust-facing entry point.
  • File readers are implemented via Polars IO:
      • SparkSession::read_csv
      • SparkSession::read_parquet
      • SparkSession::read_json

4. PySpark Parity Harness

  • tests/gen_pyspark_cases.py generates JSON fixtures from PySpark.
  • tests/parity.rs runs the fixtures through robin-sparkless and asserts parity.
  • Parity coverage is tracked in PARITY_STATUS.md.
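
The comparison step of such a harness might look like the sketch below. The `Fixture` shape, the `ordered` flag, and `check_parity` are assumptions made for illustration; the actual JSON schema and the logic in tests/parity.rs may differ.

```rust
// Hypothetical fixture shape; the real schema used by tests/parity.rs may differ.
type Row = Vec<Option<i64>>; // None models a SQL null

struct Fixture {
    name: &'static str,
    expected: Vec<Row>, // rows captured from PySpark
    ordered: bool,      // assumption: whether row order is part of the contract
}

// Compares engine output against the expected rows of one fixture.
fn check_parity(fixture: &Fixture, actual: &[Row]) -> Result<(), String> {
    let mut expected: Vec<Row> = fixture.expected.clone();
    let mut actual: Vec<Row> = actual.to_vec();
    if !fixture.ordered {
        // Without an ORDER BY, row order is not guaranteed, so compare as sorted multisets.
        expected.sort();
        actual.sort();
    }
    if expected == actual {
        Ok(())
    } else {
        Err(format!(
            "fixture {}: expected {:?}, got {:?}",
            fixture.name, expected, actual
        ))
    }
}
```

Sorting both sides before comparing keeps duplicates significant while ignoring order, which is usually what an unordered DataFrame comparison needs.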

⚙️ In Progress / Planned (toward broader PySpark parity)

  1. PySpark-inspired API surface
  2. Clarify which PySpark methods we intend to emulate first.
  3. Align naming and signatures (adapted to Rust) for SparkSession, DataFrame, Column.

  4. Behavioral Parity Slice

  5. Continue expanding parity coverage by adding fixtures for new capabilities and edge cases.
  6. Current fixture coverage and status lives in PARITY_STATUS.md.

  7. Joins ✅ (completed)

  8. ✅ Implemented common join types (inner, left, right, outer) via DataFrame::join()
  9. ✅ Parity fixtures for inner, left, right, outer joins
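
As a reminder of what the four join types keep, here is a std-only sketch over `(key, value)` pairs. `JoinType`, `Joined`, and `join` are hypothetical names for illustration only; they are not the signature of DataFrame::join(), and the output row order is not meant to match Spark or Polars.

```rust
// Std-only sketch of join semantics; names are hypothetical, not the crate's API.
#[derive(Clone, Copy, PartialEq, Debug)]
enum JoinType {
    Inner,
    Left,
    Right,
    Outer,
}

// (key, left value, right value); None models a null on the unmatched side.
type Joined = (i64, Option<&'static str>, Option<&'static str>);

fn join(
    left: &[(i64, &'static str)],
    right: &[(i64, &'static str)],
    how: JoinType,
) -> Vec<Joined> {
    let mut out: Vec<Joined> = Vec::new();
    let mut right_matched = vec![false; right.len()];

    for &(lk, lv) in left {
        let mut matched = false;
        for (i, &(rk, rv)) in right.iter().enumerate() {
            if lk == rk {
                matched = true;
                right_matched[i] = true;
                out.push((lk, Some(lv), Some(rv)));
            }
        }
        // Left and outer joins keep unmatched left rows, padding the right side with null.
        if !matched && matches!(how, JoinType::Left | JoinType::Outer) {
            out.push((lk, Some(lv), None));
        }
    }

    // Right and outer joins keep unmatched right rows, padding the left side with null.
    if matches!(how, JoinType::Right | JoinType::Outer) {
        for (i, &(rk, rv)) in right.iter().enumerate() {
            if !right_matched[i] {
                out.push((rk, None, Some(rv)));
            }
        }
    }
    out
}
```

The nested loop is quadratic and only meant to make the semantics visible; a real engine would use a hash or sort-merge join.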

  10. String functions ✅ (completed)

  11. upper(), lower(), substring() (1-based), concat(), concat_ws()
  12. ✅ Parity fixtures: string_upper_lower, string_substring, string_concat
  13. Expand built-in functions (date/math) with explicit PySpark semantics.
  14. Add additional type coercion and null-handling edge cases as fixtures.
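
The 1-based indexing and null handling are where these functions most often diverge from native Rust string handling. The sketch below is std-only and illustrative: the free functions are hypothetical stand-ins, not the crate's Column API. `substring` covers positive positions only, and the null rules shown (concat propagates null, concat_ws skips it) reflect PySpark behavior as I understand it.

```rust
// Std-only sketch; these free functions are hypothetical, not the crate's API.

// PySpark-style substring: `pos` is 1-based and counts characters, not bytes.
// Assumption: only pos >= 1 is modeled (pos = 0 is treated like 1 here).
fn substring(s: &str, pos: usize, len: usize) -> String {
    s.chars().skip(pos.saturating_sub(1)).take(len).collect()
}

// concat yields null (None) as soon as any input is null.
fn concat(parts: &[Option<&str>]) -> Option<String> {
    parts
        .iter()
        .copied()
        .collect::<Option<Vec<&str>>>()
        .map(|v| v.concat())
}

// concat_ws skips null inputs instead of propagating them.
fn concat_ws(sep: &str, parts: &[Option<&str>]) -> String {
    parts
        .iter()
        .flatten()
        .copied()
        .collect::<Vec<&str>>()
        .join(sep)
}
```

For instance, `substring("hello", 2, 3)` starts at the second character and yields `"ell"`, whereas the 0-based Rust slice `&"hello"[2..5]` would yield `"llo"`.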

  15. Window functions ✅ (completed)

  16. Column::rank(), row_number(), dense_rank(), lag(), lead() with .over(partition_by)
  17. ✅ Parity fixtures: row_number_window, rank_window, lag_lead_window
  18. SparkSession::sql() implemented (optional sql feature); temp views and in-memory saved tables (saveAsTable, write_delta_table); catalog listTables, tableExists, dropTempView, dropTable; see QUICKSTART.md, EMBEDDING.md.
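
The difference between the ranking functions listed above is easiest to see on a single partition that is already sorted by the order key. The following std-only sketch uses hypothetical free functions over a slice; it is not how Column::rank() and friends are implemented.

```rust
// Std-only sketch over one partition, already sorted by the order key.
// These helpers are hypothetical and do not mirror the crate's Column API.

// row_number: consecutive numbers, ties broken by position.
fn row_number(sorted: &[i64]) -> Vec<usize> {
    (1..=sorted.len()).collect()
}

// rank: ties share a rank, and the next distinct value skips ahead (gaps).
fn rank(sorted: &[i64]) -> Vec<usize> {
    let mut out: Vec<usize> = Vec::with_capacity(sorted.len());
    for (i, v) in sorted.iter().enumerate() {
        if i > 0 && *v == sorted[i - 1] {
            let prev = out[i - 1];
            out.push(prev);
        } else {
            out.push(i + 1);
        }
    }
    out
}

// dense_rank: ties share a rank, and no gaps follow.
fn dense_rank(sorted: &[i64]) -> Vec<usize> {
    let mut out: Vec<usize> = Vec::with_capacity(sorted.len());
    let mut current: usize = 0;
    for (i, v) in sorted.iter().enumerate() {
        if i == 0 || *v != sorted[i - 1] {
            current += 1;
        }
        out.push(current);
    }
    out
}

// lag: value `offset` rows earlier in the partition, or null (None) if there is none.
fn lag(values: &[i64], offset: usize) -> Vec<Option<i64>> {
    (0..values.len())
        .map(|i| if i >= offset { Some(values[i - offset]) } else { None })
        .collect()
}
```

On the partition `[10, 10, 20, 30]`, `row_number` gives 1, 2, 3, 4, `rank` gives 1, 1, 3, 4, and `dense_rank` gives 1, 1, 2, 3.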

  19. Language bindings (out-of-tree)

  20. The legacy robin_sparkless Python module is removed. The supported Python integration for v4 is the in-repo python/ package (sparkless + sparkless-native). Other language bindings may live out-of-tree and call the Rust crate via FFI; see EMBEDDING.md.

  21. Phase 5 Test Conversion ✅ (completed)

  22. Fixture converter maps Sparkless expected_outputs to robin-sparkless format (join, window, withColumn, union, distinct, drop, dropna, fillna, limit, withColumnRenamed, etc.).
  23. Parity discovers tests/fixtures/ and tests/fixtures/converted/; optional skip: true in fixtures.
  24. make sparkless-parity runs the parity suite (set SPARKLESS_EXPECTED_OUTPUTS to run the converter first). 159 hand-written fixtures pass; array_distinct and with_curdate_now are skipped.
  25. See CONVERTER_STATUS.md, SPARKLESS_PARITY_STATUS.md.

  26. Phase 6 Broad Function Parity (partial) ✅

  27. Array: array_size/size, array_contains, element_at, explode, array_sort, array_join, array_slice; implemented (via Polars list.eval): array_position, array_remove, posexplode, array_exists, array_forall, array_filter, array_transform, array_sum, array_mean; Phase 8: array_repeat, array_flatten implemented via map UDFs. Fixtures: array_contains, element_at, array_size, array_sum.
  28. Window: first_value, last_value, percent_rank, cume_dist, ntile, nth_value with .over(); parity fixtures for all (percent_rank/cume_dist/ntile/nth_value via multi-step workaround).
  29. String: regexp_extract_all, regexp_like; Phase 10: mask, translate, substring_index; Phase 8: soundex, levenshtein, crc32, xxhash64 implemented via map UDFs.
  30. Phase 10: JSON get_json_object, from_json, to_json. Phase 8: Map create_map, map_keys, map_values, map_entries, map_from_arrays implemented. See PYSPARK_DIFFERENCES.md.
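
For reference, the distribution-style window functions named above can be written directly from their usual SQL definitions. This std-only sketch works on one sorted partition and uses hypothetical helper names; it reflects the Spark definitions as I understand them, not the multi-step workaround used in the parity fixtures.

```rust
// Std-only sketch on one sorted partition; helper names are hypothetical.

// ntile: split n_rows into `buckets` groups; the first (n_rows % buckets) groups get one extra row.
fn ntile(n_rows: usize, buckets: usize) -> Vec<usize> {
    assert!(buckets > 0, "ntile needs at least one bucket");
    let base = n_rows / buckets;
    let extra = n_rows % buckets;
    let mut out: Vec<usize> = Vec::with_capacity(n_rows);
    for b in 0..buckets {
        let size = base + usize::from(b < extra);
        out.extend(std::iter::repeat(b + 1).take(size));
    }
    out
}

// percent_rank: (rank - 1) / (n - 1), defined as 0 for a single-row partition.
fn percent_rank(sorted: &[i64]) -> Vec<f64> {
    let n = sorted.len();
    let mut out: Vec<f64> = Vec::with_capacity(n);
    let mut rank: usize = 1;
    for (i, v) in sorted.iter().enumerate() {
        if i > 0 && *v != sorted[i - 1] {
            rank = i + 1;
        }
        out.push(if n > 1 {
            (rank - 1) as f64 / (n - 1) as f64
        } else {
            0.0
        });
    }
    out
}

// cume_dist: fraction of rows whose value is less than or equal to the current row's value.
fn cume_dist(sorted: &[i64]) -> Vec<f64> {
    let n = sorted.len() as f64;
    sorted
        .iter()
        .map(|v| sorted.iter().filter(|x| *x <= v).count() as f64 / n)
        .collect()
}
```

With five rows and two buckets, `ntile(5, 2)` assigns the first three rows to bucket 1 and the remaining two to bucket 2.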

  31. Phase 7 SQL & Advanced ✅ (completed)

  32. Optional SQL (sql feature): SparkSession::sql(query), temp views and in-memory tables (create_or_replace_temp_view, table(name), df.write().saveAsTable(name, mode), write_delta_table(name)); catalog listTables, tableExists, dropTempView, dropTable; read_delta(name_or_path) (path → Delta on disk, name → in-memory table); sqlparser → DataFrame ops (SELECT, FROM, WHERE, JOIN, GROUP BY, ORDER BY, LIMIT).
  33. Optional Delta Lake (delta feature): read_delta, read_delta_with_version (time travel), write_delta (overwrite/append) via delta-rs.
  34. Performance: cargo bench (criterion) compares robin-sparkless vs Polars; target within ~2x. Error messages improved; Troubleshooting in QUICKSTART.md.

  35. Path to 100% before Sparkless integration (ROADMAP.md Phases 12–21)

  36. Phase 12 ✅ (completed): DataFrame methods parity. Implemented sample, random_split, first, head, take, tail, is_empty, to_df, stat (cov/corr), summary, to_json, explain, print_schema, checkpoint, local_checkpoint, repartition, coalesce, select_expr, col_regex, with_columns, with_columns_renamed, na (fill/drop), to_pandas, offset, transform, except_all, intersect_all; freq_items, approx_quantile, crosstab, melt (full implementations in Rust); sample_by (stratified sampling); Spark no-ops (hint, is_local, input_files, same_semantics, semantic_hash, observe, with_watermark). Parity fixtures: first_row, head_n, offset_n. Method count grew from ~35 to ~55+.
  37. Phase 13 ✅ (completed): Functions batch 1 — string: ascii, format_number, overlay, position, char, chr; base64, unbase64; binary: sha1, sha2, md5 (hex string out); collection: array_compact. Parity: parse_with_column_expr branches and fixtures string_ascii, string_format_number (82 fixtures total). Dependencies: base64, sha1, sha2, md5 crates.
  38. Phase 14 ✅ (completed): Functions batch 2 — math: sin, cos, tan, asin, acos, atan, atan2, degrees, radians, signum (UDFs in udfs.rs); datetime: quarter, weekofyear, dayofweek, dayofyear, add_months, months_between, next_day (Polars dt + chrono UDFs); type/conditional: cast, try_cast (parse_type_name + strict_cast/cast), isnan, greatest, least (UDF apply_greatest2/apply_least2). Parity: parser branches for all; fixtures math_sin_cos, datetime_quarter_week (84 fixtures).
  39. Phase 15 ✅ (completed): Functions batch 3. Batch 1 (nvl, nvl2, substr, power, ln, ceiling, lcase, ucase, dayofmonth, to_degrees, to_radians, isnull, isnotnull), Batch 2 (left, right, replace, startswith, endswith, contains, like, ilike, rlike), Batch 3 (cosh, sinh, tanh, acosh, asinh, atanh, cbrt, expm1, log1p, log10, log2, rint, hypot), Batch 4 (array_distinct) implemented; parity fixtures 84 → 88. Gap list: PHASE15_GAP_LIST.md, GAP_ANALYSIS_SPARKLESS_3.28.md.
  40. Phase 16 ✅ (completed): String/regex functions regexp_count, regexp_instr, regexp_substr, split_part, find_in_set, format_string, printf. Parity fixtures: regexp_count, regexp_substr, regexp_instr, split_part, find_in_set, format_string (93 fixtures total; array_distinct skipped). See ROADMAP.md.
  41. Phase 17 ✅ (completed): Datetime/unix functions unix_timestamp, to_unix_timestamp, from_unixtime, make_date, timestamp_seconds, timestamp_millis, timestamp_micros, unix_date, date_from_unix_date; math: pmod, factorial. Parity fixtures: unix_timestamp, from_unixtime, make_date, timestamp_seconds, timestamp_millis, timestamp_micros, unix_date, date_from_unix_date, pmod, factorial (103 fixtures total).
  42. Phase 18 ✅ (completed): Array/map/struct functions array_append, array_prepend, array_insert, array_except/intersect/union, zip_with, map_concat, map_filter, map_zip_with, named_struct (124 fixtures).
  43. Phase 19 ✅ (completed): Aggregates (any_value, bool_and, bool_or, count_if, max_by, min_by, percentile, product, collect_list, collect_set), try_* (try_divide, try_add, try_subtract, try_multiply), misc (width_bucket, elt, bit_length, typeof). Parity fixtures: groupby_any_value, groupby_product, try_divide, width_bucket (128 fixtures).
  44. Phase 20 ✅ (completed): Ordering (asc, desc, nulls_first/last); aggregates (median, mode, stddev_pop, var_pop, try_sum, try_avg); numeric (bround, negate, positive, cot, csc, sec, e, pi). Parity: groupby_median, with_bround; order_by_exprs.
  45. Phase 21 ✅ (completed): String (btrim, locate, conv); binary (hex, unhex, bin, getbit); type (to_char, to_varchar, to_number, try_to_number, try_to_timestamp); array (arrays_overlap, arrays_zip, explode_outer, posexplode_outer, array_agg); map (str_to_map); struct (transform_keys, transform_values). Parity: with_btrim, with_hex, with_conv, with_str_to_map, arrays_overlap, arrays_zip (136 fixtures).
  46. Phase 23 ✅ (completed): JSON, URL, misc (isin, url_decode, url_encode, json_array_length, parse_url, hash, shift_left, shift_right, version, equal_null, stack). Parity: with_isin, with_url_decode, with_url_encode, json_array_length_test, with_hash, with_shift_left (159 fixtures).
  47. Phase 22 ✅ (completed): Datetime extensions — curdate, now, localtimestamp, date_diff, dateadd, datepart, extract, date_part, unix_micros/millis/seconds, dayname, weekday, make_timestamp, make_timestamp_ntz, make_interval, timestampadd, timestampdiff, days, hours, minutes, months, years, from_utc_timestamp, to_utc_timestamp, convert_timezone, current_timezone, to_timestamp. Parity: fixtures with_dayname, with_weekday, with_extract, with_unix_micros, make_timestamp_test, timestampadd_test, from_utc_timestamp_test.
  48. Phase 24 ✅ (completed): Bit ops, control (assert_true, raise_error), JVM stubs, rand/randn (real RNG, optional seed; per-row values when used in with_column/with_columns), AES crypto (aes_encrypt, aes_decrypt, try_aes_decrypt; AES-128-GCM).
  49. Phases 23–24: Full parity (JSON/CSV/URL, bit/control/JVM/random/crypto).
  50. Phase 25 ✅ (completed): Readiness for post-refactor merge. Covers the plan interpreter (execute_plan), expression interpreter (all scalar functions; serialized expr → Expr in src/plan/expr.rs), logical plan schema (LOGICAL_PLAN_FORMAT.md), 3 plan fixtures (tests/fixtures/plans/), and create_dataframe_from_rows (Rust). See READINESS_FOR_SPARKLESS_PLAN.md.
  51. Phase 26: Prepare and publish robin-sparkless as a Rust crate (crates.io, API stability, docs, release workflow).
  52. Phase 27: Sparkless integration (BackendFactory "robin", 200+ tests passing).
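
Several functions in the phases above differ from what Rust's operators do by default, which is why they need explicit fixtures. The std-only sketch below shows three of them. The function names match the Spark functions, but the signatures are hypothetical and the null-on-zero behavior of `pmod` is an assumption based on Spark's non-ANSI mode, not a statement about this crate's implementation.

```rust
// Std-only sketch; signatures are hypothetical and None models a SQL null.

// pmod: positive modulus. Rust's `%` keeps the sign of the dividend, so -7 % 3 == -1.
// Assumption: a zero divisor yields null, as in Spark's non-ANSI mode.
fn pmod(a: i64, n: i64) -> Option<i64> {
    if n == 0 {
        return None;
    }
    // Wrapping arithmetic mirrors JVM integer behavior and avoids overflow panics.
    let r = a.wrapping_rem(n);
    Some(if r < 0 { r.wrapping_add(n).wrapping_rem(n) } else { r })
}

// try_divide: null instead of an error (or infinity) when the divisor is zero.
fn try_divide(a: f64, b: f64) -> Option<f64> {
    if b == 0.0 {
        None
    } else {
        Some(a / b)
    }
}

// try_add: null instead of an error when the sum overflows.
fn try_add(a: i64, b: i64) -> Option<i64> {
    a.checked_add(b)
}
```

For example, `pmod(-7, 3)` returns `Some(2)` where plain `-7 % 3` would give `-1`, and `try_add(i64::MAX, 1)` returns `None` rather than panicking or wrapping.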

  53. Sparkless integration (in Sparkless repo, after Phase 17)

  54. Fixture converter: Sparkless expected_outputs/ JSON → robin-sparkless fixtures
  55. Structural alignment: service-style modules, trait-based backends, case sensitivity
  56. Function parity: use PYSPARK_FUNCTION_MATRIX as checklist