Skip to content

Full Parity Roadmap: Robin-Sparkless vs PySpark

A phased plan to achieve full API and behavioral parity between robin-sparkless and Apache PySpark. Based on GAP_ANALYSIS_PYSPARK_REPO.md, ROBIN_SPARKLESS_MISSING.md, and ROADMAP.md.


Current State (February 2026)

Area Robin-Sparkless PySpark Gap
Functions ~295+ implemented ~415 ~120 (many are aliases or param-name mismatches)
DataFrame methods ~80+ (Phase D: view, corr/cov, aliases, stubs) ~95 ~15
DataFrameReader spark.read().option/options/format/load/table/csv/parquet/json 12+ methods ~6 (jdbc, orc, text, schema full impl)
DataFrameWriter option/options/partition_by/parquet/csv/json, saveAsTable (in-memory) 16+ methods ~9 (bucketBy, insertInto, orc, text)
Column methods Many as module functions 17 in class Structural (robin uses F.xxx style)
GroupedData Strong 10 methods ~2
SparkSession Core 36 methods ~25
Catalog dropTempView, listTables, tableExists, dropTable (in-memory) 27 methods ~23 (createTable, getTable, etc. stubs)
Window Core 4 WindowSpec methods 4

Parity fixtures: 201 passing (11 skipped). Target: 200+ for full confidence. Phase-specific tests: make test-parity-phase-amake test-parity-phase-g; see PARITY_STATUS.md and tests/fixtures/phase_manifest.json.


Phases Overview

Phase Focus Est. Effort Priority
A Signature alignment (param names) 2–3 weeks High
B Missing high-value functions 3–4 weeks High
C DataFrameReader / DataFrameWriter parity 2–3 weeks High
D DataFrame method gaps 2–3 weeks Medium
E SparkSession & Catalog stubs 1–2 weeks Medium
F Behavioral alignment (diverges) 2–3 weeks Medium
G Parity fixture expansion 2–3 weeks High
H Deferred / optional (XML, UDF, streaming) Low

Phase A: Signature Alignment (2–3 weeks) — COMPLETED

Goal: Align parameter names and optional args so existing PySpark code works with minimal changes.

Scope: 205 functions/methods classified as "partial" (same logic, different param names).

Category PySpark style Robin style Action
Column arg col column Add col as alias or primary
Numeric args months, days n Add PySpark param names
Optional args errMsg=None (missing) Add optional params with defaults
Array/struct cols / *cols varargs Align signatures

Deliverables (done): - PyO3 bindings use col and PySpark camelCase param names (errMsg, fromBase, dayOfWeek, etc.) - Optional params (e.g. assert_true(col, errMsg=None)) with PySpark defaults - Gap analysis partial count reduced from 205 to 21 (exact: 199). See GAP_ANALYSIS_PYSPARK_REPO.md - Enhanced extract_robin_api_from_source.py parses #[pyo3(signature=...)] for accurate comparison - make gap-analysis-runtime target added for introspection-based gap analysis - Default normalization in gap analysis (0/'0', None/'None') for fair comparison - PyO3 signatures: bround(scale=0), locate(pos=1), btrim(trim=None), from_unixtime(format=None), overlay(len=-1) - PySpark param renames: conv(fromBase, toBase), convert_timezone(sourceTz, targetTz, sourceTs), sha2/shift_left/shift_right(numBits), months_between(roundOff), next_day(dayOfWeek), like/ilike(escapeChar), parse_url(partToExtract), assert_true/raise_error(errMsg), json_tuple(*fields), regexp_extract_all(str, regexp), make_timestamp(years, months, ...)


Phase B: Missing High-Value Functions (3–4 weeks) — COMPLETED

Goal: Implement functions that appear in PySpark gap analysis as "missing" but are commonly used.

Deliverables (done): - Tier 1 functions exposed in src/python/mod.rs: abs, date_add, date_sub, date_format, current_date, current_timestamp, char_length, character_length, date_trunc, array, array_contains, array_max, array_min, array_position, array_size, size, array_join, array_sort, cardinality - Aliases: ceil → ceiling, mean → avg, std → stddev, sign → signum - Varargs: array(*cols) via #[pyo3(signature = (*cols))] - aggregate(col, zero) exposed for array fold - plan/expr.rs: Added array, array_max, array_min, cardinality, char_length, character_length, date_trunc to expr_from_fn_rest - Parity fixtures: abs, array_from_cols, date_format, char_length, current_date_timestamp (skipped; non-deterministic) - date_format: Now accepts PySpark/Java SimpleDateFormat (e.g. yyyy-MM) via pyspark_format_to_chrono conversion - Gap analysis: missing reduced from 195 → 171; exact increased 199 → 214

Medium priority (deferred): - bucket, call_function (stub or narrow implementation) - cume_dist, percent_rank (window) — ensure full parity - window, window_time — thin wrappers over .over()


Phase C: DataFrameReader / DataFrameWriter Parity (2–3 weeks) — COMPLETED

Goal: Expose full Reader/Writer API so spark.read.* and df.write.* match PySpark.

Deliverables (done): - DataFrameReader (Rust): option, options, format, schema (stub), load, table, csv, parquet, json — options applied (header, inferSchema, sep, nullValue) - DataFrameWriter (Rust): option, options, partition_by, parquet, csv, json — format helpers; partitionBy stored (partitioned write stubbed) - PyDataFrameReader: read() on SparkSession; option, options, format, schema, load, table, csv, parquet, json - PyDataFrameWriter: option, options, partition_by, parquet, csv, json - Parity fixtures: read_csv_with_options (reader_options), read_table (table_source) - spark.read.csv(...), spark.read.option("header","true").csv(path), spark.read.format("parquet").load(path), spark.read.table("name") work - df.write.mode("overwrite").parquet(path), df.write.option("header","true").csv(path) work - saveAsTable(name, mode) implemented (in-memory; session-scoped). Stubs: jdbc, orc, text, bucketBy, insertInto (out of scope)


Phase D: DataFrame Method Gaps (2–3 weeks) — COMPLETED

Goal: Implement remaining DataFrame methods from gap analysis (52 missing).

Deliverables (done): - View methods: df.createOrReplaceTempView(name), createTempView, createGlobalTempView, createOrReplaceGlobalTempView — use default session from get_or_create - corr/cov: df.corr() (matrix), df.corr(col1, col2) (scalar), df.cov(col1, col2) (scalar) - Aliases: toDF, toJSON, toPandas + snake_case to_df, to_json, to_pandas - Exposed stubs: hint, repartitionByRange, sortWithinPartitions, sameSemantics, semanticHash, columns, cache, isLocal, inputFiles - writeTo stub: raises NotImplementedError (use df.write().parquet(path) instead) - checkpoint, local_checkpoint, observe, withWatermark already present


Phase E: SparkSession & Catalog Stubs (1–2 weeks) — COMPLETED

Goal: Expose SparkSession and Catalog methods for API compatibility.

Deliverables (done): - Catalog class: dropTempView, dropGlobalTempView, listTables(dbName=None) (names from temp views + saved tables), tableExists(tableName, dbName=None), dropTable(tableName) (in-memory saved tables only), currentDatabase, currentCatalog, listDatabases, listCatalogs (functional where supported); cacheTable, uncacheTable, clearCache, refreshTable, refreshByPath, recoverPartitions (no-op); createTable, createExternalTable, getDatabase, getFunction, getTable, registerFunction (NotImplementedError); databaseExists, functionExists, isCached, listColumns, listFunctions, setCurrentCatalog, setCurrentDatabase (stub/fixed). - RuntimeConfig (spark.conf): get, getAll, set (NotImplementedError), isModifiable. - SparkSession: catalog(), conf(), newSession(), stop(), range(end) / range(start, end, step), version, udf() (NotImplementedError), getActiveSession(), getDefaultSession() (classmethods). - Rust session: drop_temp_view, drop_global_temp_view, table_exists, list_temp_view_names, list_table_names, drop_table, range(start, end, step). - Tests: test_phase_e_spark_session_catalog.


Phase F: Behavioral Alignment (2–3 weeks) — COMPLETED

Goal: Reduce semantic divergences documented in PYSPARK_DIFFERENCES.md.

Deliverables (done): - assert_true: Aligned with PySpark — fails on false or null; returns null on success. - raise_error: Error message uses user message directly (no prefix). - rand/randn: Documented; per-row when used via with_column/with_columns. - from_utc_timestamp/to_utc_timestamp: Documented (identity for UTC; full conversion deferred). - unix_timestamp/from_unixtime: Documented timezone assumptions. - AES: Documented mode/padding (GCM only; CBC etc. not supported). - Fixtures: assert_true.json, assert_true_err_msg.json (void/null); assert_true_null_fails, assert_true_false_fails (skipped).


Phase G: Parity Fixture Expansion (2–3 weeks) — COMPLETED

Goal: Grow parity coverage from 159 to 200+ fixtures.

Actions: - Convert more Sparkless expected_outputs via convert_sparkless_fixtures.py - Add fixtures for new functions (Phase B) - Add fixtures for Reader/Writer options (Phase C) - Add fixtures for signature alignment (Phase A) - Run make sparkless-parity in CI

Deliverables (done): - 201 parity fixtures passing (40+ hand-written fixtures added: filter_age_lt_25, filter_name_eq, select_single_column, groupby_count_desc, limit_one/three, orderby_desc, with_column_lit, distinct_all, fillna_simple, filter_then_select, groupby_sum_simple, filter_ge/le/ne, filter_or_simple, filter_eq_lit, select_reorder, cross_join_small, union_three_rows, and more) - CI runs full parity suite (cargo nextest run --test parity)


Phase H: Deferred / Optional — COMPLETED (documented)

Explicitly out of scope for full parity (document only): - XML / XPath: from_xml, to_xml, schema_of_xml, xpath* — require XML parser - UDF / UDTF: Full PySpark-style pandas_udf decorator (all function types), udtf, and advanced UDF integration in SQL/streaming. Scalar UDFs (Rust and Python), column-wise vectorized UDFs in withColumn/select, and grouped vectorized aggregation UDFs via pandas_udf(..., function_type="grouped_agg") in groupBy().agg(...) are implemented; everything else in this space is documented as optional/deferred. - Streaming: withWatermark, session_window — no streaming execution - Sketch aggregates: HLL, count-min sketch — optional - RDD / distributed: rdd, foreach, mapInPandas — eager execution only - Catalog DDL: CREATE TABLE, etc. — use write to path

Deliverables (done): DEFERRED_SCOPE.md created with rationale and workarounds for each category; cross-references added in PYSPARK_DIFFERENCES, README, ROBIN_SPARKLESS_MISSING.


Success Metrics

Metric Current Target
Parity fixtures passing 201 200+
Functions (exact + partial) ~220 ~380+
DataFrame methods ~68 ~90
DataFrameReader methods Partial 12
DataFrameWriter methods Partial 16
Signature "exact" count 15 100+
GAP_ANALYSIS "missing" count 195 <50

Dependencies


Execution Order

  1. Phase A (signature alignment) — enables drop-in compatibility
  2. Phase G (fixtures) — run in parallel; expand as phases complete
  3. Phase B (missing functions) — highest user impact
  4. Phase C (Reader/Writer) — completes IO surface
  5. Phase D (DataFrame methods) — fills remaining DataFrame gaps
  6. Phase E (SparkSession/Catalog) — API completeness
  7. Phase F (behavioral alignment) — polish

Total estimated effort: 14–21 weeks for Phases A–G (excluding deferred Phase H).