Full Parity Roadmap: Robin-Sparkless vs PySpark¶

A phased plan to achieve full API and behavioral parity between robin-sparkless and Apache PySpark. Based on GAP_ANALYSIS_PYSPARK_REPO.md, ROBIN_SPARKLESS_MISSING.md, and ROADMAP.md.

Current State (February 2026)¶

Area	Robin-Sparkless	PySpark	Gap
Functions	~295+ implemented	~415	~120 (many are aliases or param-name mismatches)
DataFrame methods	~80+ (Phase D: view, corr/cov, aliases, stubs)	~95	~15
DataFrameReader	spark.read().option/options/format/load/table/csv/parquet/json	12+ methods	~6 (jdbc, orc, text, schema full impl)
DataFrameWriter	option/options/partition_by/parquet/csv/json, saveAsTable (in-memory)	16+ methods	~9 (bucketBy, insertInto, orc, text)
Column methods	Many as module functions	17 in class	Structural (robin uses F.xxx style)
GroupedData	Strong	10 methods	~2
SparkSession	Core	36 methods	~25
Catalog	dropTempView, listTables, tableExists, dropTable (in-memory)	27 methods	~23 (createTable, getTable, etc. stubs)
Window	Core	4 WindowSpec methods	4

Parity fixtures: 201 passing (11 skipped). Target: 200+ for full confidence. Phase-specific tests: make test-parity-phase-a … make test-parity-phase-g; see PARITY_STATUS.md and tests/fixtures/phase_manifest.json.

Phases Overview¶

Phase	Focus	Est. Effort	Priority
A	Signature alignment (param names)	2–3 weeks	High
B	Missing high-value functions	3–4 weeks	High
C	DataFrameReader / DataFrameWriter parity	2–3 weeks	High
D	DataFrame method gaps	2–3 weeks	Medium
E	SparkSession & Catalog stubs	1–2 weeks	Medium
F	Behavioral alignment (diverges)	2–3 weeks	Medium
G	Parity fixture expansion	2–3 weeks	High
H	Deferred / optional (XML, UDF, streaming)	—	Low

Phase A: Signature Alignment (2–3 weeks) — COMPLETED¶

Goal: Align parameter names and optional args so existing PySpark code works with minimal changes.

Scope: 205 functions/methods classified as "partial" (same logic, different param names).

Category	PySpark style	Robin style	Action
Column arg	`col`	`column`	Add `col` as alias or primary
Numeric args	`months`, `days`	`n`	Add PySpark param names
Optional args	`errMsg=None`	(missing)	Add optional params with defaults
Array/struct	`cols` / `*cols`	varargs	Align signatures

Deliverables (done): - PyO3 bindings use col and PySpark camelCase param names (errMsg, fromBase, dayOfWeek, etc.) - Optional params (e.g. assert_true(col, errMsg=None)) with PySpark defaults - Gap analysis partial count reduced from 205 to 21 (exact: 199). See GAP_ANALYSIS_PYSPARK_REPO.md - Enhanced extract_robin_api_from_source.py parses #[pyo3(signature=...)] for accurate comparison - make gap-analysis-runtime target added for introspection-based gap analysis - Default normalization in gap analysis (0/'0', None/'None') for fair comparison - PyO3 signatures: bround(scale=0), locate(pos=1), btrim(trim=None), from_unixtime(format=None), overlay(len=-1) - PySpark param renames: conv(fromBase, toBase), convert_timezone(sourceTz, targetTz, sourceTs), sha2/shift_left/shift_right(numBits), months_between(roundOff), next_day(dayOfWeek), like/ilike(escapeChar), parse_url(partToExtract), assert_true/raise_error(errMsg), json_tuple(*fields), regexp_extract_all(str, regexp), make_timestamp(years, months, ...)

Phase B: Missing High-Value Functions (3–4 weeks) — COMPLETED¶

Goal: Implement functions that appear in PySpark gap analysis as "missing" but are commonly used.

Deliverables (done): - Tier 1 functions exposed in src/python/mod.rs: abs, date_add, date_sub, date_format, current_date, current_timestamp, char_length, character_length, date_trunc, array, array_contains, array_max, array_min, array_position, array_size, size, array_join, array_sort, cardinality - Aliases: ceil → ceiling, mean → avg, std → stddev, sign → signum - Varargs: array(*cols) via #[pyo3(signature = (*cols))] - aggregate(col, zero) exposed for array fold - plan/expr.rs: Added array, array_max, array_min, cardinality, char_length, character_length, date_trunc to expr_from_fn_rest - Parity fixtures: abs, array_from_cols, date_format, char_length, current_date_timestamp (skipped; non-deterministic) - date_format: Now accepts PySpark/Java SimpleDateFormat (e.g. yyyy-MM) via pyspark_format_to_chrono conversion - Gap analysis: missing reduced from 195 → 171; exact increased 199 → 214

Medium priority (deferred): - bucket, call_function (stub or narrow implementation) - cume_dist, percent_rank (window) — ensure full parity - window, window_time — thin wrappers over .over()

Phase C: DataFrameReader / DataFrameWriter Parity (2–3 weeks) — COMPLETED¶

Goal: Expose full Reader/Writer API so spark.read.* and df.write.* match PySpark.

Deliverables (done): - DataFrameReader (Rust): option, options, format, schema (stub), load, table, csv, parquet, json — options applied (header, inferSchema, sep, nullValue) - DataFrameWriter (Rust): option, options, partition_by, parquet, csv, json — format helpers; partitionBy stored (partitioned write stubbed) - PyDataFrameReader: read() on SparkSession; option, options, format, schema, load, table, csv, parquet, json - PyDataFrameWriter: option, options, partition_by, parquet, csv, json - Parity fixtures: read_csv_with_options (reader_options), read_table (table_source) - spark.read.csv(...), spark.read.option("header","true").csv(path), spark.read.format("parquet").load(path), spark.read.table("name") work - df.write.mode("overwrite").parquet(path), df.write.option("header","true").csv(path) work - saveAsTable(name, mode) implemented (in-memory; session-scoped). Stubs: jdbc, orc, text, bucketBy, insertInto (out of scope)

Phase D: DataFrame Method Gaps (2–3 weeks) — COMPLETED¶

Goal: Implement remaining DataFrame methods from gap analysis (52 missing).

Deliverables (done): - View methods: df.createOrReplaceTempView(name), createTempView, createGlobalTempView, createOrReplaceGlobalTempView — use default session from get_or_create - corr/cov: df.corr() (matrix), df.corr(col1, col2) (scalar), df.cov(col1, col2) (scalar) - Aliases: toDF, toJSON, toPandas + snake_case to_df, to_json, to_pandas - Exposed stubs: hint, repartitionByRange, sortWithinPartitions, sameSemantics, semanticHash, columns, cache, isLocal, inputFiles - writeTo stub: raises NotImplementedError (use df.write().parquet(path) instead) - checkpoint, local_checkpoint, observe, withWatermark already present

Phase E: SparkSession & Catalog Stubs (1–2 weeks) — COMPLETED¶

Goal: Expose SparkSession and Catalog methods for API compatibility.

Deliverables (done): - Catalog class: dropTempView, dropGlobalTempView, listTables(dbName=None) (names from temp views + saved tables), tableExists(tableName, dbName=None), dropTable(tableName) (in-memory saved tables only), currentDatabase, currentCatalog, listDatabases, listCatalogs (functional where supported); cacheTable, uncacheTable, clearCache, refreshTable, refreshByPath, recoverPartitions (no-op); createTable, createExternalTable, getDatabase, getFunction, getTable, registerFunction (NotImplementedError); databaseExists, functionExists, isCached, listColumns, listFunctions, setCurrentCatalog, setCurrentDatabase (stub/fixed). - RuntimeConfig (spark.conf): get, getAll, set (NotImplementedError), isModifiable. - SparkSession: catalog(), conf(), newSession(), stop(), range(end) / range(start, end, step), version, udf() (NotImplementedError), getActiveSession(), getDefaultSession() (classmethods). - Rust session: drop_temp_view, drop_global_temp_view, table_exists, list_temp_view_names, list_table_names, drop_table, range(start, end, step). - Tests: test_phase_e_spark_session_catalog.

Phase F: Behavioral Alignment (2–3 weeks) — COMPLETED¶

Goal: Reduce semantic divergences documented in PYSPARK_DIFFERENCES.md.

Deliverables (done): - assert_true: Aligned with PySpark — fails on false or null; returns null on success. - raise_error: Error message uses user message directly (no prefix). - rand/randn: Documented; per-row when used via with_column/with_columns. - from_utc_timestamp/to_utc_timestamp: Documented (identity for UTC; full conversion deferred). - unix_timestamp/from_unixtime: Documented timezone assumptions. - AES: Documented mode/padding (GCM only; CBC etc. not supported). - Fixtures: assert_true.json, assert_true_err_msg.json (void/null); assert_true_null_fails, assert_true_false_fails (skipped).

Phase G: Parity Fixture Expansion (2–3 weeks) — COMPLETED¶

Goal: Grow parity coverage from 159 to 200+ fixtures.

Actions: - Convert more Sparkless expected_outputs via convert_sparkless_fixtures.py - Add fixtures for new functions (Phase B) - Add fixtures for Reader/Writer options (Phase C) - Add fixtures for signature alignment (Phase A) - Run make sparkless-parity in CI

Deliverables (done): - 201 parity fixtures passing (40+ hand-written fixtures added: filter_age_lt_25, filter_name_eq, select_single_column, groupby_count_desc, limit_one/three, orderby_desc, with_column_lit, distinct_all, fillna_simple, filter_then_select, groupby_sum_simple, filter_ge/le/ne, filter_or_simple, filter_eq_lit, select_reorder, cross_join_small, union_three_rows, and more) - CI runs full parity suite (cargo nextest run --test parity)

Phase H: Deferred / Optional — COMPLETED (documented)¶

Explicitly out of scope for full parity (document only): - XML / XPath: from_xml, to_xml, schema_of_xml, xpath* — require XML parser - UDF / UDTF: Full PySpark-style pandas_udf decorator (all function types), udtf, and advanced UDF integration in SQL/streaming. Scalar UDFs (Rust and Python), column-wise vectorized UDFs in withColumn/select, and grouped vectorized aggregation UDFs via pandas_udf(..., function_type="grouped_agg") in groupBy().agg(...) are implemented; everything else in this space is documented as optional/deferred. - Streaming: withWatermark, session_window — no streaming execution - Sketch aggregates: HLL, count-min sketch — optional - RDD / distributed: rdd, foreach, mapInPandas — eager execution only - Catalog DDL: CREATE TABLE, etc. — use write to path

Deliverables (done): DEFERRED_SCOPE.md created with rationale and workarounds for each category; cross-references added in PYSPARK_DIFFERENCES, README, ROBIN_SPARKLESS_MISSING.

Success Metrics¶

Metric	Current	Target
Parity fixtures passing	201	200+
Functions (exact + partial)	~220	~380+
DataFrame methods	~68	~90
DataFrameReader methods	Partial	12
DataFrameWriter methods	Partial	16
Signature "exact" count	15	100+
GAP_ANALYSIS "missing" count	195	<50

Dependencies¶

GAP_ANALYSIS_PYSPARK_REPO.md — gap matrix (regenerate with make gap-analysis)
DEFERRED_SCOPE.md — Phase H: deferred/optional scope (XML, UDF, streaming, RDD, etc.)
PYSPARK_DIFFERENCES.md — semantic divergences
ROBIN_SPARKLESS_MISSING.md — canonical missing list
PARITY_STATUS.md — fixture coverage and phase test coverage

Execution Order¶

Phase A (signature alignment) — enables drop-in compatibility
Phase G (fixtures) — run in parallel; expand as phases complete
Phase B (missing functions) — highest user impact
Phase C (Reader/Writer) — completes IO surface
Phase D (DataFrame methods) — fills remaining DataFrame gaps
Phase E (SparkSession/Catalog) — API completeness
Phase F (behavioral alignment) — polish

Total estimated effort: 14–21 weeks for Phases A–G (excluding deferred Phase H).