Full Parity Roadmap: Robin-Sparkless vs PySpark¶
A phased plan to achieve full API and behavioral parity between robin-sparkless and Apache PySpark. Based on GAP_ANALYSIS_PYSPARK_REPO.md, ROBIN_SPARKLESS_MISSING.md, and ROADMAP.md.
Current State (February 2026)¶
| Area | Robin-Sparkless | PySpark | Gap |
|---|---|---|---|
| Functions | ~295+ implemented | ~415 | ~120 (many are aliases or param-name mismatches) |
| DataFrame methods | ~80+ (Phase D: view, corr/cov, aliases, stubs) | ~95 | ~15 |
| DataFrameReader | spark.read().option/options/format/load/table/csv/parquet/json | 12+ methods | ~6 (jdbc, orc, text, schema full impl) |
| DataFrameWriter | option/options/partition_by/parquet/csv/json, saveAsTable (in-memory) | 16+ methods | ~9 (bucketBy, insertInto, orc, text) |
| Column methods | Many as module functions | 17 in class | Structural (robin uses F.xxx style) |
| GroupedData | Strong | 10 methods | ~2 |
| SparkSession | Core | 36 methods | ~25 |
| Catalog | dropTempView, listTables, tableExists, dropTable (in-memory) | 27 methods | ~23 (createTable, getTable, etc. stubs) |
| Window | Core | 4 WindowSpec methods | 4 |
Parity fixtures: 201 passing (11 skipped). Target: 200+ for full confidence. Phase-specific tests: make test-parity-phase-a … make test-parity-phase-g; see PARITY_STATUS.md and tests/fixtures/phase_manifest.json.
Phases Overview¶
| Phase | Focus | Est. Effort | Priority |
|---|---|---|---|
| A | Signature alignment (param names) | 2–3 weeks | High |
| B | Missing high-value functions | 3–4 weeks | High |
| C | DataFrameReader / DataFrameWriter parity | 2–3 weeks | High |
| D | DataFrame method gaps | 2–3 weeks | Medium |
| E | SparkSession & Catalog stubs | 1–2 weeks | Medium |
| F | Behavioral alignment (diverges) | 2–3 weeks | Medium |
| G | Parity fixture expansion | 2–3 weeks | High |
| H | Deferred / optional (XML, UDF, streaming) | — | Low |
Phase A: Signature Alignment (2–3 weeks) — COMPLETED¶
Goal: Align parameter names and optional args so existing PySpark code works with minimal changes.
Scope: 205 functions/methods classified as "partial" (same logic, different param names).
| Category | PySpark style | Robin style | Action |
|---|---|---|---|
| Column arg | col |
column |
Add col as alias or primary |
| Numeric args | months, days |
n |
Add PySpark param names |
| Optional args | errMsg=None |
(missing) | Add optional params with defaults |
| Array/struct | cols / *cols |
varargs | Align signatures |
Deliverables (done):
- PyO3 bindings use col and PySpark camelCase param names (errMsg, fromBase, dayOfWeek, etc.)
- Optional params (e.g. assert_true(col, errMsg=None)) with PySpark defaults
- Gap analysis partial count reduced from 205 to 21 (exact: 199). See GAP_ANALYSIS_PYSPARK_REPO.md
- Enhanced extract_robin_api_from_source.py parses #[pyo3(signature=...)] for accurate comparison
- make gap-analysis-runtime target added for introspection-based gap analysis
- Default normalization in gap analysis (0/'0', None/'None') for fair comparison
- PyO3 signatures: bround(scale=0), locate(pos=1), btrim(trim=None), from_unixtime(format=None), overlay(len=-1)
- PySpark param renames: conv(fromBase, toBase), convert_timezone(sourceTz, targetTz, sourceTs), sha2/shift_left/shift_right(numBits), months_between(roundOff), next_day(dayOfWeek), like/ilike(escapeChar), parse_url(partToExtract), assert_true/raise_error(errMsg), json_tuple(*fields), regexp_extract_all(str, regexp), make_timestamp(years, months, ...)
Phase B: Missing High-Value Functions (3–4 weeks) — COMPLETED¶
Goal: Implement functions that appear in PySpark gap analysis as "missing" but are commonly used.
Deliverables (done):
- Tier 1 functions exposed in src/python/mod.rs: abs, date_add, date_sub, date_format, current_date, current_timestamp, char_length, character_length, date_trunc, array, array_contains, array_max, array_min, array_position, array_size, size, array_join, array_sort, cardinality
- Aliases: ceil → ceiling, mean → avg, std → stddev, sign → signum
- Varargs: array(*cols) via #[pyo3(signature = (*cols))]
- aggregate(col, zero) exposed for array fold
- plan/expr.rs: Added array, array_max, array_min, cardinality, char_length, character_length, date_trunc to expr_from_fn_rest
- Parity fixtures: abs, array_from_cols, date_format, char_length, current_date_timestamp (skipped; non-deterministic)
- date_format: Now accepts PySpark/Java SimpleDateFormat (e.g. yyyy-MM) via pyspark_format_to_chrono conversion
- Gap analysis: missing reduced from 195 → 171; exact increased 199 → 214
Medium priority (deferred):
- bucket, call_function (stub or narrow implementation)
- cume_dist, percent_rank (window) — ensure full parity
- window, window_time — thin wrappers over .over()
Phase C: DataFrameReader / DataFrameWriter Parity (2–3 weeks) — COMPLETED¶
Goal: Expose full Reader/Writer API so spark.read.* and df.write.* match PySpark.
Deliverables (done):
- DataFrameReader (Rust): option, options, format, schema (stub), load, table, csv, parquet, json — options applied (header, inferSchema, sep, nullValue)
- DataFrameWriter (Rust): option, options, partition_by, parquet, csv, json — format helpers; partitionBy stored (partitioned write stubbed)
- PyDataFrameReader: read() on SparkSession; option, options, format, schema, load, table, csv, parquet, json
- PyDataFrameWriter: option, options, partition_by, parquet, csv, json
- Parity fixtures: read_csv_with_options (reader_options), read_table (table_source)
- spark.read.csv(...), spark.read.option("header","true").csv(path), spark.read.format("parquet").load(path), spark.read.table("name") work
- df.write.mode("overwrite").parquet(path), df.write.option("header","true").csv(path) work
- saveAsTable(name, mode) implemented (in-memory; session-scoped). Stubs: jdbc, orc, text, bucketBy, insertInto (out of scope)
Phase D: DataFrame Method Gaps (2–3 weeks) — COMPLETED¶
Goal: Implement remaining DataFrame methods from gap analysis (52 missing).
Deliverables (done):
- View methods: df.createOrReplaceTempView(name), createTempView, createGlobalTempView, createOrReplaceGlobalTempView — use default session from get_or_create
- corr/cov: df.corr() (matrix), df.corr(col1, col2) (scalar), df.cov(col1, col2) (scalar)
- Aliases: toDF, toJSON, toPandas + snake_case to_df, to_json, to_pandas
- Exposed stubs: hint, repartitionByRange, sortWithinPartitions, sameSemantics, semanticHash, columns, cache, isLocal, inputFiles
- writeTo stub: raises NotImplementedError (use df.write().parquet(path) instead)
- checkpoint, local_checkpoint, observe, withWatermark already present
Phase E: SparkSession & Catalog Stubs (1–2 weeks) — COMPLETED¶
Goal: Expose SparkSession and Catalog methods for API compatibility.
Deliverables (done):
- Catalog class: dropTempView, dropGlobalTempView, listTables(dbName=None) (names from temp views + saved tables), tableExists(tableName, dbName=None), dropTable(tableName) (in-memory saved tables only), currentDatabase, currentCatalog, listDatabases, listCatalogs (functional where supported); cacheTable, uncacheTable, clearCache, refreshTable, refreshByPath, recoverPartitions (no-op); createTable, createExternalTable, getDatabase, getFunction, getTable, registerFunction (NotImplementedError); databaseExists, functionExists, isCached, listColumns, listFunctions, setCurrentCatalog, setCurrentDatabase (stub/fixed).
- RuntimeConfig (spark.conf): get, getAll, set (NotImplementedError), isModifiable.
- SparkSession: catalog(), conf(), newSession(), stop(), range(end) / range(start, end, step), version, udf() (NotImplementedError), getActiveSession(), getDefaultSession() (classmethods).
- Rust session: drop_temp_view, drop_global_temp_view, table_exists, list_temp_view_names, list_table_names, drop_table, range(start, end, step).
- Tests: test_phase_e_spark_session_catalog.
Phase F: Behavioral Alignment (2–3 weeks) — COMPLETED¶
Goal: Reduce semantic divergences documented in PYSPARK_DIFFERENCES.md.
Deliverables (done):
- assert_true: Aligned with PySpark — fails on false or null; returns null on success.
- raise_error: Error message uses user message directly (no prefix).
- rand/randn: Documented; per-row when used via with_column/with_columns.
- from_utc_timestamp/to_utc_timestamp: Documented (identity for UTC; full conversion deferred).
- unix_timestamp/from_unixtime: Documented timezone assumptions.
- AES: Documented mode/padding (GCM only; CBC etc. not supported).
- Fixtures: assert_true.json, assert_true_err_msg.json (void/null); assert_true_null_fails, assert_true_false_fails (skipped).
Phase G: Parity Fixture Expansion (2–3 weeks) — COMPLETED¶
Goal: Grow parity coverage from 159 to 200+ fixtures.
Actions:
- Convert more Sparkless expected_outputs via convert_sparkless_fixtures.py
- Add fixtures for new functions (Phase B)
- Add fixtures for Reader/Writer options (Phase C)
- Add fixtures for signature alignment (Phase A)
- Run make sparkless-parity in CI
Deliverables (done): - 201 parity fixtures passing (40+ hand-written fixtures added: filter_age_lt_25, filter_name_eq, select_single_column, groupby_count_desc, limit_one/three, orderby_desc, with_column_lit, distinct_all, fillna_simple, filter_then_select, groupby_sum_simple, filter_ge/le/ne, filter_or_simple, filter_eq_lit, select_reorder, cross_join_small, union_three_rows, and more) - CI runs full parity suite (cargo nextest run --test parity)
Phase H: Deferred / Optional — COMPLETED (documented)¶
Explicitly out of scope for full parity (document only):
- XML / XPath: from_xml, to_xml, schema_of_xml, xpath* — require XML parser
- UDF / UDTF: Full PySpark-style pandas_udf decorator (all function types), udtf, and advanced UDF integration in SQL/streaming. Scalar UDFs (Rust and Python), column-wise vectorized UDFs in withColumn/select, and grouped vectorized aggregation UDFs via pandas_udf(..., function_type="grouped_agg") in groupBy().agg(...) are implemented; everything else in this space is documented as optional/deferred.
- Streaming: withWatermark, session_window — no streaming execution
- Sketch aggregates: HLL, count-min sketch — optional
- RDD / distributed: rdd, foreach, mapInPandas — eager execution only
- Catalog DDL: CREATE TABLE, etc. — use write to path
Deliverables (done): DEFERRED_SCOPE.md created with rationale and workarounds for each category; cross-references added in PYSPARK_DIFFERENCES, README, ROBIN_SPARKLESS_MISSING.
Success Metrics¶
| Metric | Current | Target |
|---|---|---|
| Parity fixtures passing | 201 | 200+ |
| Functions (exact + partial) | ~220 | ~380+ |
| DataFrame methods | ~68 | ~90 |
| DataFrameReader methods | Partial | 12 |
| DataFrameWriter methods | Partial | 16 |
| Signature "exact" count | 15 | 100+ |
| GAP_ANALYSIS "missing" count | 195 | <50 |
Dependencies¶
- GAP_ANALYSIS_PYSPARK_REPO.md — gap matrix (regenerate with
make gap-analysis) - DEFERRED_SCOPE.md — Phase H: deferred/optional scope (XML, UDF, streaming, RDD, etc.)
- PYSPARK_DIFFERENCES.md — semantic divergences
- ROBIN_SPARKLESS_MISSING.md — canonical missing list
- PARITY_STATUS.md — fixture coverage and phase test coverage
Execution Order¶
- Phase A (signature alignment) — enables drop-in compatibility
- Phase G (fixtures) — run in parallel; expand as phases complete
- Phase B (missing functions) — highest user impact
- Phase C (Reader/Writer) — completes IO surface
- Phase D (DataFrame methods) — fills remaining DataFrame gaps
- Phase E (SparkSession/Catalog) — API completeness
- Phase F (behavioral alignment) — polish
Total estimated effort: 14–21 weeks for Phases A–G (excluding deferred Phase H).