Full Sparkless Backend Roadmap¶

This document plans the path for robin-sparkless to become a complete backend replacement for Sparkless. Sparkless implements 403+ PySpark functions and 85+ DataFrame methods; robin-sparkless currently covers ~283 functions with 159 parity fixtures (Phases 11–24 + signature alignment). Phase 25 ✅ completed (plan interpreter, expression interpreter extended to all scalar functions, logical plan schema, 3 plan fixtures, create_dataframe_from_rows). Next: Phase 26 (publish crate), Phase 27 (integration).

Reference: PYSPARK_FUNCTION_MATRIX catalogs all functions/methods; SPARKLESS_INTEGRATION_ANALYSIS.md describes architecture mapping.

Gap closure (Feb 2026): Bitmap (5), make_dt_interval, make_ym_interval, to_timestamp_ltz, to_timestamp_ntz, sequence, shuffle, inline, inline_outer, regr_* (9); cube, rollup, write (parquet/csv/json); DataFrame stubs (data, toLocalIterator, persist, unpersist, rdd, foreach, foreachPartition, mapInPandas, mapPartitions, storageLevel, isStreaming, withWatermark). XML/XPath/sentences documented as deferred. See ROBIN_SPARKLESS_MISSING.md and PYSPARK_DIFFERENCES.md.

Current State (February 2026)¶

Area	Robin-Sparkless	Sparkless	Gap
Functions	~295+ (Phase 23 + gap closure: bitmap, make_dt_interval, make_ym_interval, to_timestamp_ltz/ntz, sequence, shuffle, inline, inline_outer, regr_*; Phase 23: isin, url_decode, url_encode, hash, shift_left, shift_right, version, equal_null, stack; Phase 22: curdate, now, localtimestamp, date_diff, dateadd, datepart, extract, dayname, weekday, make_timestamp, timestampadd, timestampdiff, from_utc_timestamp, to_utc_timestamp, etc.; Phases 8–21: btrim, locate, conv, hex, unhex, Map, JSON, array extensions, string 6.4; all implemented)	403	~105
DataFrame methods	~68+ (Phase 12 + gap closure: cube, rollup, write, data, toLocalIterator, persist, unpersist; stubs: rdd, foreach, foreachPartition, mapInPandas, mapPartitions, storageLevel, isStreaming, withWatermark; Phase 12: filter, select, orderBy, groupBy, withColumn, join, union, distinct, drop, dropna, fillna, limit, sample, first, head, tail, explain, print_schema, stat, na, etc.; Phase 20: order_by_exprs)	85	~17
Parity fixtures	159 passing	270+ expected_outputs	111+
PyO3 bridge	✅ Optional `pyo3` feature; `robin_sparkless` Python module	—	—
SQL	Optional `sql` feature: SELECT, FROM, WHERE, JOIN, GROUP BY, ORDER BY, LIMIT; temp views	Full DDL/DML support	Subqueries, CTEs, DDL, HAVING

Phase Overview¶

Phase	Goal	Est. Effort
1. Foundation	Structural alignment, case sensitivity, fixture converter	2–3 weeks
2. High-Value Functions	Top 60 functions used by Sparkless parity tests	4–6 weeks
3. DataFrame Methods	Core methods: union, distinct, drop, fillna, etc.	3–4 weeks
4. PyO3 Bridge	Python bindings so Sparkless can call robin-sparkless	4–6 weeks
5. Test Conversion	Convert 50+ Sparkless tests, run in CI	2–3 weeks
6. Broad Function Parity	Array (array_position, array_remove, posexplode ✅; array_repeat → Phase 8), Map/JSON/string 6.4/window → Phase 8	8–12 weeks
7. SQL & Advanced	SQL executor, Delta Lake, performance & robustness	✅ Completed (optional `sql` / `delta` features)
8. Remaining Parity	✅ String 6.4 (mask, translate, substring_index, soundex, levenshtein, crc32, xxhash64); array extensions (exists, forall, filter, transform, sum, mean, array_repeat, array_flatten); Map (create_map, map_keys, map_values, map_entries, map_from_arrays); JSON (get_json_object, from_json, to_json); window fixtures covered	Done

Phase 1: Foundation (2–3 weeks)¶

Goal: Align structure with Sparkless so subsequent work maps cleanly.

1.1 Structural Alignment¶

[x] Split dataframe.rs into submodules:
src/dataframe/transformations.rs (filter, select, withColumn)
src/dataframe/aggregations.rs (groupBy, agg)
src/dataframe/joins.rs (join logic)
[ ] Future: Trait-based backend abstraction (trait QueryExecutor, trait DataMaterializer) for pluggability
[ ] Future: Document expression model (Column/Expr) and ensure it can represent Sparkless ColumnOperation trees

1.2 Case Sensitivity¶

[x] Add spark.sql.caseSensitive configuration (default: false)
[x] Centralized column resolution for filter, select, withColumn, join
[x] Fixture for case-insensitive column matching (case_insensitive_columns)
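
The centralized resolution step can be sketched in a few lines. This is an illustrative Python sketch, not the crate's code: the helper name `resolve_column` and the rule that ambiguous matches are rejected are assumptions; only the `case_sensitive` flag (mirroring `spark.sql.caseSensitive`, default false) comes from the list above.

```python
from typing import Sequence


def resolve_column(name: str, columns: Sequence[str], case_sensitive: bool = False) -> str:
    """Return the stored column name that `name` refers to (hypothetical helper)."""
    if case_sensitive:
        matches = [c for c in columns if c == name]
    else:
        matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise KeyError(f"column '{name}' not found; available columns: {list(columns)}")
    if len(matches) > 1:
        # Assumption: ambiguous references are rejected rather than silently resolved.
        raise ValueError(f"column '{name}' is ambiguous; candidates: {matches}")
    return matches[0]
```

Routing filter, select, withColumn, and join through one helper like this keeps the case rule in a single place, so a configuration change cannot leave one operation behaving differently from the others.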

1.3 Fixture Converter¶

[x] Script: tests/convert_sparkless_fixtures.py (or Rust tool)
[x] Map Sparkless JSON (input_data, expected_output) → robin-sparkless (input, operations, expected)
[x] Handle operation mapping: filter_operations, groupby, join, window, withColumn, union, distinct, drop, dropna, fillna, limit, withColumnRenamed
[x] Target: Convert 10–20 high-value Sparkless parity tests (run make sparkless-parity with SPARKLESS_EXPECTED_OUTPUTS set)
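
A rough sketch of the mapping the converter performs, assuming each Sparkless fixture is a dict with `input_data` and `expected_output` keys and that operations arrive as dicts carrying an `op` name. The function name `convert_fixture` and the operation dict shape are hypothetical; the actual logic lives in tests/convert_sparkless_fixtures.py and may differ.

```python
from typing import Any, Dict, List

# Operation names taken from the roadmap list above.
SUPPORTED_OPS = {
    "filter_operations", "groupby", "join", "window", "withColumn", "union",
    "distinct", "drop", "dropna", "fillna", "limit", "withColumnRenamed",
}


def convert_fixture(
    name: str, sparkless: Dict[str, Any], operations: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Map a Sparkless fixture to the robin-sparkless layout (illustrative only)."""
    fixture: Dict[str, Any] = {
        "name": name,
        "input": sparkless["input_data"],
        "operations": operations,
        "expected": sparkless["expected_output"],
    }
    unsupported = sorted({op["op"] for op in operations} - SUPPORTED_OPS)
    if unsupported:
        # Keep the fixture but mark it skipped so the gap stays visible.
        fixture["skip"] = True
        fixture["skip_reason"] = "unsupported operations: " + ", ".join(unsupported)
    return fixture
```

Marking unsupported fixtures as skipped, rather than dropping them, keeps a record of which Sparkless tests are still blocked and why.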

Phase 2: High-Value Functions (4–6 weeks)¶

Goal: Implement functions that appear most often in Sparkless parity tests and expected_outputs.

2.1 String (extend beyond current)¶

Function	Polars API	Priority
`length`	`col.str().len_chars()`	High
`trim` / `ltrim` / `rtrim`	`str().strip_chars()`, `strip_chars_start`, `strip_chars_end`	High
`regexp_extract`	`str().extract()`	High
`regexp_replace`	`str().replace()`	High
`split`	`str().split()`	High
`initcap`	`str().to_titlecase()`	Medium
`repeat`	`str().repeat_by()`	Medium
`reverse`	`str().reverse()`	Medium
`instr` / `locate`	`str().find()`	Medium
`lpad` / `rpad`	`str().pad_start` / `pad_end`	Medium

2.2 Datetime¶

Function	Polars API	Priority
`current_date`	`lit(Date::today())`	High
`current_timestamp`	`lit(Utc::now())`	High
`to_date`	`col.cast(Date)`	High
`date_add`	`col + Duration::days(n)`	High
`date_sub`	`col - Duration::days(n)`	High
`date_format`	`col.dt().strftime()`	High
`year`, `month`, `day`, `hour`, `minute`, `second`	`col.dt().year()`, etc.	High
`datediff`	`(col1 - col2).dt().total_days()`	Medium
`last_day`	`col.dt().last_day_of_month()`	Medium
`trunc`	`col.dt().truncate()`	Medium
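
When writing fixtures for these functions, it can help to have an independent reference for the expected values. The sketch below uses only the Python standard library and covers `date_add`, `date_sub`, `datediff`, and `last_day`; it is a hypothetical test oracle, not how the crate computes them.

```python
import calendar
from datetime import date, timedelta


def date_add(d: date, n: int) -> date:
    """Shift a date forward by n days."""
    return d + timedelta(days=n)


def date_sub(d: date, n: int) -> date:
    """Shift a date backward by n days."""
    return d - timedelta(days=n)


def datediff(end: date, start: date) -> int:
    """Whole days from start to end (negative if end precedes start)."""
    return (end - start).days


def last_day(d: date) -> date:
    """Last calendar day of the month containing d."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])
```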

2.3 Math & Aggregates¶

Function	Polars API	Priority
`abs`, `ceil`, `floor`, `round`	`col.abs()`, `ceil()`, `floor()`, `round()`	High
`sqrt`, `pow`, `exp`, `log`	`col.sqrt()`, `pow()`, `exp()`, `log()`	High
`stddev` / `stddev_samp`	`col.std()`	High
`variance` / `var_samp`	`col.var()`	High
`count_distinct`	`col.n_unique()`	High
`first`, `last`	`col.first()`, `last()`	High
`approx_count_distinct`	`col.n_unique()` (or HLL if available)	Medium

2.4 Conditional & Null¶

Function	Status	Notes
`when`, `coalesce`	✅ Done	—
`ifnull` / `nvl`	✅ Done (`nvl`, Phase 9)	Alias for coalesce
`nullif`	✅ Done (Phase 9)	Returns null if equal
`nanvl`	✅ Done (Phase 9)	Replace NaN with value

Phase 3: DataFrame Methods (3–4 weeks) ✅ COMPLETED¶

Goal: Implement methods needed for Sparkless DataFrame pipelines.

Method	Polars API	Status
`union` / `unionAll`	`concat` (LazyFrame)	✅ Done
`unionByName`	`concat` with schema alignment by name	✅ Done
`distinct` / `dropDuplicates`	`LazyFrame.unique()`	✅ Done
`drop`	`DataFrame.select()` (exclude columns)	✅ Done
`dropna`	`LazyFrame.drop_nulls()`	✅ Done
`fillna`	`LazyFrame.with_columns` + `fill_null`	✅ Done
`limit`	`DataFrame.head(n)`	✅ Done
`withColumnRenamed`	`DataFrame.rename()`	✅ Done
`replace`	`LazyFrame.replace()`	✅ Done (Phase 9)
`crossJoin`	`LazyFrame.join(..., how=Cross)`	✅ Done (Phase 9)
`describe`	`LazyFrame.describe()`	✅ Done (Phase 9)
`cache` / `persist`	Materialize and store; `unpersist`	✅ Done (Phase 9)
`subtract` / `except`	Anti-join or set diff	✅ Done (Phase 9, `subtract`)
`intersect` / `intersectAll`	Set operations	✅ Done (Phase 9, `intersect`)

Phase 4: PyO3 Bridge (4–6 weeks) ✅ COMPLETED¶

Goal: Enable Sparkless (Python) to call robin-sparkless (Rust) for execution.

4.1 Crate Layout¶

[x] Optional pyo3 feature in main crate; src/python/mod.rs compiled when pyo3 enabled
[x] Expose SparkSession, SparkSessionBuilder, DataFrame, Column, GroupedData as Python classes (PySpark-like names)
[x] Data transfer: create_dataframe (list of 3-tuples); collect → list of dicts

4.2 API Surface¶

[x] SparkSession.builder().app_name(...).get_or_create(), create_dataframe, read_csv, read_parquet, read_json
[x] DataFrame: filter, select, with_column, order_by, group_by, join, union, union_by_name, distinct, drop, dropna, fillna, limit, with_column_renamed, count, show, collect
[x] Column and module-level: col, lit, when().then().otherwise(), coalesce, sum, avg, min, max, count
[x] GroupedData: count(), sum(column), avg(column), min(column), max(column), agg(exprs)

4.3 Sparkless Integration (out of scope in this repo)¶

[ ] Sparkless BackendFactory adds "robin" backend option (implemented in Sparkless repo)
[ ] When "robin" selected, Sparkless delegates to robin-sparkless via PyO3
[ ] Fallback: if robin-sparkless doesn't support an op, raise or fall back to Python Polars

See EMBEDDING.md for the API surface and bindings contract Sparkless maintainers need.
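
The delegation and fallback behaviour described in 4.3 could look roughly like the sketch below. Every class here is a hypothetical stand-in: the real `BackendFactory` lives in the Sparkless repo, and the two backends simply return a tag instead of executing anything.

```python
from typing import Any, Optional, Tuple


class UnsupportedOperation(Exception):
    """Raised when a backend cannot execute an operation."""


class RobinBackend:
    name = "robin"
    SUPPORTED = {"filter", "select", "join"}  # illustrative subset only

    def execute(self, op: str, payload: Any) -> Tuple[str, str]:
        if op not in self.SUPPORTED:
            raise UnsupportedOperation(f"robin backend does not support '{op}'")
        return (self.name, op)


class PolarsBackend:
    name = "polars"

    def execute(self, op: str, payload: Any) -> Tuple[str, str]:
        return (self.name, op)


class FallbackBackend:
    """Try the primary backend; on UnsupportedOperation use the fallback or re-raise."""

    def __init__(self, primary: Any, fallback: Optional[Any] = None) -> None:
        self.primary = primary
        self.fallback = fallback

    def execute(self, op: str, payload: Any) -> Tuple[str, str]:
        try:
            return self.primary.execute(op, payload)
        except UnsupportedOperation:
            if self.fallback is None:
                raise
            return self.fallback.execute(op, payload)


class BackendFactory:
    @staticmethod
    def create(name: str, allow_fallback: bool = True) -> Any:
        if name == "polars":
            return PolarsBackend()
        if name == "robin":
            fallback = PolarsBackend() if allow_fallback else None
            return FallbackBackend(RobinBackend(), fallback)
        raise ValueError(f"unknown backend '{name}'")
```

Making the fallback optional covers both choices named in the checklist: strict mode surfaces the gap as an error, while fallback mode keeps pipelines running on Python Polars.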

4.4 Risks¶

Schema round-trip: PySpark/Sparkless types ↔ Polars ↔ Arrow; ensure nullability and types align
Performance: PyO3 overhead vs. native Python Polars; benchmark
Error handling: Rust errors → Python exceptions with useful messages

Phase 5: Test Conversion (2–3 weeks)¶

Goal: Run Sparkless parity tests against robin-sparkless backend.

[x] Fixture converter produces robin-sparkless fixtures from Sparkless expected_outputs (join, window, withColumn, union, distinct, drop, dropna, fillna, limit, withColumnRenamed; output to tests/fixtures/converted/ with --output-subdir)
[x] Identify tests that use only supported ops; run those first (run make sparkless-parity with SPARKLESS_EXPECTED_OUTPUTS set when Sparkless repo available)
[x] CI: make sparkless-parity runs converted tests (converter when path set, then cargo test pyspark_parity_fixtures; parity discovers tests/fixtures/ and tests/fixtures/converted/)
[x] Target: 50+ tests passing on robin-sparkless (93 hand-written passing; document in SPARKLESS_PARITY_STATUS.md; add converted when Sparkless expected_outputs used)
[x] Document which tests fail and why (missing function, semantic difference) in SPARKLESS_PARITY_STATUS.md; fixtures can use skip: true + skip_reason
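
The `skip: true` + `skip_reason` convention makes it straightforward to separate runnable fixtures from blocked ones and to summarize why tests are blocked. A minimal sketch, assuming fixtures are already loaded as dicts; the function names are hypothetical and the real harness is the Rust `pyspark_parity_fixtures` test.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

Fixture = Dict[str, Any]


def partition_fixtures(fixtures: Iterable[Fixture]) -> Tuple[List[Fixture], List[Fixture]]:
    """Split fixtures into (runnable, skipped) based on the `skip` flag."""
    runnable: List[Fixture] = []
    skipped: List[Fixture] = []
    for fixture in fixtures:
        (skipped if fixture.get("skip", False) else runnable).append(fixture)
    return runnable, skipped


def skip_report(skipped: Iterable[Fixture]) -> Counter:
    """Count skipped fixtures per reason, e.g. for a parity status document."""
    return Counter(f.get("skip_reason", "unspecified") for f in skipped)
```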

Phase 6: Broad Function Parity (8–12 weeks)¶

Goal: Implement remaining high-usage functions from PYSPARK_FUNCTION_MATRIX.

6.1 Array Functions (Phase 6a done)¶

[x] array, array_contains, array_join, array_max, array_min
[x] array_size, array_sort, element_at, explode
[x] array_slice, size (alias for array_size)
[x] array_position, array_remove, posexplode (implemented via Polars list.eval; Rust + Python)
[x] array_repeat, array_flatten (Phase 8; implemented via Expr::map UDFs)
[x] array_exists, array_forall, array_filter, array_transform, array_sum, array_mean (list.eval / list_any_all)
[ ] aggregate (array_aggregate; optional follow-up)

6.2 Map Functions ✅ (Phase 8)¶

Map represented as List(Struct{key, value}). Phase 8 completed: create_map (as_struct + concat_list), map_keys/map_values (list.eval + struct.field_by_name), map_entries (identity), map_from_arrays (zip UDF).
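
To make the `List(Struct{key, value})` representation concrete, here is a plain-Python model in which a map is a list of `{"key": ..., "value": ...}` dicts. It only illustrates the data layout and why `map_entries` is an identity; the crate implements these with Polars expressions, not with this code.

```python
from typing import Any, Dict, List, Sequence

MapValue = List[Dict[str, Any]]


def create_map(*args: Any) -> MapValue:
    """Build a map from alternating key, value arguments."""
    if len(args) % 2 != 0:
        raise ValueError("create_map expects an even number of arguments")
    return [{"key": args[i], "value": args[i + 1]} for i in range(0, len(args), 2)]


def map_keys(m: MapValue) -> List[Any]:
    return [entry["key"] for entry in m]


def map_values(m: MapValue) -> List[Any]:
    return [entry["value"] for entry in m]


def map_entries(m: MapValue) -> MapValue:
    # The map is already a list of key/value structs, so this is an identity.
    return m


def map_from_arrays(keys: Sequence[Any], values: Sequence[Any]) -> MapValue:
    """Zip two equal-length sequences into a map."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    return [{"key": k, "value": v} for k, v in zip(keys, values)]
```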

6.3 JSON & Binary → Phase 8¶

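JSON and binary functions were deferred from Phase 6 to Phase 8 because Polars JSON support sits behind optional features.
Phase 8 scope: get_json_object, from_json, to_json, base64, unbase64, etc. The three JSON functions are now implemented (see the Phase 8 status table).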

6.4 Additional String ✅¶

[x] regexp_extract_all, regexp_like (Phase 6e)
[x] regexp_replace (already present)
[x] mask, translate, substring_index (Phase 10)
[x] soundex, levenshtein, crc32, xxhash64 (Phase 8; UDFs via strsim, crc32fast, twox-hash, soundex crates)
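
For fixture authoring, an independent reference for `levenshtein` is handy when checking expected values. The sketch below is the textbook dynamic-programming edit distance in plain Python; it is a hypothetical oracle and not the strsim-based UDF the crate uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]
```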

6.5 Window Extensions (partial; fixture simplification → Phase 8)¶

[x] percent_rank, first_value, last_value (Phase 6d)
[x] API for cume_dist, ntile, nth_value (Rust + Python). Parity fixtures were skipped because Polars does not allow combining rank().over() and count().over() in one expression.
Phase 8: Enable percent_rank/cume_dist/ntile/nth_value parity fixtures when Polars allows, or keep multi-step workaround
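
The reason these functions need both a rank and a count is visible in their definitions: `percent_rank` is (rank - 1) / (n - 1) and `cume_dist` is (rows with value <= current) / n. The sketch below computes both for one partition in plain Python, assuming ascending order and no nulls. It is a reference for expected values, not the multi-step Polars workaround itself.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence


def percent_rank(values: Sequence[float]) -> List[float]:
    """(rank - 1) / (n - 1) per row, where rank is the minimum rank among ties."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    ordered = sorted(values)
    # bisect_left counts rows strictly smaller than v, which equals rank - 1.
    return [bisect_left(ordered, v) / (n - 1) for v in values]


def cume_dist(values: Sequence[float]) -> List[float]:
    """Fraction of rows with a value less than or equal to the current row."""
    n = len(values)
    ordered = sorted(values)
    # bisect_right counts rows less than or equal to v.
    return [bisect_right(ordered, v) / n for v in values]
```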

6.6 Deferred / Out of Scope¶

UDFs: Python UDFs require Python runtime; document as out of scope; consider pure-Rust UDFs
SQL: Phase 7.1 ✅ — optional sql feature: sqlparser + translation to DataFrame ops; temp views
Delta Lake: Phase 7.2 ✅ — optional delta feature: read_delta, write_delta, time travel
XML: xpath_*; low priority
Specialized: histogram_numeric, hll_*, count_min_sketch; defer

Phase 7: SQL & Advanced ✅ COMPLETED (optional follow-ups open)¶

7.1 SQL Support (Optional) ✅¶

[x] SparkSession::sql(query) → parse SQL, translate to DataFrame ops (when sql feature enabled)
[x] Use sqlparser for parsing; execution via existing DataFrame API
[x] Support: SELECT, FROM (single table or JOIN), WHERE, GROUP BY + aggregates (COUNT, SUM, AVG, MIN, MAX), ORDER BY, LIMIT
[x] Temporary views: createOrReplaceTempView, table()
[x] In-memory saved tables: df.write().saveAsTable(name, mode), write_delta_table(name); resolution (temp view then saved table) for table(name) and read_delta(name); catalog listTables, tableExists, dropTempView, dropTable. Tables are session-scoped (no disk persistence).
Limits (first iteration): No subqueries in FROM, no CTEs, no DDL, no HAVING; only unqualified table names; document unsupported constructs with clear errors.
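
The name resolution order (temp view first, then saved table) and the session-scoped lifetime can be modelled with two dictionaries. This Python sketch is illustrative only: the class name `SessionCatalog`, the snake_case method names, and the supported mode strings are assumptions, and append mode is left out because it would require concatenating frames.

```python
from typing import Any, Dict, List


class SessionCatalog:
    """In-memory registry of temp views and saved tables; nothing is written to disk."""

    MODES = {"error", "ignore", "overwrite"}

    def __init__(self) -> None:
        self._temp_views: Dict[str, Any] = {}
        self._tables: Dict[str, Any] = {}

    def create_or_replace_temp_view(self, name: str, df: Any) -> None:
        self._temp_views[name] = df

    def save_as_table(self, name: str, df: Any, mode: str = "error") -> None:
        if mode not in self.MODES:
            raise ValueError(f"unsupported save mode '{mode}'")
        if name in self._tables:
            if mode == "error":
                raise ValueError(f"table '{name}' already exists")
            if mode == "ignore":
                return
        self._tables[name] = df

    def table(self, name: str) -> Any:
        # Resolution order: temp view first, then saved table.
        if name in self._temp_views:
            return self._temp_views[name]
        if name in self._tables:
            return self._tables[name]
        raise KeyError(f"table or view '{name}' not found; known: {self.list_tables()}")

    def table_exists(self, name: str) -> bool:
        return name in self._temp_views or name in self._tables

    def list_tables(self) -> List[str]:
        return sorted(set(self._temp_views) | set(self._tables))

    def drop_temp_view(self, name: str) -> bool:
        existed = name in self._temp_views
        self._temp_views.pop(name, None)
        return existed

    def drop_table(self, name: str) -> bool:
        existed = name in self._tables
        self._tables.pop(name, None)
        return existed
```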

7.2 Delta Lake (Optional) ✅¶

[x] Read/write Delta tables (optional delta feature; deltalake + Polars)
[x] Time travel: read_delta_with_version(path, version) (read by version)
[x] Table by name: read_delta(name_or_path) (path → Delta on disk; name → in-memory table when sql enabled); write_delta_table(name) registers for read_delta(name)
[x] Overwrite/append: write_delta(path, overwrite)
[ ] Schema evolution, MERGE (deferred; document as Phase 7 follow-up)

7.3 Performance & Robustness ✅¶

[x] Benchmarks: robin-sparkless vs. plain Polars (cargo bench; criterion)
[x] Ensure within 2x of Polars for measured pipelines
[x] Error messages with context (column names, hints); Troubleshooting in QUICKSTART.md
[ ] Memory profiling, large-dataset handling (optional follow-up)

Phase 8: Remaining Parity ✅ COMPLETED (February 2026)¶

Goal: Implement or enable features deferred from Phase 6.

Item	Status
array_repeat	✅ Implemented via `Expr::map` UDF (list try_apply_amortized + extend).
array_flatten	✅ Implemented via `Expr::map` UDF (list-of-lists flatten per row).
Map (6b)	✅ `create_map` (as_struct + concat_list), `map_keys`/`map_values` (list.eval + struct.field_by_name), `map_entries` (identity), `map_from_arrays` (zip UDF with list builder). Map represented as `List(Struct{key, value})`.
JSON (6c)	✅ get_json_object, from_json, to_json (Phase 10).
String 6.4	✅ mask, translate, substring_index (Phase 10); soundex, levenshtein, crc32, xxhash64 via `Expr::map`/`map_many` UDFs (strsim, crc32fast, twox-hash, soundex crates).
Window fixture simplification	✅ percent_rank, cume_dist, ntile, nth_value covered via multi-step workaround in harness.
Documentation of differences	✅ PYSPARK_DIFFERENCES.md updated; no Phase 8 stubs remaining.

Success Metrics¶

Metric	Current	After Phase 19	After Phase 24 (full parity)	After Phase 26 (crate)	Full Backend (Phase 27)
Parity fixtures	159	159	180+	180+	180+
Functions implemented	~283	~283	~400	~400	~400
DataFrame methods	~55+	~55+	~55+	~55+	85
Crate on crates.io	No	—	—	Yes	Yes
Sparkless tests passing (robin backend)	0	—	—	—	200+
PyO3 bridge	✅ Yes (optional)	Yes	Yes	Yes	Yes

Path to 100% Before Sparkless Integration (ROADMAP Phases 12–27)¶

To reach 100% feature parity and a published crate before wiring the robin backend into Sparkless, ROADMAP.md defines the following phases between Phase 11 (done) and integration:

ROADMAP Phase	Goal	Est. Effort
12	DataFrame methods parity: Implement remaining ~50–60 methods → 85 total	✅ COMPLETED
13	Functions batch 1: String, binary, collection (ascii, base64, overlay, sha1, sha2, md5, array_compact, etc.)	✅ COMPLETED
14	Functions batch 2: Math, datetime, type/conditional (sin/cos/tan, quarter, add_months, cast, try_cast, greatest, least)	✅ COMPLETED
15	Functions batch 3: Batches 1–4 (aliases, string left/right/replace/like, math cosh/cbrt/etc., array_distinct); 88 fixtures	✅ COMPLETED
16	Remaining gaps 1: String/regex (regexp_count, regexp_instr, regexp_substr, split_part, find_in_set, format_string, printf); 93 fixtures	✅ COMPLETED
17	Remaining gaps 2: Datetime/unix (unix_timestamp, from_unixtime, make_date, timestamp_*, pmod, factorial)	✅ COMPLETED
18	Remaining gaps 3: array/map/struct (map_filter, zip_with, map_zip_with); 124 fixtures	✅ COMPLETED
19	Remaining gaps 4: aggregates, try_*, misc; 128 fixtures	✅ COMPLETED
20	Full parity 1: ordering, aggregates, numeric	✅ COMPLETED
21	Full parity 2: string, binary, type, array/map/struct	✅ COMPLETED
22	Full parity 3: datetime extensions	✅ COMPLETED
23	Full parity 4: JSON, CSV, URL, misc	✅ COMPLETED
24	Full parity 5: bit, control, JVM stubs, random, crypto	✅ COMPLETED
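25	Plan interpreter: expression interpreter for all scalar functions, logical plan schema, 3 plan fixtures, create_dataframe_from_rows	✅ COMPLETED
26	Publish Rust crate: crates.io, API stability, docs, release; optional PyPI wheel	2–3 weeks
27	Sparkless integration: BackendFactory "robin", 200+ tests passing, PyO3 surface	4–6 weeks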

Detail for each phase is in ROADMAP.md (§ Phase 12–27).

Implementation Order (Summary)¶

Phase 1: Fixture converter, case sensitivity, structural split ✅
Phase 2: String (length, trim, regexp_*), datetime (to_date, date_add, etc.), math (stddev, variance) ✅
Phase 3: union, unionByName, distinct, drop, dropna, fillna, limit, withColumnRenamed ✅
Phase 4: PyO3 bridge ✅ COMPLETED
Phase 5: Convert Sparkless tests, CI integration ✅
Phase 6: Array (6a ✅; array_position, array_remove, posexplode implemented; array_repeat → Phase 8), Map (6b → Phase 8), JSON (6c → Phase 8), additional string (6e ✅; 6.4 → Phase 8), window extensions (6d ✅; fixture simplification → Phase 8).
Phase 7: SQL, Delta, performance ✅ Completed (optional features; see §7)
Phase 8: ✅ COMPLETED – array_repeat, array_flatten, Map (6b), String 6.4 (soundex/levenshtein/crc32/xxhash64), window fixtures, documentation (see Phase 8 section above)
Phase 9: High-value functions (datetime, string repeat/reverse/lpad/rpad, math sqrt/pow/exp/log, nvl/nullif/nanvl, first/last/approx_count_distinct) + DataFrame methods (replace, cross_join, describe, cache/persist/unpersist, subtract, intersect) ✅ COMPLETED
Phase 10: Complex types (Map, JSON, array_repeat, string 6.4) + window fixture simplification ✅ COMPLETED
Phase 11–25: Parity scale (159 fixtures), harness date/datetime/boolean, Phase 12 DataFrame methods, Phase 13–17 functions batches, Phase 18 array/map/struct, Phase 19 aggregates, try_*, misc; Phase 20 ordering, aggregates, numeric; Phase 21 string, binary, type, array/map/struct; Phase 22 datetime extensions; Phase 23 JSON/CSV/URL/misc; Phase 24 ✅ (bit, control, JVM stubs, rand/randn, AES crypto); Phase 25 ✅ COMPLETED (plan interpreter, expression interpreter for all scalar functions, LOGICAL_PLAN_FORMAT.md, 3 plan fixtures, create_dataframe_from_rows). Phase 26 (publish), Phase 27 (integration). See ROADMAP.md.
ROADMAP Phase 12–27: Path to 100% — Phases 12–25 done (DataFrame methods ~55+, ~283 functions, 159 fixtures; plan interpreter, expression interpreter for all scalar functions, 3 plan fixtures in tests/fixtures/plans/). Phase 26: prepare and publish crate (crates.io, docs, release). Phase 27: Sparkless integration (see § Path to 100% above).
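
To give a feel for what the Phase 25 plan interpreter and expression interpreter do, here is a toy version in Python that walks a list of plan steps over rows held as dicts. The plan and expression shapes used here (`op`, `expr`, `col`, `lit`, `left`, `right`) are invented for illustration; the actual schema is defined in LOGICAL_PLAN_FORMAT.md and is not reproduced here.

```python
import operator
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical binary operators for the toy expression language.
_BINARY = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt, "add": operator.add}


def eval_expr(expr: Dict[str, Any], row: Row) -> Any:
    """Evaluate a nested expression tree against a single row."""
    if "col" in expr:
        return row[expr["col"]]
    if "lit" in expr:
        return expr["lit"]
    fn = _BINARY[expr["op"]]
    return fn(eval_expr(expr["left"], row), eval_expr(expr["right"], row))


def execute_plan(rows: List[Row], plan: List[Dict[str, Any]]) -> List[Row]:
    """Apply each plan step in order and return the resulting rows."""
    for step in plan:
        op = step["op"]
        if op == "filter":
            rows = [r for r in rows if eval_expr(step["expr"], r)]
        elif op == "withColumn":
            rows = [{**r, step["name"]: eval_expr(step["expr"], r)} for r in rows]
        elif op == "select":
            rows = [{c: r[c] for c in step["columns"]} for r in rows]
        elif op == "limit":
            rows = rows[: step["n"]]
        else:
            raise NotImplementedError(f"unsupported plan op '{op}'")
    return rows
```

A serializable plan of this kind is what lets a Python front end hand a whole pipeline to the backend in one call instead of crossing the language boundary per operation.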

ROADMAP.md – High-level roadmap and current status
PARITY_STATUS.md – Parity matrix and fixtures
SPARKLESS_INTEGRATION_ANALYSIS.md – Architecture mapping
SPARKLESS_REFACTOR_PLAN.md – Refactor plan for Sparkless (serializable logical plan)
READINESS_FOR_SPARKLESS_PLAN.md – Robin-sparkless readiness (plan interpreter, fixtures) before merge
TEST_CREATION_GUIDE.md – How to add fixtures