Full Sparkless Backend Roadmap
This document plans the path for robin-sparkless to become a complete backend replacement for Sparkless. Sparkless implements 403+ PySpark functions and 85+ DataFrame methods; robin-sparkless currently covers ~283 functions with 159 parity fixtures (Phases 11–24 + signature alignment). Phase 25 ✅ completed (plan interpreter, expression interpreter extended to all scalar functions, logical plan schema, 3 plan fixtures, create_dataframe_from_rows). Next: Phase 26 (publish crate), Phase 27 (integration).
Reference: PYSPARK_FUNCTION_MATRIX catalogs all functions/methods; SPARKLESS_INTEGRATION_ANALYSIS.md describes architecture mapping.
Gap closure (Feb 2026): Bitmap (5), make_dt_interval, make_ym_interval, to_timestamp_ltz, to_timestamp_ntz, sequence, shuffle, inline, inline_outer, regr_* (9); cube, rollup, write (parquet/csv/json); DataFrame stubs (data, toLocalIterator, persist, unpersist, rdd, foreach, foreachPartition, mapInPandas, mapPartitions, storageLevel, isStreaming, withWatermark). XML/XPath/sentences documented as deferred. See ROBIN_SPARKLESS_MISSING.md and PYSPARK_DIFFERENCES.md.
Current State (February 2026)
| Area | Robin-Sparkless | Sparkless | Gap |
|---|---|---|---|
| Functions | ~295+ (Phase 23 + gap closure: bitmap, make_dt_interval, make_ym_interval, to_timestamp_ltz/ntz, sequence, shuffle, inline, inline_outer, regr_*; Phase 23: isin, url_decode, url_encode, hash, shift_left, shift_right, version, equal_null, stack; Phase 22: curdate, now, localtimestamp, date_diff, dateadd, datepart, extract, dayname, weekday, make_timestamp, timestampadd, timestampdiff, from_utc_timestamp, to_utc_timestamp, etc.; Phase 21–8: btrim, locate, conv, hex, unhex, Map, JSON, array extensions, string 6.4 — all implemented) | 403 | ~105 |
| DataFrame methods | ~68+ (Phase 12 + gap closure: cube, rollup, write, data, toLocalIterator, persist, unpersist; stubs: rdd, foreach, foreachPartition, mapInPandas, mapPartitions, storageLevel, isStreaming, withWatermark; Phase 12: filter, select, orderBy, groupBy, withColumn, join, union, distinct, drop, dropna, fillna, limit, sample, first, head, tail, explain, print_schema, stat, na, etc.; Phase 20: order_by_exprs) | 85 | ~17 |
| Parity fixtures | 159 passing | 270+ expected_outputs | 111+ |
| PyO3 bridge | ✅ Optional `pyo3` feature; `robin_sparkless` Python module | - | - |
| SQL | Optional `sql` feature: SELECT, FROM, WHERE, JOIN, GROUP BY, ORDER BY, LIMIT; temp views | Full DDL/DML support | Subqueries, CTEs, DDL, HAVING |
Phase Overview
| Phase | Goal | Est. Effort |
|---|---|---|
| 1. Foundation | Structural alignment, case sensitivity, fixture converter | 2–3 weeks |
| 2. High-Value Functions | Top 60 functions used by Sparkless parity tests | 4–6 weeks |
| 3. DataFrame Methods | Core methods: union, distinct, drop, fillna, etc. | 3–4 weeks |
| 4. PyO3 Bridge | Python bindings so Sparkless can call robin-sparkless | 4–6 weeks |
| 5. Test Conversion | Convert 50+ Sparkless tests, run in CI | 2–3 weeks |
| 6. Broad Function Parity | Array (array_position, array_remove, posexplode ✅; array_repeat → Phase 8), Map/JSON/string 6.4/window → Phase 8 | 8–12 weeks |
| 7. SQL & Advanced | SQL executor, Delta Lake, performance & robustness | ✅ Completed (optional sql / delta features) |
| 8. Remaining Parity | ✅ String 6.4 (mask, translate, substring_index, soundex, levenshtein, crc32, xxhash64); array extensions (exists, forall, filter, transform, sum, mean, array_repeat, array_flatten); Map (create_map, map_keys, map_values, map_entries, map_from_arrays); JSON (get_json_object, from_json, to_json); window fixtures covered | Done |
Phase 1: Foundation (2–3 weeks)
Goal: Align structure with Sparkless so subsequent work maps cleanly.
1.1 Structural Alignment
- [x] Split `dataframe.rs` into submodules:
  - `src/dataframe/transformations.rs` (filter, select, withColumn)
  - `src/dataframe/aggregations.rs` (groupBy, agg)
  - `src/dataframe/joins.rs` (join logic)
- [ ] Future: Trait-based backend abstraction (`trait QueryExecutor`, `trait DataMaterializer`) for pluggability
- [ ] Future: Document expression model (Column/Expr) and ensure it can represent Sparkless `ColumnOperation` trees
1.2 Case Sensitivity
- [x] Add `spark.sql.caseSensitive` configuration (default: false)
- [x] Centralized column resolution for filter, select, withColumn, join
- [x] Fixture for case-insensitive column matching (`case_insensitive_columns`)
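The centralized column resolution described above can be illustrated with a small standalone sketch. This is not the crate's implementation: `resolve_column` is a hypothetical helper, and treating an ambiguous case-insensitive match as an error is an assumption made for the example.

```python
def resolve_column(name, schema_columns, case_sensitive=False):
    """Return the schema column that `name` refers to.

    Hypothetical helper; the default mirrors spark.sql.caseSensitive=false.
    """
    if case_sensitive:
        # Exact match only.
        if name in schema_columns:
            return name
        raise KeyError(f"column not found: {name!r}")

    # Case-insensitive match against every schema column.
    matches = [c for c in schema_columns if c.lower() == name.lower()]
    if not matches:
        raise KeyError(f"column not found: {name!r}")
    if len(matches) > 1:
        # Assumption: ambiguity is reported instead of picking the first match.
        raise ValueError(f"ambiguous column {name!r}: {matches}")
    return matches[0]


schema = ["id", "Name", "AGE"]
resolved = resolve_column("name", schema)  # resolves to the stored spelling
```

Routing filter, select, withColumn, and join through one function like this keeps the case rule in a single place, so the configuration flag only has to be consulted once.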
1.3 Fixture Converter
- [x] Script: `tests/convert_sparkless_fixtures.py` (or Rust tool)
- [x] Map Sparkless JSON (`input_data`, `expected_output`) → robin-sparkless (`input`, `operations`, `expected`)
- [x] Handle operation mapping: filter_operations, groupby, join, window, withColumn, union, distinct, drop, dropna, fillna, limit, withColumnRenamed
- [x] Target: Convert 10–20 high-value Sparkless parity tests (run `make sparkless-parity` with `SPARKLESS_EXPECTED_OUTPUTS` set)
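As a rough illustration of the key mapping above, the sketch below renames the top-level keys and rejects operations outside the supported list. Only the key names (`input_data`, `expected_output`, `input`, `operations`, `expected`) and the operation names come from this document; `convert_fixture` and the `{"op": ...}` shape of an operation entry are assumptions, not the format used by `tests/convert_sparkless_fixtures.py`.

```python
# Operation names listed in the roadmap as handled by the converter.
SUPPORTED_OPS = {
    "filter_operations", "groupby", "join", "window", "withColumn", "union",
    "distinct", "drop", "dropna", "fillna", "limit", "withColumnRenamed",
}


def convert_fixture(sparkless_fixture, operations):
    """Map a Sparkless fixture dict to the robin-sparkless fixture layout.

    Hypothetical: each operation is assumed to be a dict with an "op" key.
    """
    unsupported = [op["op"] for op in operations if op["op"] not in SUPPORTED_OPS]
    if unsupported:
        raise ValueError(f"unsupported operations: {unsupported}")
    return {
        "input": sparkless_fixture["input_data"],
        "operations": operations,
        "expected": sparkless_fixture["expected_output"],
    }


sparkless = {"input_data": [{"a": 1}, {"a": 1}], "expected_output": [{"a": 1}]}
converted = convert_fixture(sparkless, [{"op": "distinct"}])
```

Failing loudly on an unknown operation makes it easy to see which Sparkless tests cannot be converted yet.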
Phase 2: High-Value Functions (4–6 weeks)
Goal: Implement functions that appear most often in Sparkless parity tests and expected_outputs.
2.1 String (extend beyond current)
| Function | Polars API | Priority |
|---|---|---|
| `length` | `col.str().len_chars()` | High |
| `trim` / `ltrim` / `rtrim` | `str().strip_chars()`, `strip_chars_start`, `strip_chars_end` | High |
| `regexp_extract` | `str().extract()` | High |
| `regexp_replace` | `str().replace()` | High |
| `split` | `str().split()` | High |
| `initcap` | `str().to_titlecase()` | Medium |
| `repeat` | `str().repeat_by()` | Medium |
| `reverse` | `str().reverse()` | Medium |
| `instr` / `locate` | `str().find()` | Medium |
| `lpad` / `rpad` | `str().pad_start` / `pad_end` | Medium |
2.2 Datetime
| Function | Polars API | Priority |
|---|---|---|
| `current_date` | `lit(Date::today())` | High |
| `current_timestamp` | `lit(Utc::now())` | High |
| `to_date` | `col.cast(Date)` | High |
| `date_add` | `col + Duration::days(n)` | High |
| `date_sub` | `col - Duration::days(n)` | High |
| `date_format` | `col.dt().strftime()` | High |
| `year`, `month`, `day`, `hour`, `minute`, `second` | `col.dt().year()`, etc. | High |
| `datediff` | `(col1 - col2).dt().total_days()` | Medium |
| `last_day` | `col.dt().last_day_of_month()` | Medium |
| `trunc` | `col.dt().truncate()` | Medium |
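For checking a Polars-backed implementation against expected results, it can help to have a plain reference for the date arithmetic in the table. The sketch below uses only the Python standard library and operates on single values rather than columns; the function names follow PySpark, but the helpers themselves are illustrative and not part of robin-sparkless.

```python
import calendar
from datetime import date, timedelta


def date_add(d, days):
    """Date `days` days after `d`."""
    return d + timedelta(days=days)


def date_sub(d, days):
    """Date `days` days before `d`."""
    return d - timedelta(days=days)


def datediff(end, start):
    """Whole days from `start` to `end` (negative if `end` is earlier)."""
    return (end - start).days


def last_day(d):
    """Last calendar day of the month containing `d`."""
    _, days_in_month = calendar.monthrange(d.year, d.month)
    return d.replace(day=days_in_month)


leap_feb = date(2024, 2, 28)
after_two_days = date_add(leap_feb, 2)  # crosses Feb 29 in a leap year
```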
2.3 Math & Aggregates
| Function | Polars API | Priority |
|---|---|---|
| `abs`, `ceil`, `floor`, `round` | `col.abs()`, `ceil()`, `floor()`, `round()` | High |
| `sqrt`, `pow`, `exp`, `log` | `col.sqrt()`, `pow()`, `exp()`, `log()` | High |
| `stddev` / `stddev_samp` | `col.std()` | High |
| `variance` / `var_samp` | `col.var()` | High |
| `count_distinct` | `col.n_unique()` | High |
| `first`, `last` | `col.first()`, `last()` | High |
| `approx_count_distinct` | `col.n_unique()` (or HLL if available) | Medium |
2.4 Conditional & Null
| Function | Status | Notes |
|---|---|---|
| `when`, `coalesce` | ✅ Done | - |
| `ifnull` / `nvl` | To add | Alias for coalesce |
| `nullif` | To add | Returns null if equal |
| `nanvl` | To add | Replace NaN with value |
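The notes column above describes each function in a few words. A scalar sketch of those semantics, with Python's `None` standing in for SQL null, might look like the following. These are illustrative helpers on single values, not the column expressions the crate builds.

```python
import math


def coalesce(*values):
    """First non-null argument, or None if all are null."""
    for v in values:
        if v is not None:
            return v
    return None


def nvl(value, default):
    """Alias for a two-argument coalesce (same as ifnull)."""
    return coalesce(value, default)


def nullif(a, b):
    """Null when the two values are equal, otherwise the first value."""
    return None if a == b else a


def nanvl(a, b):
    """Second value when the first is NaN, otherwise the first value."""
    if isinstance(a, float) and math.isnan(a):
        return b
    return a


filled = nvl(None, 0)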
Phase 3: DataFrame Methods (3–4 weeks) ✅ COMPLETED
Goal: Implement methods needed for Sparkless DataFrame pipelines.
| Method | Polars API | Status |
|---|---|---|
| `union` / `unionAll` | `concat` (LazyFrame) | ✅ Done |
| `unionByName` | `concat` with schema alignment by name | ✅ Done |
| `distinct` / `dropDuplicates` | `LazyFrame.unique()` | ✅ Done |
| `drop` | `DataFrame.select()` (exclude columns) | ✅ Done |
| `dropna` | `LazyFrame.drop_nulls()` | ✅ Done |
| `fillna` | `LazyFrame.with_columns` + `fill_null` | ✅ Done |
| `limit` | `DataFrame.head(n)` | ✅ Done |
| `withColumnRenamed` | `DataFrame.rename()` | ✅ Done |
| `replace` | `LazyFrame.replace()` | Medium |
| `crossJoin` | `LazyFrame.join(..., how=Cross)` | Medium |
| `describe` | `LazyFrame.describe()` | Medium |
| `cache` / `persist` | Materialize and store; `unpersist` | Medium |
| `subtract` / `except` | Anti-join or set diff | Medium |
| `intersect` / `intersectAll` | Set operations | Low |
Phase 4: PyO3 Bridge (4–6 weeks) ✅ COMPLETED
Goal: Enable Sparkless (Python) to call robin-sparkless (Rust) for execution.
4.1 Crate Layout
- [x] Optional `pyo3` feature in main crate; `src/python/mod.rs` compiled when `pyo3` enabled
- [x] Expose `SparkSession`, `SparkSessionBuilder`, `DataFrame`, `Column`, `GroupedData` as Python classes (PySpark-like names)
- [x] Data transfer: `create_dataframe` (list of 3-tuples); `collect` → list of dicts
4.2 API Surface
- [x] `SparkSession.builder().app_name(...).get_or_create()`, `create_dataframe`, `read_csv`, `read_parquet`, `read_json`
- [x] `DataFrame`: `filter`, `select`, `with_column`, `order_by`, `group_by`, `join`, `union`, `union_by_name`, `distinct`, `drop`, `dropna`, `fillna`, `limit`, `with_column_renamed`, `count`, `show`, `collect`
- [x] `Column` and module-level: `col`, `lit`, `when().then().otherwise()`, `coalesce`, `sum`, `avg`, `min`, `max`, `count`
- [x] `GroupedData`: `count()`, `sum(column)`, `avg(column)`, `min(column)`, `max(column)`, `agg(exprs)`
4.3 Sparkless Integration (out of scope in this repo)
- [ ] Sparkless `BackendFactory` adds "robin" backend option (implemented in Sparkless repo)
- [ ] When "robin" selected, Sparkless delegates to robin-sparkless via PyO3
- [ ] Fallback: if robin-sparkless doesn't support an op, raise or fall back to Python Polars
See EMBEDDING.md for the API surface and bindings contract Sparkless maintainers need.
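The delegate-then-fall-back behaviour in the checklist above could be structured roughly as follows. Everything in this sketch is hypothetical: the class names, the `execute(op, data)` signature, and the `strict` flag are placeholders for illustration and do not describe Sparkless's actual `BackendFactory` or the PyO3 bindings.

```python
class UnsupportedOperation(Exception):
    """Raised by a backend that cannot run the requested operation."""


class RobinBackend:
    # Hypothetical stand-in for the PyO3-backed executor.
    supported_ops = {"filter", "select"}

    def execute(self, op, data):
        if op not in self.supported_ops:
            raise UnsupportedOperation(op)
        return ("robin", op)


class PolarsBackend:
    # Hypothetical stand-in for the existing Python Polars path.
    def execute(self, op, data):
        return ("polars", op)


class FallbackBackend:
    """Try the primary backend; on an unsupported op, raise or fall back."""

    def __init__(self, primary, fallback, strict=False):
        self.primary = primary
        self.fallback = fallback
        self.strict = strict  # strict=True corresponds to the "raise" option

    def execute(self, op, data):
        try:
            return self.primary.execute(op, data)
        except UnsupportedOperation:
            if self.strict:
                raise
            return self.fallback.execute(op, data)


backend = FallbackBackend(RobinBackend(), PolarsBackend())
handled_by = backend.execute("pivot", data=None)[0]
```

A strict mode is useful in parity runs, where a silent fallback would hide the fact that robin-sparkless did not execute the operation.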
4.4 Risks
- Schema round-trip: PySpark/Sparkless types ↔ Polars ↔ Arrow; ensure nullability and types align
- Performance: PyO3 overhead vs. native Python Polars; benchmark
- Error handling: Rust errors → Python exceptions with useful messages
Phase 5: Test Conversion (2–3 weeks)
Goal: Run Sparkless parity tests against robin-sparkless backend.
- [x] Fixture converter produces robin-sparkless fixtures from Sparkless expected_outputs (join, window, withColumn, union, distinct, drop, dropna, fillna, limit, withColumnRenamed; output to `tests/fixtures/converted/` with `--output-subdir`)
- [x] Identify tests that use only supported ops; run those first (run `make sparkless-parity` with `SPARKLESS_EXPECTED_OUTPUTS` set when Sparkless repo available)
- [x] CI: `make sparkless-parity` runs converted tests (converter when path set, then `cargo test pyspark_parity_fixtures`; parity discovers `tests/fixtures/` and `tests/fixtures/converted/`)
- [x] Target: 50+ tests passing on robin-sparkless (93 hand-written passing; document in SPARKLESS_PARITY_STATUS.md; add converted when Sparkless expected_outputs used)
- [x] Document which tests fail and why (missing function, semantic difference) in SPARKLESS_PARITY_STATUS.md; fixtures can use `skip: true` + `skip_reason`
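To show how the `skip` and `skip_reason` fields might be consumed, here is a small sketch that separates runnable fixtures from skipped ones and keeps the reason for reporting. The `name` field and the `partition_fixtures` helper are assumptions for the example; the actual harness runs through `cargo test pyspark_parity_fixtures`.

```python
def partition_fixtures(fixtures):
    """Split fixture dicts into (runnable, skipped-with-reason)."""
    runnable, skipped = [], []
    for fixture in fixtures:
        if fixture.get("skip", False):
            # Keep the reason so it can be listed in a status document.
            reason = fixture.get("skip_reason", "no reason given")
            skipped.append((fixture.get("name", "<unnamed>"), reason))
        else:
            runnable.append(fixture)
    return runnable, skipped


fixtures = [
    {"name": "filter_basic"},
    {"name": "xpath_string", "skip": True, "skip_reason": "missing function"},
]
runnable, skipped = partition_fixtures(fixtures)
```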
Phase 6: Broad Function Parity (8–12 weeks)
Goal: Implement remaining high-usage functions from PYSPARK_FUNCTION_MATRIX.
6.1 Array Functions (Phase 6a done)
- [x] `array`, `array_contains`, `array_join`, `array_max`, `array_min`
- [x] `array_size`, `array_sort`, `element_at`, `explode`
- [x] `array_slice`, `size` (alias for array_size)
- [x] `array_position`, `array_remove`, `posexplode` (implemented via Polars list.eval; Rust + Python)
- [x] `array_repeat`, `array_flatten` (Phase 8; implemented via Expr::map UDFs)
- [x] `array_exists`, `array_forall`, `array_filter`, `array_transform`, `array_sum`, `array_mean` (list.eval / list_any_all)
- [ ] `aggregate` (array_aggregate; optional follow-up)
6.2 Map Functions ✅ (Phase 8)
- Map represented as `List(Struct{key, value})`. Phase 8 completed: `create_map` (as_struct + concat_list), `map_keys`/`map_values` (list.eval + struct.field_by_name), `map_entries` (identity), `map_from_arrays` (zip UDF).
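The list-of-structs representation can be pictured with plain Python values, where a map is a list of `{"key": ..., "value": ...}` dicts. The sketch below mirrors the function names above on a single row; it is an illustration of the data model only and does not use Polars or the crate's UDFs. Raising on mismatched array lengths is an assumption.

```python
def map_from_arrays(keys, values):
    """Zip two equal-length lists into a list of key/value structs."""
    if len(keys) != len(values):
        # Assumption: mismatched lengths are an error.
        raise ValueError("keys and values must have the same length")
    return [{"key": k, "value": v} for k, v in zip(keys, values)]


def map_keys(m):
    """Project the key field of every entry."""
    return [entry["key"] for entry in m]


def map_values(m):
    """Project the value field of every entry."""
    return [entry["value"] for entry in m]


def map_entries(m):
    """Identity: the map already is its list of entries."""
    return m


scores = map_from_arrays(["a", "b"], [1, 2])
```

Because the map already is a list of entries, `map_entries` needs no work, which matches the "identity" note above.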
6.3 JSON & Binary → Phase 8
- JSON/binary deferred to Phase 8. Polars JSON support is behind optional features.
- Phase 8: `get_json_object`, `from_json`, `to_json`, `base64`, `unbase64`, etc.
6.4 Additional String ✅
- [x] `regexp_extract_all`, `regexp_like` (Phase 6e)
- [x] `regexp_replace` (already present)
- [x] `mask`, `translate`, `substring_index` (Phase 10)
- [x] `soundex`, `levenshtein`, `crc32`, `xxhash64` (Phase 8; UDFs via strsim, crc32fast, twox-hash, soundex crates)
6.5 Window Extensions (partial; fixture simplification → Phase 8)
- [x] `percent_rank`, `first_value`, `last_value` (Phase 6d)
- [x] API `cume_dist`, `ntile`, `nth_value` (Rust + Python; parity fixtures skipped: Polars does not allow combining rank().over() and count().over() in one expr)
- Phase 8: Enable percent_rank/cume_dist/ntile/nth_value parity fixtures when Polars allows, or keep multi-step workaround
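The multi-step idea is to compute the rank and the partition count as separate intermediate columns and combine them afterwards. A plain-Python sketch for a single partition, using the standard definitions `percent_rank = (rank - 1) / (n - 1)` and `cume_dist = (rows <= current) / n`, is shown below. It illustrates the arithmetic only and is not the harness code.

```python
def percent_rank(values):
    """percent_rank over one partition ordered by value (ties share a rank)."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    # Step 1: rank column (1 + number of strictly smaller values).
    ranks = [1 + sum(1 for x in values if x < v) for v in values]
    # Step 2: count column is the constant n; step 3: combine.
    return [(r - 1) / (n - 1) for r in ranks]


def cume_dist(values):
    """cume_dist over one partition ordered by value."""
    n = len(values)
    # Step 1: rows at or before the current value; step 2: divide by count.
    at_or_before = [sum(1 for x in values if x <= v) for v in values]
    return [c / n for c in at_or_before]


partition = [10, 20, 20, 30]
pr = percent_rank(partition)
cd = cume_dist(partition)
```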
6.6 Deferred / Out of Scope
- UDFs: Python UDFs require Python runtime; document as out of scope; consider pure-Rust UDFs
- SQL: Phase 7.1 ✅: optional `sql` feature: sqlparser + translation to DataFrame ops; temp views
- Delta Lake: Phase 7.2 ✅: optional `delta` feature: read_delta, write_delta, time travel
- XML: `xpath_*`; low priority
- Specialized: `histogram_numeric`, `hll_*`, `count_min_sketch`; defer
Phase 7: SQL & Advanced (Ongoing)
7.1 SQL Support (Optional) ✅
- [x] `SparkSession::sql(query)` → parse SQL, translate to DataFrame ops (when `sql` feature enabled)
- [x] Use `sqlparser` for parsing; execution via existing DataFrame API
- [x] Support: SELECT, FROM (single table or JOIN), WHERE, GROUP BY + aggregates (COUNT, SUM, AVG, MIN, MAX), ORDER BY, LIMIT
- [x] Temporary views: `createOrReplaceTempView`, `table()`
- [x] In-memory saved tables: `df.write().saveAsTable(name, mode)`, `write_delta_table(name)`; resolution (temp view then saved table) for `table(name)` and `read_delta(name)`; catalog `listTables`, `tableExists`, `dropTempView`, `dropTable`. Tables are session-scoped (no disk persistence).
- Limits (first iteration): No subqueries in FROM, no CTEs, no DDL, no HAVING; only unqualified table names; document unsupported constructs with clear errors.
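The resolution order for `table(name)` (temp view first, then saved table) can be sketched with two in-memory dictionaries. `SessionCatalog` and its snake_case method names are hypothetical and only loosely mirror the catalog calls listed above; save modes are left out because this document does not spell out their semantics.

```python
class SessionCatalog:
    """Session-scoped name lookup: nothing here is persisted to disk."""

    def __init__(self):
        self._temp_views = {}
        self._saved_tables = {}

    def create_or_replace_temp_view(self, name, df):
        self._temp_views[name] = df

    def save_as_table(self, name, df):
        self._saved_tables[name] = df

    def table(self, name):
        # Temp views shadow saved tables with the same name.
        if name in self._temp_views:
            return self._temp_views[name]
        if name in self._saved_tables:
            return self._saved_tables[name]
        raise KeyError(f"table or view not found: {name!r}")

    def table_exists(self, name):
        return name in self._temp_views or name in self._saved_tables

    def list_tables(self):
        return sorted(set(self._temp_views) | set(self._saved_tables))

    def drop_temp_view(self, name):
        return self._temp_views.pop(name, None) is not None

    def drop_table(self, name):
        return self._saved_tables.pop(name, None) is not None


catalog = SessionCatalog()
catalog.save_as_table("people", "saved-df")
catalog.create_or_replace_temp_view("people", "view-df")
resolved = catalog.table("people")  # the temp view wins
```

Dropping the temp view afterwards would make the same name resolve to the saved table again.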
7.2 Delta Lake (Optional) ✅
- [x] Read/write Delta tables (optional `delta` feature; deltalake + Polars)
- [x] Time travel: `read_delta_with_version(path, version)` (read by version)
- [x] Table by name: `read_delta(name_or_path)` (path → Delta on disk; name → in-memory table when `sql` enabled); `write_delta_table(name)` registers for `read_delta(name)`
- [x] Overwrite/append: `write_delta(path, overwrite)`
- [ ] Schema evolution, MERGE (deferred; document as Phase 7 follow-up)
7.3 Performance & Robustness ✅
- [x] Benchmarks: robin-sparkless vs. plain Polars (`cargo bench`; criterion)
- [x] Ensure within 2x of Polars for measured pipelines
- [x] Error messages with context (column names, hints); Troubleshooting in QUICKSTART.md
- [ ] Memory profiling, large-dataset handling (optional follow-up)
Phase 8: Remaining Parity ✅ COMPLETED (February 2026)
Goal: Implement or enable features deferred from Phase 6.
| Item | Status |
|---|---|
| array_repeat | ✅ Implemented via Expr::map UDF (list try_apply_amortized + extend). |
| array_flatten | ✅ Implemented via Expr::map UDF (list-of-lists flatten per row). |
| Map (6b) | ✅ create_map (as_struct + concat_list), map_keys/map_values (list.eval + struct.field_by_name), map_entries (identity), map_from_arrays (zip UDF with list builder). Map represented as List(Struct{key, value}). |
| JSON (6c) | ✅ get_json_object, from_json, to_json (Phase 10). |
| String 6.4 | ✅ mask, translate, substring_index (Phase 10); soundex, levenshtein, crc32, xxhash64 via Expr::map/map_many UDFs (strsim, crc32fast, twox-hash, soundex crates). |
| Window fixture simplification | ✅ percent_rank, cume_dist, ntile, nth_value covered via multi-step workaround in harness. |
| Documentation of differences | ✅ PYSPARK_DIFFERENCES.md updated; no Phase 8 stubs remaining. |
Success Metrics
| Metric | Current | After Phase 19 | After Phase 24 (full parity) | After Phase 26 (crate) | Full Backend (Phase 27) |
|---|---|---|---|---|---|
| Parity fixtures | 159 | 159 | 180+ | 180+ | 180+ |
| Functions implemented | ~283 | ~283 | ~400 | ~400 | ~400 |
| DataFrame methods | ~55+ | ~55+ | ~55+ | ~55+ | 85 |
| Crate on crates.io | No | — | — | Yes | Yes |
| Sparkless tests passing (robin backend) | 0 | — | — | — | 200+ |
| PyO3 bridge | ✅ Yes (optional) | Yes | Yes | Yes | Yes |
Path to 100% Before Sparkless Integration (ROADMAP Phases 12–27)
To reach 100% feature parity and a published crate before wiring the robin backend into Sparkless, ROADMAP.md defines the following phases between Phase 11 (done) and integration:
| ROADMAP Phase | Goal | Est. Effort |
|---|---|---|
| 12 | DataFrame methods parity: Implement remaining ~50–60 methods → 85 total | ✅ COMPLETED |
| 13 | Functions batch 1: String, binary, collection (ascii, base64, overlay, sha1, sha2, md5, array_compact, etc.) | ✅ COMPLETED |
| 14 | Functions batch 2: Math, datetime, type/conditional (sin/cos/tan, quarter, add_months, cast, try_cast, greatest, least) | ✅ COMPLETED |
| 15 | Functions batch 3 ✅ COMPLETED: Batches 1–4 (aliases, string left/right/replace/like, math cosh/cbrt/etc., array_distinct); 88 fixtures. | — |
| 16 | Remaining gaps 1 ✅ COMPLETED: String/regex (regexp_count, regexp_instr, regexp_substr, split_part, find_in_set, format_string, printf); 93 fixtures. | — |
| 17 | Remaining gaps 2 ✅ COMPLETED: Datetime/unix (unix_timestamp, from_unixtime, make_date, timestamp_*, pmod, factorial). | — |
| 18 | Remaining gaps 3 ✅ COMPLETED: array/map/struct (map_filter, zip_with, map_zip_with); 124 fixtures. | — |
| 19 | Remaining gaps 4 ✅ COMPLETED: aggregates, try_*, misc; 128 fixtures. | — |
| 20 | Full parity 1: ordering, aggregates, numeric | ✅ COMPLETED |
| 21 | Full parity 2: string, binary, type, array/map/struct | ✅ COMPLETED |
| 22 | Full parity 3: datetime extensions | ✅ COMPLETED |
| 23 | Full parity 4: JSON, CSV, URL, misc | ✅ COMPLETED |
| 24 | Full parity 5: bit, control, JVM stubs, random, crypto | ✅ COMPLETED |
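| 25 | Plan interpreter: expression interpreter extended to all scalar functions, logical plan schema, 3 plan fixtures, create_dataframe_from_rows | ✅ COMPLETED |
| 26 | Publish Rust crate: crates.io, API stability, docs, release; optional PyPI wheel | 2–3 weeks |
| 27 | Sparkless integration: BackendFactory "robin", 200+ tests passing, PyO3 surface | 4–6 weeks |
Detail for each phase is in ROADMAP.md (§ Phase 12–27).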
Implementation Order (Summary)
- Phase 1: Fixture converter, case sensitivity, structural split ✅
- Phase 2: String (length, trim, regexp_*), datetime (to_date, date_add, etc.), math (stddev, variance) ✅
- Phase 3: union, unionByName, distinct, drop, dropna, fillna, limit, withColumnRenamed ✅
- Phase 4: PyO3 bridge ✅ COMPLETED
- Phase 5: Convert Sparkless tests, CI integration
- Phase 6: Array (6a ✅; array_position, array_remove, posexplode implemented; array_repeat → Phase 8), Map (6b → Phase 8), JSON (6c → Phase 8), additional string (6e ✅; 6.4 → Phase 8), window extensions (6d ✅; fixture simplification → Phase 8).
- Phase 7: SQL, Delta, performance ✅ Completed (optional features; see §7)
- Phase 8: ✅ COMPLETED – array_repeat, array_flatten, Map (6b), String 6.4 (soundex/levenshtein/crc32/xxhash64), window fixtures, documentation (see Phase 8 section above)
- Phase 9: High-value functions (datetime, string repeat/reverse/lpad/rpad, math sqrt/pow/exp/log, nvl/nullif/nanvl, first/last/approx_count_distinct) + DataFrame methods (replace, cross_join, describe, cache/persist/unpersist, subtract, intersect) ✅ COMPLETED
- Phase 10: Complex types (Map, JSON, array_repeat, string 6.4) + window fixture simplification ✅ COMPLETED
- Phase 11–25: Parity scale (159 fixtures), harness date/datetime/boolean, Phase 12 DataFrame methods, Phase 13–17 functions batches, Phase 18 array/map/struct, Phase 19 aggregates, try_*, misc; Phase 20 ordering, aggregates, numeric; Phase 21 string, binary, type, array/map/struct; Phase 22 datetime extensions; Phase 23 JSON/CSV/URL/misc; Phase 24 ✅ (bit, control, JVM stubs, rand/randn, AES crypto); Phase 25 ✅ COMPLETED (plan interpreter, expression interpreter for all scalar functions, LOGICAL_PLAN_FORMAT.md, 3 plan fixtures, create_dataframe_from_rows). Phase 26 (publish), Phase 27 (integration). See ROADMAP.md.
- ROADMAP Phase 12–27: Path to 100%. Phases 12–25 done (DataFrame methods ~55+, ~283 functions, 159 fixtures; plan interpreter, expression interpreter for all scalar functions, 3 plan fixtures in `tests/fixtures/plans/`). Phase 26: prepare and publish crate (crates.io, docs, release). Phase 27: Sparkless integration (see § Path to 100% above).
Related Docs
- ROADMAP.md – High-level roadmap and current status
- PARITY_STATUS.md – Parity matrix and fixtures
- SPARKLESS_INTEGRATION_ANALYSIS.md – Architecture mapping
- SPARKLESS_REFACTOR_PLAN.md – Refactor plan for Sparkless (serializable logical plan)
- READINESS_FOR_SPARKLESS_PLAN.md – Robin-sparkless readiness (plan interpreter, fixtures) before merge
- TEST_CREATION_GUIDE.md – How to add fixtures