eddiethedean/robin-sparkless

createDataFrame: Missing Features vs PySpark

This page compares robin-sparkless `createDataFrame(data, schema=None, sampling_ratio=None, verify_schema=True)` with PySpark's `createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)` to identify gaps.

Parameters (signature)

| Feature | PySpark | Robin-sparkless |
| --- | --- | --- |
| `data` | RDD, list, pandas.DataFrame, numpy.ndarray, pyarrow.Table | ✅ list (dicts or list/tuple rows), pandas, numpy, pyarrow (via `normalize_create_dataframe_input` in `python/src/lib.rs`); ✅ RDD via collect-to-list |
| `schema=None` | ✅ | ✅ |
| `samplingRatio=None` | ✅ (used for RDD inference) | ✅ accepted (as `sampling_ratio`); ignored for list data |
| `verifySchema=True` | ✅ (validates each row) | ✅ accepted (as `verify_schema`); validation is best-effort while the DataFrame is built |

No parameters are missing; all optional arguments are accepted.

Data input types (`data`)

| Input type | PySpark | Robin-sparkless |
| --- | --- | --- |
| List of dicts | ✅ | ✅ |
| List of list/tuple | ✅ | ✅ |
| List of Row | ✅ (schema inferred from Row) | ⚠️ May work if Row is sequence-like (extracts as list); not explicitly supported. Dict-like Row (e.g. `row["name"]`) is not tried. |
| List of namedtuple | ✅ | ⚠️ May work if extracts as list/tuple. |
| RDD | ✅ | ✅ (collects to list of dicts; no distributed RDD API) |
| pandas.DataFrame | ✅ | ✅ (`to_dict("records")` / column-order path in binding) |
| numpy.ndarray | ✅ | ✅ (`tolist()` with 1D/2D handling) |
| pyarrow.Table | ✅ (since 4.0) | ✅ (`to_pylist()`) |

Summary: The remaining gaps are a distributed RDD API, explicit support for Row / namedtuple (including dict-like access), and a single-column schema given as a bare type (#419).
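
The Row / namedtuple gap comes down to the order in which row shapes are tried. The following is a minimal sketch of a "named fields first, then sequence" normalizer in plain Python. `normalize_row` and `Person` are hypothetical names used only for illustration; this is not the robin-sparkless implementation, which lives in the Rust binding.

```python
from collections import namedtuple
from collections.abc import Mapping, Sequence


def normalize_row(row):
    """Return a dict (named fields) or a list (positional fields) for one row.

    Hypothetical helper for illustration; not part of robin-sparkless.
    """
    # PySpark's Row exposes asDict(); use it when present to keep field names.
    if hasattr(row, "asDict"):
        return dict(row.asDict())
    # namedtuple instances expose _asdict(). Check this before the generic
    # sequence path, because a namedtuple is also a tuple and would otherwise
    # lose its field names.
    if isinstance(row, tuple) and hasattr(row, "_asdict"):
        return dict(row._asdict())
    # Dict-like rows: plain dicts and other Mapping implementations.
    if isinstance(row, Mapping):
        return dict(row)
    # Objects that offer keys() and item access without being a Mapping.
    if hasattr(row, "keys") and hasattr(row, "__getitem__"):
        return {key: row[key] for key in row.keys()}
    # Sequence-like rows (list, tuple); strings are scalars, not rows.
    if isinstance(row, Sequence) and not isinstance(row, (str, bytes)):
        return list(row)
    raise TypeError(f"unsupported row type: {type(row).__name__}")


Person = namedtuple("Person", ["name", "age"])
rows = [Person("Alice", 30), {"name": "Bob", "age": 25}, ("Carol", 41)]
normalized = [normalize_row(r) for r in rows]
```

Named rows come back as dicts and positional rows as lists, so both can feed the existing list-of-dicts and list-of-rows paths.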

Schema (`schema`)

| Schema form | PySpark | Robin-sparkless |
| --- | --- | --- |
| `None` (infer) | ✅ | ✅ (dict keys or `_1`,`_2`,…; types inferred) |
| List of column names | ✅ | ✅ |
| StructType / .fields | ✅ | ✅ |
| List of (name, type_str) | (via StructType) | ✅ |
| DDL string (flat) | ✅ e.g. `"name: string, age: int"` | ✅ |
| DDL string (nested) | ✅ e.g. `addr struct<street:string,city:string>` | ✅ (via spark-ddl-parser crate; struct, array, map, decimal) |
| Single DataType (non-struct) | ✅ Wraps as single column `"value"`, row as tuple | ❌ Not supported. We require a full struct (multiple named columns). |

Summary: Nested DDL is supported as of 0.11.0. Still missing: a single-column schema given as a bare type (e.g. `schema="string"` or `schema=StringType()`), which PySpark maps to a single column named `"value"`.
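
Until a bare type is accepted as a schema, the PySpark behavior can be approximated on the caller's side. The sketch below wraps each scalar in a one-element tuple and builds a list of `(name, type_str)` pairs, a schema form the table above lists as supported. `expand_single_type_schema` is a hypothetical helper, not a robin-sparkless function.

```python
def expand_single_type_schema(data, type_str, column_name="value"):
    """Turn scalar rows plus a bare type string into (rows, schema).

    Hypothetical helper for illustration. Each value becomes a 1-tuple and
    the schema becomes a single (name, type_str) pair, mirroring how PySpark
    wraps a non-struct DataType into one column named "value".
    """
    rows = [(value,) for value in data]
    schema = [(column_name, type_str)]
    return rows, schema


rows, schema = expand_single_type_schema(["a", "b", "c"], "string")
# rows and schema can then be passed as createDataFrame(rows, schema=schema).
```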

Behavior

| Behavior | PySpark | Robin-sparkless |
| --- | --- | --- |
| Empty data `[]` | Requires schema (no inference). | ✅ Accepts `[]` with or without schema. |
| `verifySchema=True` | Validates every row’s types against schema. | Parameter accepted; validation is best-effort during build (Rust side may raise on type mismatch). No per-row verification step. |
| Column order (dict rows) | Follows schema or insertion order. | ✅ Follows schema when given; else first row’s key order. |

Summary: An optional improvement is strict per-row schema verification when `verify_schema=True`, with explicit type checks and clear error messages.
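
As a rough illustration of what strict per-row verification could look like, the sketch below checks positional rows against a list of `(name, type_str)` pairs and raises with the row index, column name, expected type and offending value. `verify_rows` and the `_PY_TYPES` mapping are hypothetical and cover only a handful of type strings. Treating `None` as valid for every column is an assumption of this sketch.

```python
# Illustrative mapping from a few type strings to accepted Python types.
_PY_TYPES = {
    "string": (str,),
    "int": (int,),
    "bigint": (int,),
    "double": (float,),
    "boolean": (bool,),
}


def verify_rows(rows, schema):
    """Check every positional row against a list of (name, type_str) pairs.

    Hypothetical helper for illustration; not part of robin-sparkless.
    """
    for index, row in enumerate(rows):
        if len(row) != len(schema):
            raise ValueError(
                f"row {index}: expected {len(schema)} fields, got {len(row)}"
            )
        for (name, type_str), value in zip(schema, row):
            if value is None:
                continue  # assumption: every column is nullable
            expected = _PY_TYPES.get(type_str)
            if expected is None:
                continue  # type strings outside the mapping are not checked
            # bool is a subclass of int in Python, so reject it explicitly
            # for non-boolean columns.
            if isinstance(value, bool) and bool not in expected:
                matches = False
            else:
                matches = isinstance(value, expected)
            if not matches:
                raise TypeError(
                    f"row {index}, column '{name}': expected {type_str}, "
                    f"got {type(value).__name__} ({value!r})"
                )


schema = [("name", "string"), ("age", "int")]
verify_rows([("Alice", 30), ("Bob", None)], schema)  # passes silently

try:
    verify_rows([("Alice", 30), ("Carol", "41")], schema)
except TypeError as exc:
    message = str(exc)
```

Running the check before any rows are handed to the build step keeps error messages in terms of the caller's row index and column name.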

Recommended additions (priority)

| # | Addition | GitHub issue |
| --- | --- | --- |
| 1 | Pandas DataFrame – Accept `pandas.DataFrame` as `data`; convert to list of dicts (or rows) and call existing path. High value for Python users. | #416 |
| 2 | Row / namedtuple – Explicitly support Row (and Row-like) and namedtuple: try dict-like (e.g. `get_item` / keys) then sequence-like so `[Row(a=1, b=2)]` works. | #417 |
| 3 | ~~Full DDL~~ – ✅ Done (0.11.0): spark-ddl-parser integrated; nested `struct<>`, `array<>`, `map<>` in DDL work. | #418 |
| 4 | Single-column schema – Allow schema = single type (or string) and treat as one column named `"value"`, wrapping each row in a single-element tuple. | #419 |
| 5 | verify_schema – When True, add an explicit per-row type check and raise with a clear message on mismatch. | #420 |
| 6 | PyArrow Table / numpy – Lower priority; accept pyarrow.Table and numpy.ndarray as data for better parity. | #421 |