createDataFrame: Missing Features vs PySpark
This page compares robin-sparkless createDataFrame(data, schema=None, sampling_ratio=None, verify_schema=True) with PySpark's createDataFrame to identify gaps.
Parameters (signature)
| Feature | PySpark | Robin-sparkless |
|---|---|---|
| data | RDD, list, pandas.DataFrame, numpy.ndarray, pyarrow.Table | ✅ list (dicts or list/tuple rows), pandas, numpy, pyarrow (via normalize_create_dataframe_input in python/src/lib.rs); ✅ RDD via collect-to-list |
| schema=None | ✅ | ✅ |
| samplingRatio=None | ✅ (used for RDD inference) | ✅ accepted, ignored for list data |
| verifySchema=True | ✅ (validates each row) | ✅ accepted; validation is best-effort when building |
No missing parameters; optional args are present.
Data input types (data)
| Input type | PySpark | Robin-sparkless |
|---|---|---|
| List of dicts | ✅ | ✅ |
| List of list/tuple | ✅ | ✅ |
| List of Row | ✅ (schema inferred from Row) | ⚠️ May work if Row is sequence-like (extracts as list); not explicitly supported. Dict-like Row (e.g. row["name"]) is not tried. |
| List of namedtuple | ✅ | ⚠️ May work if extracts as list/tuple. |
| RDD | ✅ | ✅ (collects to list of dicts; no distributed RDD API) |
| pandas.DataFrame | ✅ | ✅ (to_dict("records") / column-order path in binding) |
| numpy.ndarray | ✅ | ✅ (tolist() with 1D/2D handling) |
| pyarrow.Table | ✅ (since 4.0) | ✅ (to_pylist()) |
Summary: Remaining gaps are a distributed RDD API (RDDs are collected to a list), explicit Row / namedtuple support (dict-like access is not tried), and single-column schema as a bare type (#419).
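
The Row / namedtuple gap could be closed by trying dict-like access first and falling back to sequence-like access. The following is a minimal pure-Python sketch of that order of checks, not the robin-sparkless implementation: normalize_row is a hypothetical helper, and the asDict() check assumes a Row-like object that exposes that method, as PySpark's Row does.

```python
from collections import namedtuple
from collections.abc import Mapping, Sequence


def normalize_row(row):
    """Return (names, values) for one input row; names is None when unknown.

    Hypothetical helper for illustration only; not part of robin-sparkless.
    """
    # 1. Dicts and other mappings: column names come from the keys.
    if isinstance(row, Mapping):
        return list(row.keys()), list(row.values())
    # 2. Row-like objects that expose asDict() (assumption: PySpark-style Row).
    as_dict = getattr(row, "asDict", None)
    if callable(as_dict):
        d = as_dict()
        return list(d.keys()), list(d.values())
    # 3. namedtuples: field names are stored in _fields.
    fields = getattr(row, "_fields", None)
    if fields is not None:
        return list(fields), list(row)
    # 4. Plain lists/tuples: positional values only, no names.
    if isinstance(row, Sequence) and not isinstance(row, (str, bytes)):
        return None, list(row)
    raise TypeError(f"unsupported row type: {type(row).__name__}")


Person = namedtuple("Person", ["name", "age"])
names, values = normalize_row(Person("Ann", 30))
```

With this ordering, a namedtuple keeps its field names instead of being flattened to positional columns, while plain lists and tuples still take the existing sequence path.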
Schema (schema)
| Schema form | PySpark | Robin-sparkless |
|---|---|---|
| None (infer) | ✅ | ✅ (dict keys or _1,_2,…; types inferred) |
| List of column names | ✅ | ✅ |
| StructType / .fields | ✅ | ✅ |
| List of (name, type_str) | (via StructType) | ✅ |
| DDL string (flat) | ✅ e.g. "name: string, age: int" | ✅ |
| DDL string (nested) | ✅ e.g. addr struct<street:string,city:string> | ✅ (via spark-ddl-parser crate; struct, array, map, decimal) |
| Single DataType (non-struct) | ✅ Wraps as single column "value", row as tuple | ❌ Not supported. We require a full struct (multiple named columns). |
Summary: Nested DDL is now supported (0.11.0). Missing: single-column schema as a type (e.g. schema="string" or schema=StringType() with single column "value").
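
The single-column behavior described for PySpark (one column named "value", each row wrapped in a tuple) can be sketched in a few lines. This is an illustration under stated assumptions, not existing robin-sparkless code: wrap_single_column is a hypothetical helper, and it emits the schema in the list of (name, type_str) form that the table above lists as supported.

```python
def wrap_single_column(data, type_name):
    """Turn scalar items into 1-tuples and build a one-field schema.

    Hypothetical helper for illustration only; not part of robin-sparkless.
    """
    # The single column is always named "value", matching PySpark.
    schema = [("value", type_name)]
    # Each scalar becomes a single-element tuple so it looks like a row.
    rows = [(item,) for item in data]
    return rows, schema


rows, schema = wrap_single_column(["a", "b"], "string")
# rows and schema could then be passed through the existing struct path.
```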
Behavior
| Behavior | PySpark | Robin-sparkless |
|---|---|---|
| Empty data [] | Requires schema (no inference). | ✅ Accepts [] with or without schema. |
| verifySchema=True | Validates every row’s types against schema. | Parameter accepted; validation is best-effort during build (Rust side may raise on type mismatch). No per-row verification step. |
| Column order (dict rows) | Follows schema or insertion order. | ✅ Follows schema when given; else first row’s key order. |
Summary: An optional improvement is strict per-row schema verification when verify_schema=True (an explicit type check with clear error messages).
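
A strict per-row check could look roughly like the sketch below. It is an assumption-laden illustration rather than the planned implementation: verify_rows and the PY_TYPES mapping are hypothetical, the schema is given as a list of (name, type_str), only a handful of type strings are covered, and None is treated as valid for every column (that is, all columns are assumed nullable).

```python
# Hypothetical mapping from schema type strings to accepted Python types.
PY_TYPES = {
    "string": (str,),
    "int": (int,),
    "bigint": (int,),
    "double": (float,),
    "boolean": (bool,),
}


def verify_rows(rows, schema):
    """Raise on the first row that does not match schema.

    Hypothetical helper for illustration only; not part of robin-sparkless.
    """
    for i, row in enumerate(rows):
        if len(row) != len(schema):
            raise ValueError(
                f"row {i}: expected {len(schema)} values, got {len(row)}"
            )
        for value, (name, type_name) in zip(row, schema):
            if value is None:
                continue  # assumption: every column is nullable
            expected = PY_TYPES.get(type_name)
            if expected is None:
                continue  # type strings outside this sketch are not checked
            # bool is a subclass of int in Python, so reject it explicitly
            # for non-boolean columns before the isinstance check.
            is_stray_bool = isinstance(value, bool) and bool not in expected
            if is_stray_bool or not isinstance(value, expected):
                raise TypeError(
                    f"row {i}, column {name!r}: expected {type_name}, "
                    f"got {type(value).__name__}"
                )


people_schema = [("name", "string"), ("age", "int")]
verify_rows([("Ann", 30), ("Bob", None)], people_schema)  # passes silently
```

Including the row index and column name in the message is what gives the caller a clear error, instead of a type mismatch surfacing later from the build step.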
Recommended additions (priority)
| # | Addition | GitHub issue |
|---|---|---|
| 1 | Pandas DataFrame – Accept pandas.DataFrame as data; convert to list of dicts (or rows) and call existing path. High value for Python users. | #416 |
| 2 | Row / namedtuple – Explicitly support Row (and Row-like) and namedtuple: try dict-like (e.g. get_item / keys) then sequence-like so [Row(a=1, b=2)] works. | #417 |
| 3 | ~~Full DDL~~ – ✅ Done (0.11.0): spark-ddl-parser integrated; nested struct<>, array<>, map<> in DDL work. | #418 |
| 4 | Single-column schema – Allow schema = single type (or string) and treat as one column named "value", wrapping each row in a single-element tuple. | #419 |
| 5 | verify_schema – When True, add an explicit per-row type check and raise with a clear message on mismatch. | #420 |
| 6 | PyArrow Table / numpy – Lower priority; accept pyarrow.Table and numpy.ndarray as data for better parity. | #421 |