Spark DDL Parser (Rust) – Integrated
Status: Done. Robin-sparkless now uses the spark-ddl-parser crate from crates.io for DDL schema parsing in createDataFrame(data, schema="..."). The hand-rolled split_ddl_top_level logic has been replaced.
Original motivation
Port the Python spark-ddl-parser to a Rust crate so robin-sparkless (and others) can parse PySpark DDL schema strings in Rust without calling Python.
Reference
- Python package: spark-ddl-parser on PyPI
- Repo: https://github.com/eddiethedean/spark-ddl-parser
- API: parse_ddl_schema(ddl_string: str) -> StructType; returns structured types (StructType, StructField, DataType variants).
Scope
Types to support (match Python)
- [ ] Simple types: string, int, integer, long, bigint, double, float, short, smallint, byte, tinyint, boolean, bool, date, timestamp, binary
- [ ] Arrays: array<element_type>, e.g. array<string>, array<long>
- [ ] Maps: map<key_type,value_type>, e.g. map<string,int>
- [ ] Structs: struct<field:type,...>, including nested structs
- [ ] Decimal: decimal(precision,scale), e.g. decimal(10,2)
Parsing behavior
- [ ] Separators: Both name type (space) and name:type (colon)
- [ ] Top-level: Comma-separated list of fields; commas inside <...> or (...) must not split (bracket-aware)
- [ ] Struct fields: Same format inside struct<...> (name/type with space or colon)
- [ ] Whitespace: Trim and allow newlines between fields
- [ ] Errors: Clear messages for invalid DDL (missing type, unbalanced brackets, etc.)
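The bracket-aware rule above can be sketched as a single pass that tracks nesting depth and only splits on commas at depth 0. This is a minimal illustration, not the published crate's code: the function name split_top_level and the String error type are assumptions made for the example.

```rust
/// Split a DDL string on commas that sit outside `<...>` and `(...)`.
/// Hypothetical helper: name and error type are illustrative only.
pub fn split_top_level(ddl: &str) -> Result<Vec<&str>, String> {
    let mut parts = Vec::new();
    let mut depth: usize = 0; // current nesting depth of < > and ( )
    let mut start = 0; // byte offset where the current field begins

    for (i, ch) in ddl.char_indices() {
        match ch {
            '<' | '(' => depth += 1,
            '>' | ')' => {
                // A closer with nothing open is an unbalanced-bracket error.
                depth = depth
                    .checked_sub(1)
                    .ok_or_else(|| format!("unexpected '{ch}' at byte {i}"))?;
            }
            ',' if depth == 0 => {
                // Trimming here also absorbs newlines between fields.
                parts.push(ddl[start..i].trim());
                start = i + 1;
            }
            _ => {}
        }
    }
    if depth != 0 {
        return Err(format!("unbalanced brackets: {depth} still open at end of input"));
    }
    let last = ddl[start..].trim();
    if !last.is_empty() {
        parts.push(last);
    }
    Ok(parts)
}
```

The sketch counts both bracket kinds with one depth counter, so it does not catch mismatched pairs such as `<` closed by `)`; a full parser would reject those when it parses each field.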
Deliverables
- [ ] New crate (e.g. spark-ddl-parser or spark_ddl_parser) with Cargo.toml, README, LICENSE
- [ ] Public API: parse_ddl_schema(ddl: &str) -> Result<StructType, ParseError>
- [ ] Type model: StructType, StructField, DataType enum (Simple, Array, Map, Struct, Decimal)
- [ ] Optional: StructField has nullable: bool (default true) if we want full parity
- [ ] Unit tests covering: flat schema, nested struct, array, map, decimal, colon vs space, error cases
- [ ] Integration: Use the crate in robin-sparkless parse_schema_param() for DDL strings (replace current simple split-on-comma logic)
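One possible shape for the type model and error type named in the deliverables is sketched below. The variant payloads follow the task list in this document; the StructField::new constructor, the message and position fields of ParseError, and the error wording are assumptions for illustration and may differ from the published crate.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Simple(String),                    // e.g. "string", "long"
    Array(Box<DataType>),              // array<element_type>
    Map(Box<DataType>, Box<DataType>), // map<key_type,value_type>
    Struct(StructType),                // struct<field:type,...>
    Decimal(u32, u32),                 // decimal(precision,scale)
}

#[derive(Debug, Clone, PartialEq)]
pub struct StructField {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

impl StructField {
    /// Hypothetical constructor: fields default to nullable = true.
    pub fn new(name: &str, data_type: DataType) -> Self {
        StructField { name: name.to_string(), data_type, nullable: true }
    }
}

#[derive(Debug, Clone, PartialEq, Default)]
pub struct StructType {
    pub fields: Vec<StructField>,
}

/// Assumed error shape: a message plus the byte offset where parsing failed.
#[derive(Debug, Clone, PartialEq)]
pub struct ParseError {
    pub message: String,
    pub position: usize,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid DDL at byte {}: {}", self.position, self.message)
    }
}

impl std::error::Error for ParseError {}
```

With this model, a field declared as tags array<string> would be represented as StructField::new("tags", DataType::Array(Box::new(DataType::Simple("string".to_string())))).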
Tasks (ordered)
- [ ] Create crate layout
  - Add crates/spark-ddl-parser/ (or top-level spark-ddl-parser/) with Cargo.toml, src/lib.rs, README.md. Add to workspace if using a workspace.
- [ ] Define type model
  - DataType enum: Simple(String), Array(Box<DataType>), Map(Box<DataType>, Box<DataType>), Struct(StructType), Decimal(u32, u32)
  - StructField { name: String, data_type: DataType, nullable: bool }
  - StructType { fields: Vec<StructField> }
  - Implement Display/Debug and optionally Serialize for debugging/serialization.
- [ ] Implement lexer/tokenizer
  - Tokenize: identifiers, <, >, comma, :, (, ), whitespace (or skip).
  - Bracket-aware splitting at top level: scan for commas at depth 0 (track nesting of <...> and (...)).
- [ ] Implement parser
  - Parse top-level: field_def ("," field_def)*
  - Field: identifier (":"|" ") data_type
  - Data type: simple word | array "<" data_type ">" | map "<" data_type "," data_type ">" | struct "<" field_def_list ">" | decimal "(" number "," number ")"
  - Recursive for nested structs/array/map.
- [ ] Error handling
  - Custom ParseError (or use a crate like thiserror).
  - Report position or snippet where parsing failed when possible.
- [ ] Tests
  - From Python package: copy or port representative tests (simple, nested struct, array, map, decimal, colon format, invalid DDL).
  - Add tests for edge cases: empty string, single field, deeply nested.
- [ ] Documentation
  - README with examples matching Python quick start.
  - Doc comments on public types and parse_ddl_schema.
  - Note compatibility with PySpark DDL / spark-ddl-parser (Python).
- [ ] Integrate into robin-sparkless
  - Add dependency on spark-ddl-parser (path or crates.io if published).
  - In src/python/session.rs (or shared Rust code): when schema is a string, call spark_ddl_parser::parse_ddl_schema(ddl) and convert StructType to our Vec<(String, String)> (flatten or pass type strings as needed for create_dataframe_from_rows).
  - Remove or narrow the current hand-rolled DDL split logic.
  - Run existing createDataFrame tests (including DDL tests) and fix any behavior differences.
- [ ] Publish (optional)
  - Publish crate to crates.io.
  - Add repository link (e.g. under same org as robin-sparkless or spark-ddl-parser Python).
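The grammar in the parser task can be sketched as a small recursive-descent parser. This is a simplified illustration under stated assumptions, not the crate's implementation: parse_fields and Parser are hypothetical names, structs are modelled as plain (name, type) pairs without a nullable flag, errors are plain strings with a byte offset, decimal requires an explicit (precision,scale), and empty input is rejected.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Simple(String),
    Array(Box<DataType>),
    Map(Box<DataType>, Box<DataType>),
    Struct(Vec<(String, DataType)>), // simplified: (name, type) pairs only
    Decimal(u32, u32),
}

struct Parser<'a> {
    src: &'a str,
    pos: usize, // byte offset of the next unread character
}

impl<'a> Parser<'a> {
    /// Skip whitespace (including newlines) and return the unread remainder.
    fn rest(&mut self) -> &'a str {
        let src = self.src;
        let trimmed = src[self.pos..].trim_start();
        self.pos = src.len() - trimmed.len();
        trimmed
    }

    /// Consume the punctuation `c`, or report where it was expected.
    fn expect(&mut self, c: char) -> Result<(), String> {
        if !self.rest().starts_with(c) {
            return Err(format!("expected '{c}' at byte {}", self.pos));
        }
        self.pos += c.len_utf8();
        Ok(())
    }

    /// Read an identifier, type keyword, or number.
    fn word(&mut self) -> Result<&'a str, String> {
        let rest = self.rest();
        let len = rest.find(|c: char| !(c.is_alphanumeric() || c == '_')).unwrap_or(rest.len());
        if len == 0 {
            return Err(format!("expected a name or type at byte {}", self.pos));
        }
        self.pos += len;
        Ok(&rest[..len])
    }

    fn number(&mut self) -> Result<u32, String> {
        let at = self.pos;
        self.word()?.parse::<u32>().map_err(|_| format!("expected a number at byte {at}"))
    }

    fn data_type(&mut self) -> Result<DataType, String> {
        let name = self.word()?.to_ascii_lowercase();
        // Parameterised types share the open/parse/close shape; anything else is simple.
        let (open, close) = match name.as_str() {
            "array" | "map" | "struct" => ('<', '>'),
            "decimal" => ('(', ')'),
            other => return Ok(DataType::Simple(other.to_string())),
        };
        self.expect(open)?;
        let parsed = match name.as_str() {
            "array" => DataType::Array(Box::new(self.data_type()?)),
            "map" => {
                let key = Box::new(self.data_type()?);
                self.expect(',')?;
                DataType::Map(key, Box::new(self.data_type()?))
            }
            "struct" => DataType::Struct(self.fields()?),
            _ => {
                // Only "decimal" reaches this arm.
                let precision = self.number()?;
                self.expect(',')?;
                DataType::Decimal(precision, self.number()?)
            }
        };
        self.expect(close)?;
        Ok(parsed)
    }

    /// field_def ("," field_def)*, where field_def = name (":" | " ") data_type
    fn fields(&mut self) -> Result<Vec<(String, DataType)>, String> {
        let mut out = Vec::new();
        loop {
            let name = self.word()?.to_string();
            let _ = self.expect(':'); // colon is optional; whitespace also separates
            out.push((name, self.data_type()?));
            if self.expect(',').is_err() {
                return Ok(out);
            }
        }
    }
}

/// Hypothetical entry point: parse a whole DDL string into (name, type) pairs.
pub fn parse_fields(ddl: &str) -> Result<Vec<(String, DataType)>, String> {
    let mut p = Parser { src: ddl, pos: 0 };
    let fields = p.fields()?;
    if !p.rest().is_empty() {
        return Err(format!("unexpected input at byte {}", p.pos));
    }
    Ok(fields)
}
```

Because data_type and fields call each other, nested structs, arrays, and maps need no separate bracket-counting pass; unbalanced brackets surface as a failed expect or as leftover input at the top level.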
Notes
- No external parser dependency required: A recursive-descent or hand-written parser is enough; no need for a full SQL parser.
- Output format for robin-sparkless: Our create_dataframe_from_rows uses Vec<(String, String)> (name, dtype_str). The Rust DDL parser’s StructType can be converted to that by turning each DataType into a string (e.g. long, string, array<bigint>, struct<...>) so we don’t need to change the rest of the pipeline.
- Python parity: Aim for same accepted DDL and same structure; type name normalization (e.g. integer→int) can match Python for compatibility.
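The conversion described in the output-format note could look roughly like the sketch below. It redefines a minimal type model so the example stands alone; type_to_string, to_name_dtype_pairs, and normalize are hypothetical names, and the normalization table contains only the integer→int case mentioned above (any further aliases would need to be checked against the Python package).

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Simple(String),
    Array(Box<DataType>),
    Map(Box<DataType>, Box<DataType>),
    Struct(StructType),
    Decimal(u32, u32),
}

#[derive(Debug, Clone, PartialEq)]
pub struct StructField {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct StructType {
    pub fields: Vec<StructField>,
}

/// Assumed normalization: lowercase, and map "integer" to "int".
fn normalize(name: &str) -> String {
    match name.to_ascii_lowercase().as_str() {
        "integer" => "int".to_string(),
        other => other.to_string(),
    }
}

/// Render a DataType back into a DDL-style type string.
pub fn type_to_string(dt: &DataType) -> String {
    match dt {
        DataType::Simple(name) => normalize(name),
        DataType::Array(elem) => format!("array<{}>", type_to_string(elem)),
        DataType::Map(k, v) => format!("map<{},{}>", type_to_string(k), type_to_string(v)),
        DataType::Decimal(p, s) => format!("decimal({p},{s})"),
        DataType::Struct(st) => {
            let inner: Vec<String> = st
                .fields
                .iter()
                .map(|f| format!("{}:{}", f.name, type_to_string(&f.data_type)))
                .collect();
            format!("struct<{}>", inner.join(","))
        }
    }
}

/// Flatten the top-level fields into (name, dtype_str) pairs.
pub fn to_name_dtype_pairs(schema: &StructType) -> Vec<(String, String)> {
    schema
        .fields
        .iter()
        .map(|f| (f.name.clone(), type_to_string(&f.data_type)))
        .collect()
}
```

Only the top level is flattened; nested types stay encoded inside the dtype string, which keeps the Vec<(String, String)> interface unchanged. The nullable flag is dropped in this conversion.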