Parity Functions: Test Expectations and PySpark Alignment

Test expectations match PySpark

Expected outputs under tests/expected_outputs/ are generated from PySpark via tests/tools/generate_expected_outputs.py. They reflect PySpark behavior and are the source of truth for parity.

Engine fixes applied (parity/functions)

initcap: Implemented as a UDF that title-cases each word (first letter uppercase, rest lowercase) so behavior matches PySpark. Polars 0.53 has no to_titlecase.
xxhash64: Uses seed 42 in apply_xxhash64 to align with Spark’s XXH64 seed; hash values now match PySpark for the same inputs.
json_tuple: Key arguments are coerced to strings when possible (e.g. o.extract::<String>().unwrap_or_else(|_| o.to_string())) so keys from plans or other types don’t raise "keys must be strings".
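
As a rough illustration of the initcap rule described above, the following pure-Python sketch title-cases each word. The name `initcap_sketch` is hypothetical and this is not the engine's UDF; it assumes words are delimited by literal space characters and that nulls pass through unchanged, which may not cover every case PySpark handles.

```python
from typing import Optional


def initcap_sketch(value: Optional[str]) -> Optional[str]:
    """Hypothetical helper: uppercase the first letter of each word, lowercase the rest."""
    if value is None:
        # Assumption: null inputs are returned unchanged.
        return None
    # Splitting on a literal space (not on arbitrary whitespace) keeps runs of
    # spaces intact when the pieces are joined back together.
    return " ".join(word[:1].upper() + word[1:].lower() for word in value.split(" "))


if __name__ == "__main__":
    print(initcap_sketch("hello WORLD"))   # Hello World
    print(initcap_sketch("it's a tEST"))   # It's A Test
```

Python's built-in `str.title()` is not a drop-in substitute for this rule, because it also capitalizes letters that follow an apostrophe (`"it's".title()` gives `"It'S"`).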

Fixes applied (literal replication, explode, log)

Literal replication: When no expression in select() references a column from the frame, the engine now cross-joins a single key column with the literal-derived result so the output has the same row count as the input (N rows). Fixes get_json_object, math_exp, and similar literal-only selects.
Split + explode: When withColumn(name, F.explode(F.col(name))) replaces a list column with its exploded form, the engine now uses LazyFrame.explode() so other columns are replicated correctly. Fixes test_split_with_limit_parity, test_split_without_limit_parity, test_split_with_limit_minus_one_parity.
log(float base): Python log(col_or_base, base_or_col=None) now accepts PySpark’s log(base, column) and log(column, base); one argument may be a numeric base (int/float), the other a Column. Fixes test_log_with_float_base_parity, test_log_with_different_bases_parity.
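
The argument handling described for log can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `Column` is a hypothetical stand-in for the real column type, `resolve_log_args` is a hypothetical helper name, and treating a single-argument call as a natural log is an assumption based on PySpark's behavior.

```python
from dataclasses import dataclass
from numbers import Real
from typing import Optional, Tuple


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for the engine's Column type."""
    name: str


def _is_numeric_base(value: object) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def resolve_log_args(col_or_base, base_or_col=None) -> Tuple[Optional[float], Column]:
    """Return (base, column); base is None when only a column is given."""
    if base_or_col is None:
        if not isinstance(col_or_base, Column):
            raise TypeError("log(x) expects a Column")
        # Assumption: one argument means natural log of that column.
        return None, col_or_base
    if _is_numeric_base(col_or_base) and isinstance(base_or_col, Column):
        # log(base, column)
        return float(col_or_base), base_or_col
    if isinstance(col_or_base, Column) and _is_numeric_base(base_or_col):
        # log(column, base)
        return float(base_or_col), col_or_base
    raise TypeError("log expects one numeric base and one Column")


if __name__ == "__main__":
    print(resolve_log_args(2.5, Column("x")))
    print(resolve_log_args(Column("x"), 10))
```

Normalizing both call orders to a single (base, column) pair up front keeps the rest of the expression-building code independent of how the caller ordered the arguments.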

Remaining parity failures (engine gaps)

These tests still fail because of current engine behavior; expectations are correct (PySpark-aligned).

| Area | Failure | Cause |
| --- | --- | --- |
| Literal replication (mixed) | `levenshtein`, `json_tuple` (and similar) | Expression references a column (e.g. `levenshtein(col("name"), lit(""))`) so the “no column refs” path does not apply; or schema/column shape differs (e.g. `json_tuple`). |
| Struct field alias | `test_struct_field_with_alias_*` | Selecting struct fields with aliases returns `None` for the extracted value; needs struct/alias handling to match PySpark. |
| Window orderBy list | `test_window_orderby_list_multiple_columns_parity` | `Window.orderBy([list of columns])` ordering differs from PySpark. |
| array_contains with column | `test_array_contains_join_*` | `list.contains` with a column argument (join) fails: type/plan handling for "column in list" in join/filter. |
| Array/column dtype | Various `test_array_*` | Wrong column used (e.g. list op on string "tags"), output naming, or row counts (explode/array_union). |

Regenerating expected outputs from PySpark will not fix these; they require engine/plan/API changes in the Rust/Python layers.