Signature alignment tasks: Robin-Sparkless → PySpark

Checklist derived from SIGNATURE_GAP_ANALYSIS.md / signature_comparison.json. Goal: make Python parameter names and optional args match PySpark so existing PySpark call sites work unchanged.

Status (February 2026): Section 1 (column→col renames), Section 2 (optional params), Section 3 (param count/shape), and Section 4 (other renames) are complete. The optional-parameter behaviors are implemented (assert_true errMsg, like/ilike escapeChar, months_between roundOff, parse_url key, make_timestamp timezone, to_char/to_timestamp format, when(condition, value)), with parity fixtures added for each; the exception is to_number/try_to_number, which accept format in the signature but reserve it for future format-based parsing. PyO3 signatures and into_py→into_py_any fixes have been applied.

How to apply:

- In `src/python/mod.rs`, either (1) rename the `#[pyfunction]` parameter to the PySpark name, or (2) add `#[pyo3(signature = (col, ...))]` and keep the Rust param name.
- For "Add optional", add the parameter with the same default as PySpark (check PySpark SQL API).
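To illustrate why the parameter names matter for keyword-style call sites, here is a minimal sketch using Python's `inspect.signature`. The two functions are hypothetical stand-ins for a PySpark-style and a pre-rename signature; they are not the real PySpark or robin-sparkless functions.

```python
import inspect


def pyspark_style_acos(col):
    """Hypothetical stand-in for a function declared as acos(col)."""
    return ("acos", col)


def robin_style_acos(column):
    """Hypothetical stand-in for the pre-rename declaration acos(column)."""
    return ("acos", column)


def param_names(func):
    """Return the parameter names of func in declaration order."""
    return list(inspect.signature(func).parameters)


def accepts_keywords(func, **kwargs):
    """Return True if func can be called with exactly these keyword arguments."""
    try:
        inspect.signature(func).bind(**kwargs)
        return True
    except TypeError:
        return False


print(param_names(pyspark_style_acos))               # ['col']
print(param_names(robin_style_acos))                 # ['column']
print(accepts_keywords(robin_style_acos, col="x"))   # False
```

A call site written as `acos(col=...)` binds only when the parameter is actually named `col`, which is the situation the renames in this checklist address.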

1. Simple rename: `column` → `col` (single-arg unary)

These functions only need the first (and only) parameter renamed from column to col in the Python signature.

[ ]	Function
[x]	`acos`
[x]	`acosh`
[x]	`array_agg`
[x]	`array_compact`
[x]	`array_distinct`
[x]	`asc`
[x]	`asc_nulls_first`
[x]	`asc_nulls_last`
[x]	`ascii`
[x]	`asin`
[x]	`asinh`
[x]	`atan`
[x]	`atanh`
[x]	`avg`
[x]	`base64`
[x]	`bin`
[x]	`bit_count`
[x]	`bit_get`
[x]	`bit_length`
[x]	`bitwise_not`
[x]	`bround`
[x]	`cbrt`
[x]	`ceiling`
[x]	`char`
[x]	`cos`
[x]	`cosh`
[x]	`cot`
[x]	`count`
[x]	`csc`
[x]	`day`
[x]	`dayofmonth`
[x]	`dayofweek`
[x]	`dayofyear`
[x]	`degrees`
[x]	`desc`
[x]	`desc_nulls_first`
[x]	`desc_nulls_last`
[x]	`explode_outer`
[x]	`expm1`
[x]	`factorial`
[x]	`getbit`
[x]	`hex`
[x]	`isnan`
[x]	`isnotnull`
[x]	`isnull`
[x]	`ln`
[x]	`log10`
[x]	`log1p`
[x]	`log2`
[x]	`map_from_entries`
[x]	`max`
[x]	`md5`
[x]	`median`
[x]	`min`
[x]	`mode`
[x]	`negate`
[x]	`negative`
[x]	`positive`
[x]	`quarter`
[x]	`radians`
[x]	`rint`
[x]	`sec`
[x]	`sha1`
[x]	`signum`
[x]	`sin`
[x]	`sinh`
[x]	`stddev_pop`
[x]	`sum`
[x]	`tan`
[x]	`tanh`
[x]	`timestamp_micros`
[x]	`timestamp_millis`
[x]	`timestamp_seconds`
[x]	`to_degrees`
[x]	`to_radians`
[x]	`typeof`
[x]	`unbase64`
[x]	`unhex`
[x]	`unix_date`
[x]	`unix_micros`
[x]	`unix_millis`
[x]	`unix_seconds`
[x]	`var_pop`
[x]	`weekday`
[x]	`weekofyear`

Total: 85

2. Add optional parameter(s)

PySpark has one or more optional parameters that robin-sparkless is missing. Add the param(s) with PySpark's default.

[ ]	Function	PySpark signature	Robin today	Add
[x]	`assert_true`	`assert_true(col, errMsg)`	`assert_true(column)`	errMsg
[x]	`ilike`	`ilike(str, pattern, escapeChar)`	`ilike(column, pattern)`	escapeChar
[x]	`like`	`like(str, pattern, escapeChar)`	`like(column, pattern)`	escapeChar
[x]	`make_timestamp`	`make_timestamp(years, months, days, hours, mins, secs, timezone)`	`make_timestamp(year, month, day, hour, minute, sec)`	timezone
[x]	`months_between`	`months_between(date1, date2, roundOff=True)`	`months_between(end, start)`	roundOff
[x]	`parse_url`	`parse_url(url, partToExtract, key)`	`parse_url(column, part)`	key
[x]	`position`	`position(substr, str, start)`	`position(substr, column)`	start
[x]	`to_char`	`to_char(col, format)`	`to_char(column)`	format
[x]	`to_number`	`to_number(col, format)`	`to_number(column)`	format
[x]	`to_timestamp`	`to_timestamp(col, format)`	`to_timestamp(column)`	format
[x]	`to_varchar`	`to_varchar(col, format)`	`to_varchar(column)`	format
[x]	`try_to_number`	`try_to_number(col, format)`	`try_to_number(column)`	format
[x]	`try_to_timestamp`	`try_to_timestamp(col, format)`	`try_to_timestamp(column)`	format
[x]	`when`	`when(condition, value)`	`when(condition)`	value

Total: 14
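As an illustration of adding an optional parameter with PySpark's default, the sketch below models `months_between(date1, date2, roundOff=True)` in plain Python. It assumes the documented Spark behavior (a whole number when both dates share a day of month or are both month-ends, otherwise a 31-day-month fraction, rounded to 8 digits unless `roundOff` is false) and handles dates only, ignoring time of day. It is not the robin-sparkless implementation.

```python
from calendar import monthrange
from datetime import date


def _is_last_day(d):
    """Return True if d is the last day of its month."""
    return d.day == monthrange(d.year, d.month)[1]


def months_between(date1, date2, roundOff=True):
    """Date-only sketch of months_between with an optional roundOff flag."""
    whole = (date1.year - date2.year) * 12 + (date1.month - date2.month)
    # Same day of month, or both month-ends: the result is a whole number.
    if date1.day == date2.day or (_is_last_day(date1) and _is_last_day(date2)):
        return float(whole)
    # Otherwise the fractional part assumes 31 days per month.
    result = whole + (date1.day - date2.day) / 31.0
    return round(result, 8) if roundOff else result


# Existing two-argument call sites keep working because roundOff has a default.
print(months_between(date(1997, 2, 28), date(1996, 10, 30)))  # 3.93548387
```

Because the new parameter is optional and defaults to `True`, a two-argument call and an explicit `roundOff=True` call return the same value.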

3. Param count / shape differs (review manually)

PySpark and robin-sparkless take a different number of parameters (e.g. variadic vs two args). Check the PySpark docs and decide on a mapping.

[ ]	Function	PySpark	Robin
[x]	`arrays_zip`	`cols`	`left, right`
[x]	`bit_and`	`col`	`left, right`
[x]	`bit_or`	`col`	`left, right`
[x]	`bit_xor`	`col`	`left, right`
[x]	`elt`	`inputs`	`index, columns`
[x]	`json_array_length`	`col`	`column, path`
[x]	`map_concat`	`cols`	`a, b`
[x]	`named_struct`	`cols`	`names, columns`

Total: 8
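One possible mapping from a variadic call to a two-argument form is left-to-right chaining. The sketch below shows the idea for `map_concat`, using plain dicts as stand-ins for map columns. `map_concat2` and `map_concat_variadic` are hypothetical helper names, and duplicate-key handling is deliberately left out of scope.

```python
from functools import reduce


def map_concat2(a, b):
    """Hypothetical two-argument form: merge two maps (dicts stand in for map columns)."""
    merged = dict(a)
    merged.update(b)
    return merged


def map_concat_variadic(*cols):
    """Emulate a variadic call by chaining the two-argument form left to right."""
    if not cols:
        raise ValueError("at least one map is required")
    return reduce(map_concat2, cols)


# Keys are disjoint here; this sketch does not model duplicate-key policy.
print(map_concat_variadic({"a": 1}, {"b": 2}, {"c": 3}))  # {'a': 1, 'b': 2, 'c': 3}
```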

4. Other renames (multi-param or different names)

[ ]	Function	PySpark	Robin	Action
[x]	`add_months`	`start, months`	`column, n`	column → start; n → months
[x]	`array_append`	`col, value`	`array, elem`	array → col; elem → value
[x]	`array_except`	`col1, col2`	`a, b`	a → col1; b → col2
[x]	`array_insert`	`arr, pos, value`	`array, pos, elem`	array → arr; elem → value
[x]	`array_intersect`	`col1, col2`	`a, b`	a → col1; b → col2
[x]	`array_prepend`	`col, value`	`array, elem`	array → col; elem → value
[x]	`array_union`	`col1, col2`	`a, b`	a → col1; b → col2
[x]	`arrays_overlap`	`a1, a2`	`left, right`	left → a1; right → a2
[x]	`atan2`	`col1, col2`	`y, x`	y → col1; x → col2
[x]	`btrim`	`str, trim`	`column, trim_str`	column → str; trim_str → trim
[x]	`coalesce`	`cols`	`columns`	columns → cols
[x]	`col`	`col`	`name`	name → col
[x]	`contains`	`left, right`	`column, substring`	column → left; substring → right
[x]	`conv`	`col, fromBase, toBase`	`column, from_base, to_base`	column → col; from_base → fromBase; to_base → toBase
[x]	`convert_timezone`	`sourceTz, targetTz, sourceTs`	`source_tz, target_tz, column`	source_tz → sourceTz; target_tz → targetTz; column → sourceTs
[x]	`create_map`	`cols`	`columns`	columns → cols
[x]	`date_from_unix_date`	`days`	`column`	column → days
[x]	`date_part`	`field, source`	`column, field`	column → field; field → source
[x]	`dateadd`	`start, days`	`column, n`	column → start; n → days
[x]	`datepart`	`field, source`	`column, field`	column → field; field → source
[x]	`days`	`col`	`n`	n → col
[x]	`endswith`	`str, suffix`	`column, suffix`	column → str
[x]	`equal_null`	`col1, col2`	`left, right`	left → col1; right → col2
[x]	`extract`	`field, source`	`column, field`	column → field; field → source
[x]	`find_in_set`	`str, str_array`	`str_column, set_column`	str_column → str; set_column → str_array
[x]	`format_number`	`col, d`	`column, decimals`	column → col; decimals → d
[x]	`format_string`	`format, cols`	`format, columns`	columns → cols
[x]	`from_unixtime`	`timestamp, format='yyyy-MM-dd HH:mm:ss'`	`column, format`	column → timestamp
[x]	`from_utc_timestamp`	`timestamp, tz`	`column, tz`	column → timestamp
[x]	`get`	`col, index`	`map_col, key`	map_col → col; key → index
[x]	`greatest`	`cols`	`columns`	columns → cols
[x]	`hash`	`cols`	`columns`	columns → cols
[x]	`hours`	`col`	`n`	n → col
[x]	`hypot`	`col1, col2`	`x, y`	x → col1; y → col2
[x]	`ifnull`	`col1, col2`	`column, value`	column → col1; value → col2
[x]	`lcase`	`str`	`column`	column → str
[x]	`least`	`cols`	`columns`	columns → cols
[x]	`left`	`str, len`	`column, n`	column → str; n → len
[x]	`lit`	`col`	`value`	value → col
[x]	`locate`	`substr, str, pos=1`	`substr, column, pos`	column → str
[x]	`make_timestamp_ntz`	`years, months, days, hours, mins, secs`	`year, month, day, hour, minute, sec`	year → years; month → months; day → days; hour → hours; minute → mins; sec → secs
[x]	`map_contains_key`	`col, value`	`map_col, key`	map_col → col; key → value
[x]	`months`	`col`	`n`	n → col
[x]	`next_day`	`date, dayOfWeek`	`column, day_of_week`	column → date; day_of_week → dayOfWeek
[x]	`nvl`	`col1, col2`	`column, value`	column → col1; value → col2
[x]	`overlay`	`src, replace, pos, len=-1`	`column, replace, pos, length`	column → src; length → len
[x]	`power`	`col1, col2`	`column, exp`	column → col1; exp → col2
[x]	`printf`	`format, cols`	`format, columns`	columns → cols
[x]	`raise_error`	`errMsg`	`message`	message → errMsg
[x]	`regexp_count`	`str, regexp`	`column, pattern`	column → str; pattern → regexp
[x]	`regexp_instr`	`str, regexp, idx`	`column, pattern, group_idx`	column → str; pattern → regexp; group_idx → idx
[x]	`regexp_substr`	`str, regexp`	`column, pattern`	column → str; pattern → regexp
[x]	`replace`	`src, search, replace`	`column, search, replacement`	column → src; replacement → replace
[x]	`right`	`str, len`	`column, n`	column → str; n → len
[x]	`rlike`	`str, regexp`	`column, pattern`	column → str; pattern → regexp
[x]	`sha2`	`col, numBits`	`column, bit_length`	column → col; bit_length → numBits
[x]	`shift_left`	`col, numBits`	`column, n`	column → col; n → numBits
[x]	`shift_right`	`col, numBits`	`column, n`	column → col; n → numBits
[x]	`split_part`	`src, delimiter, partNum`	`column, delimiter, part_num`	column → src; part_num → partNum
[x]	`stack`	`cols`	`columns`	columns → cols
[x]	`startswith`	`str, prefix`	`column, prefix`	column → str
[x]	`str_to_map`	`text, pairDelim, keyValueDelim`	`column, pair_delim, key_value_delim`	column → text; pair_delim → pairDelim; key_value_delim → keyValueDelim
[x]	`struct`	`cols`	`columns`	columns → cols
[x]	`substr`	`str, pos, len`	`column, start, length`	column → str; start → pos; length → len
[x]	`to_unix_timestamp`	`timestamp, format`	`column, format`	column → timestamp
[x]	`to_utc_timestamp`	`timestamp, tz`	`column, tz`	column → timestamp
[x]	`ucase`	`str`	`column`	column → str
[x]	`unix_timestamp`	`timestamp, format='yyyy-MM-dd HH:mm:ss'`	`column, format`	column → timestamp
[x]	`url_decode`	`str`	`column`	column → str
[x]	`url_encode`	`str`	`column`	column → str
[x]	`width_bucket`	`v, min, max, numBucket`	`value, min_val, max_val, num_bucket`	value → v; min_val → min; max_val → max; num_bucket → numBucket
[x]	`years`	`col`	`n`	n → col

Total: 72
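Some Action entries reuse a name on both sides, for example `column → field; field → source` for `date_part`. Read as sequential steps these would collide, so they are best treated as one simultaneous, positional rename. The sketch below parses an Action string and applies it in a single pass; `parse_action` and `apply_renames` are hypothetical helpers, not part of the generator scripts.

```python
def parse_action(action):
    """Parse 'old → new; old2 → new2' into a dict mapping old names to new names."""
    renames = {}
    for part in action.split(";"):
        old, new = (s.strip() for s in part.split("→"))
        renames[old] = new
    return renames


def apply_renames(params, renames):
    """Rename every parameter in one pass so reused names do not collide."""
    return [renames.get(p, p) for p in params]


robin_params = ["column", "field"]
action = "column → field; field → source"
print(apply_renames(robin_params, parse_action(action)))  # ['field', 'source']
```

Applying the same helpers to a row with a single rename, such as `endswith` (`column → str`), leaves the untouched parameter `suffix` in place.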

Summary

column → col only: 85
Add optional param(s): 14
Param count differs: 8
Other renames: 72
Total partial (to align): 179

Regenerate: run `python scripts/write_signature_alignment_tasks.py` (after `compare_signatures.py`).

After this: unimplemented features

Follow-up items for behavior tied to the added optional parameters and to Section 3 (param count differs). Signature alignment is done first; these can be implemented later.

From Section 2 (optional params)

[x] assert_true(col, errMsg): Use errMsg in the error message when assertion fails.
[x] ilike(str, pattern, escapeChar): Implement escape-character semantics when escapeChar is provided.
[x] like(str, pattern, escapeChar): Same as ilike for escapeChar.
[x] make_timestamp(..., timezone): Use timezone when constructing timestamp (e.g. timezone-aware result).
[x] months_between(date1, date2, roundOff): Use roundOff to control rounding of the result.
[x] parse_url(..., key): Use key when extracting a specific query parameter (or similar).
[x] position(substr, str, start): Use start as the 1-based start position for search.
[x] to_char(col, format): Use format for datetime formatting (PySpark-style mapped to chrono strftime).
[x] to_number(col, format): Signature accepts format; reserved for future format-based parsing.
[x] to_timestamp(col, format): Use format for string→timestamp parsing (PySpark-style format).
[x] to_varchar(col, format): Same as to_char.
[x] try_to_number(col, format): Signature accepts format; reserved for future.
[x] try_to_timestamp(col, format): Use format for string→timestamp parsing; null on invalid.
[x] when(condition, value): Two-arg form returns value where condition is true, null otherwise (single-branch when).
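To make the escapeChar item above concrete, here is a rough sketch of escape-character semantics for `like`/`ilike`, written as a translation from a SQL LIKE pattern to a Python regular expression. The helper names are hypothetical, the backslash default is an assumption, and a trailing escape character is simply matched literally here; the actual robin-sparkless behavior may differ.

```python
import re


def like_to_regex(pattern, escapeChar="\\"):
    """Translate a SQL LIKE pattern into a regular expression string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escapeChar and i + 1 < len(pattern):
            # The character after the escape is matched literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            out.append(".*")   # % matches any sequence of characters
        elif ch == "_":
            out.append(".")    # _ matches exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


def like_matches(value, pattern, escapeChar="\\", ignore_case=False):
    """Return True if value matches the LIKE pattern; ignore_case models ilike."""
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    return re.fullmatch(like_to_regex(pattern, escapeChar), value, flags) is not None


print(like_matches("100%", "100/%", escapeChar="/"))   # True
print(like_matches("1000", "100/%", escapeChar="/"))   # False
print(like_matches("ABC", "a_c", ignore_case=True))    # True
```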

From Section 3 (param count differs)

[x] arrays_zip: Documented: we support two columns. PySpark is variadic (*cols); use the two-arg form or chaining.
[x] bit_and / bit_or / bit_xor: Documented: we provide element-wise two-column semantics, whereas PySpark's versions are single-column aggregates. Pass two columns for element-wise results.
[x] elt: Documented: we have elt(index, columns), where columns is a list of columns; this is equivalent to PySpark's variadic *inputs, with the columns passed as a list.
[x] json_array_length: Documented: we have (column, path=None); path is optional, so a call with only the column matches PySpark.
[x] map_concat: Documented: we support two map columns. PySpark is variadic (*cols); use the two-arg form or chaining.
[x] named_struct: Documented: we have (names: Vec<String>, columns: Vec<Column>) (parallel lists); equivalent to PySpark’s alternating name1, col1, name2, col2, ….
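The list-based shapes described for `elt` and `named_struct` can be reached from PySpark-style variadic arguments with a small adapter. The sketch below uses plain Python values in place of columns; `split_named_struct_args` and `split_elt_args` are hypothetical helpers shown only to illustrate the mapping.

```python
def split_named_struct_args(*cols):
    """Split alternating (name1, col1, name2, col2, ...) into parallel lists."""
    if len(cols) % 2 != 0:
        raise ValueError("expected name/column pairs (an even number of arguments)")
    names = list(cols[0::2])
    columns = list(cols[1::2])
    return names, columns


def split_elt_args(*inputs):
    """Split variadic (index, col1, col2, ...) into (index, [col1, col2, ...])."""
    if len(inputs) < 2:
        raise ValueError("expected an index followed by at least one column")
    return inputs[0], list(inputs[1:])


print(split_named_struct_args("a", "x", "b", "y"))  # (['a', 'b'], ['x', 'y'])
print(split_elt_args(1, "c1", "c2"))                # (1, ['c1', 'c2'])
```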