Gap Analysis: Robin-Sparkless vs PySpark (from source)¶
This document compares robin-sparkless with Apache PySpark using API surface extracted directly from the PySpark source repository.
Method¶
- PySpark API: Extracted from Apache Spark repo via
scripts/extract_pyspark_api_from_repo.py(AST parsing ofpyspark.sqlsources). - PySpark version/branch: 18 (branch/tag: v3.5.0)
- Robin-sparkless API: From signatures_robin_sparkless.json (introspection)
- Scope:
pyspark.sql(functions, DataFrame, Column, GroupedData, SparkSession, Reader, Writer, Window).
Summary¶
Functions (pyspark.sql.functions)¶
| Classification | Count | Description |
|---|---|---|
| exact | 214 | Same parameter names, order, and defaults |
| compatible | 0 | Same params/defaults; types may differ |
| partial | 30 | Different param names or counts |
| missing | 171 | In PySpark but not in robin-sparkless |
| extra | 13 | In robin-sparkless only (extensions) |
- PySpark functions: 415
- Robin-sparkless functions: 257
Class methods¶
| Class | Exact | Partial | Missing | Extra |
|---|---|---|---|---|
| SparkSession | 1 | 2 | 25 | 9 |
| DataFrame | 16 | 31 | 52 | 13 |
| Column | 0 | 6 | 9 | 131 |
| GroupedData | 2 | 4 | 2 | 24 |
| DataFrameReader | 0 | 0 | 12 | 0 |
| DataFrameWriter | 0 | 0 | 16 | 0 |
| Window | 0 | 0 | 4 | 0 |
| Catalog | 0 | 0 | 27 | 0 |
Function details (sample)¶
Exact match¶
abs(col)acos(col)acosh(col)add_months(start, months)array(cols)array_agg(col)array_append(col, value)array_compact(col)array_contains(col, value)array_distinct(col)array_except(col1, col2)array_insert(arr, pos, value)array_intersect(col1, col2)array_max(col)array_min(col)array_position(col, value)array_prepend(col, value)array_size(col)array_union(col1, col2)arrays_overlap(a1, a2)asc(col)asc_nulls_first(col)asc_nulls_last(col)ascii(col)asin(col)asinh(col)assert_true(col, errMsg='None')atan(col)atan2(col1, col2)atanh(col)- ... and 184 more
Partial (param mismatch)¶
| PySpark | Robin |
|---|---|
aggregate(col, initialValue, merge, finish='None') |
aggregate(col, zero) |
array_join(col, delimiter, null_replacement='None') |
array_join(col, delimiter) |
array_sort(col, comparator='None') |
array_sort(col) |
arrays_zip(cols) |
arrays_zip(col1, col2) |
bit_and(col) |
bit_and(col1, col2) |
bit_or(col) |
bit_or(col1, col2) |
bit_xor(col) |
bit_xor(col1, col2) |
char_length(str) |
char_length(col) |
character_length(str) |
character_length(col) |
date_add(start, days) |
date_add(col, days) |
date_format(date, format) |
date_format(col, format) |
date_sub(start, days) |
date_sub(col, days) |
date_trunc(format, timestamp) |
date_trunc(format, col) |
elt(inputs) |
elt(index, cols) |
from_csv(col, schema, options='None') |
from_csv(col) |
from_unixtime(timestamp, format="'yyyy-MM-dd HH:mm:ss'") |
from_unixtime(timestamp, format) |
json_array_length(col) |
json_array_length(col, path) |
log(col) |
log(col, base) |
map_concat(cols) |
map_concat(col1, col2) |
named_struct(cols) |
named_struct(names, columns) |
overlay(src, replace, pos, len='-1') |
overlay(src, replace, pos, len='Ellipsis') |
regexp_extract_all(str, regexp, idx='None') |
regexp_extract_all(str, regexp, idx=0) |
schema_of_csv(csv, options='None') |
schema_of_csv(col) |
schema_of_json(json, options='None') |
schema_of_json(col) |
split(str, pattern, limit='-1') |
split(src, delimiter) |
| ... and 5 more |
Missing (PySpark only)¶
aes_decrypt(input, key, mode='None', padding='None', aad='None')aes_encrypt(input, key, mode='None', padding='None', iv='None', aad='None')any_value(col, ignoreNulls='None')approx_count_distinct(col, rsd='None')approx_percentile(col, percentage, accuracy='10000')array_remove(col, element)array_repeat(col, count)bitmap_bit_position(col)bitmap_bucket_number(col)bitmap_construct_agg(col)bitmap_count(col)bitmap_or_agg(col)bool_and(col)bool_or(col)bucket(numBuckets, col)call_function(funcName, cols)call_udf(udfName, cols)collect_list(col)collect_set(col)concat(cols)concat_ws(sep, cols)corr(col1, col2)count_distinct(col, cols)count_if(col)count_min_sketch(col, eps, confidence, seed)covar_pop(col1, col2)covar_samp(col1, col2)crc32(col)cume_dist()datediff(end, start)decode(col, charset)dense_rank()element_at(col, extraction)encode(col, charset)every(col)exists(col, f)exp(col)explode(col)expr(str)filter(col, f)first(col, ignorenulls='False')first_value(col, ignoreNulls='None')flatten(col)floor(col)forall(col, f)from_json(col, schema, options='None')grouping(col)grouping_id(cols)histogram_numeric(col, nBins)hll_sketch_agg(col, lgConfigK='None')- ... and 121 more
Extra (robin-sparkless only)¶
bitwiseNOT(col)cast(col, type_name)chr(col)dayname(col)isin(col, other)minutes(n)negate(col)power(col1, col2)shiftLeft(col, numBits)shiftRight(col, numBits)timestampadd(unit, amount, ts)timestampdiff(unit, start, end)try_cast(col, type_name)
Semantic annotations¶
Items tagged from docs/gap_annotations.json and PYSPARK_DIFFERENCES.md:
stub (no-op or placeholder):
broadcastcurrent_catalogcurrent_databasecurrent_schemacurrent_usergroupinggrouping_idinput_file_namemonotonically_increasing_idspark_partition_iduserisStreamingis_streamingpersiststorageLevelstorage_levelunpersistwithWatermarkwith_watermarkcurrent_catalogcurrent_database
diverges (behavior differs from PySpark):
aes_decryptaes_encryptassert_truefrom_unixtimefrom_utc_timestampraise_errorrandrandnto_utc_timestamptry_aes_decryptunix_timestampfrom_unixtimeunix_timestamp
deferred (out of scope):
call_udfcount_min_sketchhistogram_numerichll_sketch_agghll_sketch_estimatehll_unionhll_union_aggsentencessession_windowudfudtfxpathxpath_booleanxpath_doublexpath_floatxpath_intxpath_longxpath_numberxpath_shortxpath_stringudfudtfforeachforeachPartitionforeach_partition
Parity fixture coverage: see PARITY_STATUS.md.