Gap Analysis: Robin-Sparkless vs PySpark (from source)

This document compares robin-sparkless with Apache PySpark using API surface extracted directly from the PySpark source repository.

Method

  • PySpark API: Extracted from Apache Spark repo via scripts/extract_pyspark_api_from_repo.py (AST parsing of pyspark.sql sources).
  • PySpark version/branch: 18 (branch/tag: v3.5.0)
  • Robin-sparkless API: From signatures_robin_sparkless.json (introspection)
  • Scope: pyspark.sql (functions, DataFrame, Column, GroupedData, SparkSession, Reader, Writer, Window).
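
The AST-based extraction step can be illustrated with a minimal sketch. This is not the code in scripts/extract_pyspark_api_from_repo.py; the function name `extract_signatures`, the `SAMPLE` source, and the choice to keep defaults as source text are all assumptions made for illustration. Keeping defaults as text would be consistent with the quoted defaults (for example `errMsg='None'`) that appear in the lists below, but that connection is a guess.

```python
import ast
from typing import Dict, List, Optional, Tuple

Param = Tuple[str, Optional[str]]  # (name, default as source text, or None)


def extract_signatures(source: str) -> Dict[str, List[Param]]:
    """Collect public top-level function signatures from Python source text."""
    signatures: Dict[str, List[Param]] = {}
    for node in ast.parse(source).body:
        # Skip non-functions and private helpers (hypothetical filtering rule).
        if not isinstance(node, ast.FunctionDef) or node.name.startswith("_"):
            continue
        args = node.args
        positional = args.posonlyargs + args.args
        # Defaults belong to the last len(defaults) positional parameters.
        padded = [None] * (len(positional) - len(args.defaults)) + list(args.defaults)
        params: List[Param] = [
            (a.arg, ast.unparse(d) if d is not None else None)
            for a, d in zip(positional, padded)
        ]
        if args.vararg is not None:
            params.append(("*" + args.vararg.arg, None))
        for a, d in zip(args.kwonlyargs, args.kw_defaults):
            params.append((a.arg, ast.unparse(d) if d is not None else None))
        signatures[node.name] = params
    return signatures


# Toy stand-in for a pyspark.sql source file; not the real module.
SAMPLE = '''
def add_months(start, months):
    pass

def assert_true(col, errMsg=None):
    pass

def _internal_helper(x):
    pass
'''

sigs = extract_signatures(SAMPLE)
print(sigs["assert_true"])
```

Parsing the source rather than importing it means the extraction does not need a working Spark installation, which is presumably why the report is described as "from source". `ast.unparse` requires Python 3.9 or later.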

Summary

Functions (pyspark.sql.functions)

| Classification | Count | Description |
|---|---|---|
| exact | 214 | Same parameter names, order, and defaults |
| compatible | 0 | Same params/defaults; types may differ |
| partial | 30 | Different param names or counts |
| missing | 171 | In PySpark but not in robin-sparkless |
| extra | 13 | In robin-sparkless only (extensions) |

  • PySpark functions: 415
  • Robin-sparkless functions: 257
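
The classification rules can be read as a small decision procedure. The sketch below is one interpretation, not the logic of scripts/gap_analysis_pyspark_repo.py: the `classify` function and the `(name, default, type)` tuple layout are hypothetical, and it assumes that "exact" additionally requires matching type annotations (which is what would separate it from "compatible"). It treats a differing default as "partial", in line with entries such as `overlay` in the Partial list. The last lines check that the category counts add up to the two reported totals.

```python
def classify(pyspark_sig, robin_sig):
    """Classify one name given its two signatures (None means absent).

    Each signature is a list of (name, default, type) tuples; this layout
    is an assumption made for illustration.
    """
    if robin_sig is None:
        return "missing"   # in PySpark only
    if pyspark_sig is None:
        return "extra"     # in robin-sparkless only
    py_shape = [(name, default) for name, default, _ in pyspark_sig]
    rb_shape = [(name, default) for name, default, _ in robin_sig]
    if py_shape != rb_shape:
        return "partial"   # names, counts, order, or defaults differ
    if pyspark_sig == robin_sig:
        return "exact"     # assumed: annotations match as well
    return "compatible"    # same params/defaults, types differ


# log(col) in PySpark vs log(col, base) in robin-sparkless (types are placeholders).
log_kind = classify(
    [("col", None, "Column")],
    [("col", None, "Column"), ("base", None, "float")],
)

# Consistency check on the reported counts.
counts = {"exact": 214, "compatible": 0, "partial": 30, "missing": 171, "extra": 13}
pyspark_total = sum(counts[k] for k in ("exact", "compatible", "partial", "missing"))
robin_total = sum(counts[k] for k in ("exact", "compatible", "partial", "extra"))
print(log_kind, pyspark_total, robin_total)
```

Every PySpark function falls into exactly one of exact, compatible, partial, or missing, and every robin-sparkless function into exactly one of exact, compatible, partial, or extra, so the two sums reproduce the totals of 415 and 257.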

Class methods

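| Class | Exact | Partial | Missing | Extra |
|---|---|---|---|---|
| SparkSession | 1 | 2 | 25 | 9 |
| DataFrame | 16 | 31 | 52 | 13 |
| Column | 0 | 6 | 9 | 131 |
| GroupedData | 2 | 4 | 2 | 24 |
| DataFrameReader | 0 | 0 | 12 | 0 |
| DataFrameWriter | 0 | 0 | 16 | 0 |
| Window | 0 | 0 | 4 | 0 |
| Catalog | 0 | 0 | 27 | 0 |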

Function details (sample)

Exact match

  • abs(col)
  • acos(col)
  • acosh(col)
  • add_months(start, months)
  • array(cols)
  • array_agg(col)
  • array_append(col, value)
  • array_compact(col)
  • array_contains(col, value)
  • array_distinct(col)
  • array_except(col1, col2)
  • array_insert(arr, pos, value)
  • array_intersect(col1, col2)
  • array_max(col)
  • array_min(col)
  • array_position(col, value)
  • array_prepend(col, value)
  • array_size(col)
  • array_union(col1, col2)
  • arrays_overlap(a1, a2)
  • asc(col)
  • asc_nulls_first(col)
  • asc_nulls_last(col)
  • ascii(col)
  • asin(col)
  • asinh(col)
  • assert_true(col, errMsg='None')
  • atan(col)
  • atan2(col1, col2)
  • atanh(col)
  • ... and 184 more

Partial (param mismatch)

| PySpark | Robin |
|---|---|
| aggregate(col, initialValue, merge, finish='None') | aggregate(col, zero) |
| array_join(col, delimiter, null_replacement='None') | array_join(col, delimiter) |
| array_sort(col, comparator='None') | array_sort(col) |
| arrays_zip(cols) | arrays_zip(col1, col2) |
| bit_and(col) | bit_and(col1, col2) |
| bit_or(col) | bit_or(col1, col2) |
| bit_xor(col) | bit_xor(col1, col2) |
| char_length(str) | char_length(col) |
| character_length(str) | character_length(col) |
| date_add(start, days) | date_add(col, days) |
| date_format(date, format) | date_format(col, format) |
| date_sub(start, days) | date_sub(col, days) |
| date_trunc(format, timestamp) | date_trunc(format, col) |
| elt(inputs) | elt(index, cols) |
| from_csv(col, schema, options='None') | from_csv(col) |
| from_unixtime(timestamp, format="'yyyy-MM-dd HH:mm:ss'") | from_unixtime(timestamp, format) |
| json_array_length(col) | json_array_length(col, path) |
| log(col) | log(col, base) |
| map_concat(cols) | map_concat(col1, col2) |
| named_struct(cols) | named_struct(names, columns) |
| overlay(src, replace, pos, len='-1') | overlay(src, replace, pos, len='Ellipsis') |
| regexp_extract_all(str, regexp, idx='None') | regexp_extract_all(str, regexp, idx=0) |
| schema_of_csv(csv, options='None') | schema_of_csv(col) |
| schema_of_json(json, options='None') | schema_of_json(col) |
| split(str, pattern, limit='-1') | split(src, delimiter) |

... and 5 more
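
To see what a "partial" entry means for calling code, the sketch below binds arguments against two signatures from the list above. The functions are empty stand-ins that copy only the parameter lists of `date_add` and `split` as reported here; they are not the real PySpark or robin-sparkless implementations, and the helper `accepts` is hypothetical.

```python
import inspect


# Stand-ins carrying only the robin-sparkless parameter lists from the table.
def robin_style_date_add(col, days): ...
def robin_style_split(src, delimiter): ...


def accepts(func, *args, **kwargs):
    """Return True if func's signature can bind the given call arguments."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
        return True
    except TypeError:
        return False


# A renamed parameter is invisible to positional callers...
positional_ok = accepts(robin_style_date_add, "order_date", 7)
# ...but breaks callers that use the PySpark keyword name `start`.
keyword_ok = accepts(robin_style_date_add, start="order_date", days=7)
# A dropped parameter (PySpark's `limit`) breaks three-argument calls.
limit_ok = accepts(robin_style_split, "csv_line", ",", 2)

print(positional_ok, keyword_ok, limit_ok)
```

Under these stand-ins, rename-only mismatches such as `date_add` affect only keyword-style calls, whereas count mismatches such as `split` or `from_csv` can reject calls that are valid in PySpark.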

Missing (PySpark only)

  • aes_decrypt(input, key, mode='None', padding='None', aad='None')
  • aes_encrypt(input, key, mode='None', padding='None', iv='None', aad='None')
  • any_value(col, ignoreNulls='None')
  • approx_count_distinct(col, rsd='None')
  • approx_percentile(col, percentage, accuracy='10000')
  • array_remove(col, element)
  • array_repeat(col, count)
  • bitmap_bit_position(col)
  • bitmap_bucket_number(col)
  • bitmap_construct_agg(col)
  • bitmap_count(col)
  • bitmap_or_agg(col)
  • bool_and(col)
  • bool_or(col)
  • bucket(numBuckets, col)
  • call_function(funcName, cols)
  • call_udf(udfName, cols)
  • collect_list(col)
  • collect_set(col)
  • concat(cols)
  • concat_ws(sep, cols)
  • corr(col1, col2)
  • count_distinct(col, cols)
  • count_if(col)
  • count_min_sketch(col, eps, confidence, seed)
  • covar_pop(col1, col2)
  • covar_samp(col1, col2)
  • crc32(col)
  • cume_dist()
  • datediff(end, start)
  • decode(col, charset)
  • dense_rank()
  • element_at(col, extraction)
  • encode(col, charset)
  • every(col)
  • exists(col, f)
  • exp(col)
  • explode(col)
  • expr(str)
  • filter(col, f)
  • first(col, ignorenulls='False')
  • first_value(col, ignoreNulls='None')
  • flatten(col)
  • floor(col)
  • forall(col, f)
  • from_json(col, schema, options='None')
  • grouping(col)
  • grouping_id(cols)
  • histogram_numeric(col, nBins)
  • hll_sketch_agg(col, lgConfigK='None')
  • ... and 121 more

Extra (robin-sparkless only)

  • bitwiseNOT(col)
  • cast(col, type_name)
  • chr(col)
  • dayname(col)
  • isin(col, other)
  • minutes(n)
  • negate(col)
  • power(col1, col2)
  • shiftLeft(col, numBits)
  • shiftRight(col, numBits)
  • timestampadd(unit, amount, ts)
  • timestampdiff(unit, start, end)
  • try_cast(col, type_name)

Semantic annotations

Items tagged from docs/gap_annotations.json and PYSPARK_DIFFERENCES.md:

stub (no-op or placeholder):

  • broadcast
  • current_catalog
  • current_database
  • current_schema
  • current_user
  • grouping
  • grouping_id
  • input_file_name
  • monotonically_increasing_id
  • spark_partition_id
  • user
  • isStreaming
  • is_streaming
  • persist
  • storageLevel
  • storage_level
  • unpersist
  • withWatermark
  • with_watermark

diverges (behavior differs from PySpark):

  • aes_decrypt
  • aes_encrypt
  • assert_true
  • from_unixtime
  • from_utc_timestamp
  • raise_error
  • rand
  • randn
  • to_utc_timestamp
  • try_aes_decrypt
  • unix_timestamp

deferred (out of scope):

  • call_udf
  • count_min_sketch
  • histogram_numeric
  • hll_sketch_agg
  • hll_sketch_estimate
  • hll_union
  • hll_union_agg
  • sentences
  • session_window
  • udf
  • udtf
  • xpath
  • xpath_boolean
  • xpath_double
  • xpath_float
  • xpath_int
  • xpath_long
  • xpath_number
  • xpath_short
  • xpath_string
  • foreach
  • foreachPartition
  • foreach_partition

Parity fixture coverage: see PARITY_STATUS.md.


Regeneration

python scripts/extract_pyspark_api_from_repo.py --clone --branch v3.5.0
python scripts/extract_robin_api_from_source.py  # or use existing signatures_robin_sparkless.json
python scripts/gap_analysis_pyspark_repo.py --write-md docs/GAP_ANALYSIS_PYSPARK_REPO.md