What Robin-Sparkless Does NOT Have (vs Sparkless 3.28.0)
This list covers only items that Sparkless has and robin-sparkless lacks (or provides only as a stub or no-op). Items with an implemented equivalent under a different name (e.g. signum for sign, avg for mean, trunc for date_trunc) are not listed as missing.
PySpark parity scope: All items below are PySpark APIs (or direct Sparkless equivalents of PySpark), unless marked as Sparkless-specific. PySpark reference: pyspark.sql (SparkSession, DataFrame, Catalog, functions), pyspark.sql.functions, and RDD / SparkContext. Tracking these as unimplemented is for PySpark parity; implementing them would align robin-sparkless with standard PySpark behavior.
For a consolidated view of deferred/optional scope (XML, UDF, streaming, RDD, sketch, Catalog DDL), see DEFERRED_SCOPE.md.
Functions (sparkless.sql.functions): missing or stub only
Crypto / binary
- ~~`aes_decrypt`, `aes_encrypt`, `try_aes_decrypt`~~: implemented (AES-128-GCM; see PYSPARK_DIFFERENCES)
- ~~`to_binary`, `try_to_binary`~~: implemented
- ~~`decode`, `encode`~~: implemented (UTF-8, hex)
- ~~`octet_length`, `char_length`, `character_length`~~: implemented
Approx / aggregates
- ~~`approx_percentile`~~: implemented
- ~~`percentile_approx`~~: implemented
- ~~`covar_pop`, `covar_samp`, `corr` as groupBy aggregations~~: implemented
- ~~`kurtosis`, `skewness`~~: implemented
Array
- ~~`aggregate` (array fold)~~: implemented (simplified: zero + sum(list))
- ~~`cardinality`~~: implemented (alias for `size`)
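The note on `aggregate` says the current implementation is simplified to zero + sum(list). The pure-Python sketch below contrasts that with a general left fold using an arbitrary merge function, which is what PySpark's `aggregate(col, initialValue, merge)` computes. The names `simplified_aggregate` and `general_aggregate` are hypothetical and are not part of robin-sparkless or PySpark; they only illustrate where results could diverge.

```python
from functools import reduce


def simplified_aggregate(values, zero):
    """Hypothetical model of the simplified behavior: zero + sum(list)."""
    return zero + sum(values)


def general_aggregate(values, zero, merge):
    """Hypothetical model of a full array fold with a caller-supplied merge."""
    return reduce(merge, values, zero)


xs = [1, 2, 3, 4]

# With an additive merge, both models agree.
additive_same = simplified_aggregate(xs, 10) == general_aggregate(
    xs, 10, lambda acc, x: acc + x
)

# With a non-additive merge (a product here), they differ.
product = general_aggregate(xs, 1, lambda acc, x: acc * x)
simplified = simplified_aggregate(xs, 1)
```

Under this reading, code that passes only additive merge functions should behave the same, while any other merge function is worth checking against PYSPARK_DIFFERENCES.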
Bitmap (PySpark 3.5+)
- ~~`bitmap_bit_position`, `bitmap_bucket_number`, `bitmap_construct_agg`, `bitmap_count`, `bitmap_or_agg`~~: implemented
Datetime / interval
- ~~`make_dt_interval`, `make_ym_interval`~~: implemented
- ~~`to_timestamp_ltz`, `to_timestamp_ntz`~~: implemented (we have `to_timestamp`)
JSON / XML / CSV
- ~~`json_object_keys`, `json_tuple`~~: implemented
- `from_xml`, `to_xml`, `schema_of_xml` (PySpark 3.4+/4.0+; optional/deferred)
- ~~`from_csv`, `to_csv`~~: implemented (minimal)
- ~~`schema_of_csv`, `schema_of_json`~~: implemented (stub: returns a literal schema string)
Misc / UDF / JVM
- `call_function` (stub: not supported; Sparkless-specific. PySpark equivalent: register a UDF with `spark.udf.register`, then use the name in `expr()` or SQL)
- ~~`grouping`, `grouping_id`~~: implemented (stub: returns 0)
- ~~`inline`, `inline_outer`~~: implemented (explode list of structs; use unnest for struct fields)
- `sentences` (optional/deferred): PySpark `pyspark.sql.functions.sentences`, an NLP function that maps a string to an array of arrays of words; implement only if prioritized.
- ~~`sequence`~~: implemented (generates an array of numbers)
- ~~`sha`~~: we have `sha1`, `sha2`
- ~~`shuffle`~~: implemented
- `window`, `window_time` (PySpark; we have `.over()`; thin wrappers if needed; see PYSPARK_DIFFERENCES)
- `pandas_udf` (PySpark `pandas_udf` decorator for scalar, grouped map, and other function types; robin-sparkless currently supports only a minimal grouped aggregation variant via `pandas_udf(..., function_type="grouped_agg")` on the Python side)
- `count_min_sketch`, `histogram_numeric`, `hll_sketch_agg`, `hll_sketch_estimate`, `hll_union`, `hll_union_agg` (PySpark 3.5+; stub/defer)
- `session_window` (PySpark Structured Streaming; stub: no streaming)
- `call_udf`, `udtf`, `reduce`, `reflect`, `java_method` (PySpark JVM/UDTF APIs; stub: not supported)
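To make the `inline` / `inline_outer` entry concrete, here is a small pure-Python model of the PySpark semantics: `inline` expands each struct in an array into its own row and drops rows whose array is null or empty, while `inline_outer` emits one all-null row in that case. The helper names `inline_sketch` and `inline_outer_sketch` and the dict-based row representation are assumptions for illustration; this is not the robin-sparkless implementation.

```python
def inline_sketch(rows, array_col):
    """Expand each struct (a dict) in rows[array_col] into its own output row.

    Rows whose array is None or empty contribute nothing.
    """
    out = []
    for row in rows:
        for struct in row.get(array_col) or []:
            out.append(dict(struct))
    return out


def inline_outer_sketch(rows, array_col, field_names):
    """Like inline_sketch, but a None/empty array yields one all-None row."""
    out = []
    for row in rows:
        structs = row.get(array_col) or []
        if not structs:
            out.append({name: None for name in field_names})
        for struct in structs:
            out.append(dict(struct))
    return out


rows = [
    {"id": 1, "items": [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]},
    {"id": 2, "items": []},
    {"id": 3, "items": None},
]

inner = inline_sketch(rows, "items")
outer = inline_outer_sketch(rows, "items", ["a", "b"])
```

In this model `inner` holds two rows and `outer` holds four, the extra two being the all-null rows for the empty and null arrays.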
Regression
- ~~`regr_avgx`, `regr_avgy`, `regr_count`, `regr_intercept`, `regr_r2`, `regr_slope`, `regr_sxx`, `regr_sxy`, `regr_syy`~~: implemented
XPath (deferred)
- XPath (deferred): PySpark `xpath`, `xpath_boolean`, `xpath_double`, etc. (3.5+) require XML support; stub or defer (see PYSPARK_DIFFERENCES).
Aliases we don't expose (Sparkless has them; we have an equivalent under a different name)
- ~~`sign`~~: alias of `signum`
- ~~`std`~~: alias of `stddev`
- ~~`mean`~~: alias of `avg`
- ~~`date_trunc`~~: alias of `trunc`
- ~~`regexp`~~: alias of `rlike`
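The alias pairs above can be captured as a simple lookup when porting code between the two libraries. The sketch below is a hypothetical helper (`ALIAS_TO_EQUIVALENT` and `resolve_function_name` are not part of either library) that only translates names; it does not adapt arguments. In PySpark, for instance, `date_trunc` takes the format first while `trunc` takes the date first, so a name swap alone may not be enough.

```python
# Alias -> equivalent name, taken from the list above.
ALIAS_TO_EQUIVALENT = {
    "sign": "signum",
    "std": "stddev",
    "mean": "avg",
    "date_trunc": "trunc",
    "regexp": "rlike",
}


def resolve_function_name(name):
    """Return the equivalent name for a known alias, else the name unchanged."""
    return ALIAS_TO_EQUIVALENT.get(name, name)


resolved = [resolve_function_name(n) for n in ("mean", "std", "signum", "upper")]
```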
DataFrame methods: missing or no-op only
- `corr`: we have `df.stat().corr(col1, col2)` (scalar) and `df.corr()` (correlation matrix DataFrame).
- ~~`createGlobalTempView`, `createOrReplaceGlobalTempView`~~: implemented (stub: same catalog as temp view).
- `createTempView`, `createOrReplaceTempView`: we expose `create_or_replace_temp_view` (SQL feature).
- ~~`cube`, `rollup`~~: implemented (multiple grouping sets, then union).
- ~~`data`~~: implemented (best-effort: same as `collect()`, list of row dicts).
- ~~`dtypes`~~: implemented (returns list of (name, dtype_string)).
- `foreach`, `foreachPartition`: PySpark DataFrame; stub: raises `NotImplementedError` (see PYSPARK_DIFFERENCES).
- `mapInPandas`, `mapPartitions`: PySpark DataFrame; stub: raises `NotImplementedError`.
- `rdd`: PySpark `DataFrame.rdd`; stub: raises `NotImplementedError` (use `collect()` or `toLocalIterator()` for local data).
- `registerTempTable`: legacy; we have `create_or_replace_temp_view`.
- ~~`repartitionByRange`~~: implemented (no-op).
- `sameSemantics`, `semanticHash`: we have no-op stubs that return fixed values.
- ~~`sortWithinPartitions`~~: implemented (no-op).
- `storageLevel`: PySpark; stub: returns `None` (DataFrame is lazy; no storage level).
- `to`: PySpark generic writer; we have `write_delta` with the delta feature.
- ~~`toLocalIterator`~~: implemented (best-effort: same as `collect()`, iterable of rows).
- ~~`unpersist`~~: we have it (no-op).
- `withWatermark`: PySpark Structured Streaming; no-op stub (streaming not supported).
- ~~`write`~~: implemented (generic write: parquet/csv/json, mode overwrite/append).
- `writeTo`: PySpark `DataFrameWriterV2` / catalog table API; stub or use write to path.
- `isStreaming`: PySpark; stub: always returns `False`.
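The `cube` / `rollup` entry describes the approach as "multiple grouping sets then union". A minimal pure-Python sketch of that idea follows, assuming rows are dicts and the aggregate is a sum. The helper names (`rollup_grouping_sets`, `cube_grouping_sets`, `aggregate_grouping_sets`) are hypothetical and do not reflect the actual robin-sparkless code.

```python
from itertools import combinations


def rollup_grouping_sets(cols):
    """Prefixes of cols, longest first: (a, b) -> [(a, b), (a,), ()]."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def cube_grouping_sets(cols):
    """All subsets of cols, largest first."""
    return [s for r in range(len(cols), -1, -1) for s in combinations(cols, r)]


def aggregate_grouping_sets(rows, cols, grouping_sets, value_col):
    """Sum value_col once per grouping set, then union the results.

    Columns that are not part of the current grouping set are set to None.
    """
    out = []
    for gset in grouping_sets:
        totals = {}
        for row in rows:
            key = tuple(row[c] if c in gset else None for c in cols)
            totals[key] = totals.get(key, 0) + row[value_col]
        for key, total in totals.items():
            result = dict(zip(cols, key))
            result[value_col] = total
            out.append(result)
    return out


rows = [
    {"a": "x", "b": 1, "v": 10},
    {"a": "x", "b": 2, "v": 5},
    {"a": "y", "b": 1, "v": 7},
]
cols = ["a", "b"]

rollup_rows = aggregate_grouping_sets(rows, cols, rollup_grouping_sets(cols), "v")
cube_rows = aggregate_grouping_sets(rows, cols, cube_grouping_sets(cols), "v")
```

For two grouping columns, rollup produces three grouping sets and cube produces four; the only difference in this example is the extra `("b",)` set that cube adds.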
Summary counts (approximate)
| Category | Sparkless | Robin-Sparkless missing (approx) |
|---|---|---|
| Functions | 417 | ~55–60 (crypto, XML, regr, bitmap, CSV/JSON schema, UDF/UDTF, etc.) |
| DataFrame methods | 95 | ~20–25 (cube/rollup, RDD/foreach, writeTo, streaming, etc.) |
For full “we have it” lists see GAP_ANALYSIS_SPARKLESS_3.28.md and PARITY_CHECK_SPARKLESS_3.28.md.