Phase H: Deferred / Optional Scope

This document lists APIs and features that are deferred, only partially implemented, or explicitly out of scope for full robin-sparkless parity. They are documented so users know the boundaries and can choose appropriate workarounds.

XML / XPath

APIs	Status	Rationale
`from_xml`, `to_xml`, `schema_of_xml`, `xpath`, `xpath_boolean`, `xpath_double`, etc.	Not implemented	Would require an XML parser and feature flag.

Workaround: Preprocess XML externally or use JSON instead (from_json, to_json, get_json_object are supported).
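One way to preprocess XML externally is to flatten each record into a JSON string before loading it, so the supported JSON functions can take over from there. The sketch below uses only Python's standard library; the `<order>` layout and the `xml_records_to_json` helper are hypothetical and exist only for illustration.

```python
import json
import xml.etree.ElementTree as ET


def xml_records_to_json(xml_text, record_tag):
    """Convert each <record_tag> element into one JSON string.

    Attributes and direct child elements become top-level keys. Deeper
    nesting is not handled in this sketch.
    """
    root = ET.fromstring(xml_text.strip())
    lines = []
    for elem in root.iter(record_tag):
        record = dict(elem.attrib)
        for child in elem:
            record[child.tag] = (child.text or "").strip()
        lines.append(json.dumps(record, sort_keys=True))
    return lines


# Hypothetical input document, used only to exercise the helper.
sample = """
<orders>
  <order id="1"><item>apple</item><qty>3</qty></order>
  <order id="2"><item>pear</item><qty>5</qty></order>
</orders>
"""

json_lines = xml_records_to_json(sample, "order")
```

All values come out as strings here; any casting to numeric types would happen after the JSON is loaded.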

Tracking: GitHub issue #146

UDF / UDTF

APIs	Status	Rationale
`spark.udf.register`, `call_udf`, `udf()` (decorator), Rust `register_udf`	Implemented	Scalar UDFs: Python UDFs (row-at-a-time) and Rust UDFs, plus column-wise vectorized Python UDFs via `vectorized=True`. Session-scoped registry; visible to DataFrame API, SQL, and plan interpreter.
`pandas_udf` (vectorized)	Partially implemented	Minimal support for grouped aggregation UDFs via `pandas_udf(..., function_type="grouped_agg")` (one scalar per group, used only in `groupBy().agg(...)`). Other pandas_udf function types (scalar, map, grouped map, etc.) remain deferred; would require broader Polars↔Pandas batching and plan integration.
`udtf` (table functions)	Not implemented	Returns multiple rows per input; out of scope.

Implemented: Python: spark.udf().register(name, f, return_type=None, vectorized=False|True); call_udf(name, *cols); my_udf(col("x")) via returned UserDefinedFunction; grouped vectorized aggregations via pandas_udf(f, return_type, function_type="grouped_agg") used in group_by().agg([...]). Rust: session.register_udf(name, |cols| ...). SQL: SELECT my_udf(col) FROM t. Plan: {"udf": "name", "args": [...]} or {"fn": "call_udf", "args": [{"lit": "name"}, ...]}.

Limitations: Python UDFs run eagerly (the DataFrame is materialized at the UDF boundary). Column-wise vectorized UDFs are supported only in withColumn / select paths. Grouped pandas UDFs are restricted to groupBy().agg(...), cannot be mixed with built-ins in a single agg call, and are not available in SQL or other contexts. Python UDFs in WHERE/HAVING clauses are not yet supported. See docs/UDF_GUIDE.md.

Tracking: GitHub issue #143 (udtf and the other pandas_udf function types remain deferred)

Streaming

APIs	Status	Rationale
`withWatermark`, `session_window`, `isStreaming`	No-op / stub	Robin-sparkless has no streaming execution model; DataFrame is lazy until actions run.

Current behavior: isStreaming always returns False; withWatermark is a no-op that returns the DataFrame unchanged.

Workaround: Use batch processing. For streaming-like workflows, process data in batches and write incrementally.
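The batch-and-append pattern can be sketched in plain Python as follows. This is only an illustration of the control flow: the `iter_batches` and `append_batch` helpers, the event records, and the JSON Lines output file are all assumptions, not part of robin-sparkless.

```python
import itertools
import json
import os
import tempfile


def iter_batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def append_batch(path, batch):
    """Append one batch to a JSON Lines file (the incremental write)."""
    with open(path, "a", encoding="utf-8") as f:
        for record in batch:
            f.write(json.dumps(record) + "\n")


# Stand-in for an unbounded source; a real job might poll a queue or a directory.
events = ({"event_id": i, "value": i * 10} for i in range(7))

out_path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
batch_sizes = []
for batch in iter_batches(events, batch_size=3):
    # In a real pipeline each batch would become a DataFrame, be transformed,
    # and then be written in append mode instead of going through this helper.
    append_batch(out_path, batch)
    batch_sizes.append(len(batch))
```

Because each batch is handled independently, any state that must span batches (running totals, deduplication keys) has to be carried explicitly by the driver loop.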

Tracking: GitHub issue #145

Sketch aggregates

APIs	Status	Rationale
HyperLogLog (HLL), count-min sketch, `hll_sketch_agg`, `hll_sketch_estimate`, etc.	Not implemented	Optional approximate aggregates; low priority for initial parity.

Workaround: Use exact aggregates (count, count_distinct) or approx_count_distinct where supported.

Tracking: GitHub issue #147

RDD / distributed

APIs	Status	Rationale
`rdd`, `foreach`, `foreachPartition`, `mapInPandas`, `mapPartitions`	Stub (raise `NotImplementedError`)	Robin-sparkless is single-process; no RDD or distributed execution. DataFrame is lazy (PySpark-like).
`DataFrame.rdd.flatMap` and other RDD transformations	Not implemented	RDD API is out of scope; use DataFrame/DSL (e.g. `explode`, `array_contains`) instead.

Workaround: Use collect(), toLocalIterator(), or to_pandas() for local access. For row-wise processing, materialize with collect() and iterate in Python/Rust. For flatMap-like behavior, use explode on array columns.
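To make the flatMap-to-explode correspondence concrete, the sketch below reproduces the row-level effect of explode on already-collected data in plain Python. Plain dicts stand in for collected rows (an assumption about their shape), and `explode_rows` is a hypothetical helper, not a robin-sparkless function.

```python
def explode_rows(rows, column):
    """Emit one output row per element of row[column].

    Rows whose array is empty or None produce no output, which matches
    the behavior of explode (as opposed to explode_outer) in PySpark.
    """
    out = []
    for row in rows:
        for element in row[column] or []:
            new_row = dict(row)          # copy so the input row is untouched
            new_row[column] = element    # replace the array with one element
            out.append(new_row)
    return out


# Stand-in for rows materialized locally.
collected = [
    {"user": "a", "tags": ["x", "y"]},
    {"user": "b", "tags": []},
    {"user": "c", "tags": ["z"]},
]

exploded = explode_rows(collected, "tags")
```

Where the data fits the DataFrame API, calling explode on the array column keeps the work inside the engine and avoids materializing rows in Python at all.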

Tracking: GitHub issues #142 and #848 (RDD flatMap)

VARIANT (PySpark 4 semi-structured)

APIs	Status	Rationale
`VariantType`, `parse_json()`, `cast(..., "variant")`	Implemented	JSON string storage; `collect()` deserializes to dict/list when schema uses `VariantType`.
`variant_get`, `schema_of_variant`, pipe syntax, collations, UDF `VariantVal`	Not implemented	No Polars native VARIANT; requires JVM semi-structured engine.

Workaround: Use parse_json + Python after collect() for nested access; use from_json when a fixed struct schema is known.
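Since collected VARIANT values arrive as ordinary dicts and lists, nested access after collect() is plain Python. The sketch below shows one possible helper for path lookups with a default; `get_path` is hypothetical, and `json.loads` merely stands in for a value that has already been collected.

```python
import json


def get_path(value, path, default=None):
    """Walk a sequence of dict keys / list indexes, returning default on a miss."""
    current = value
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Stand-in for one collected VARIANT cell.
payload = json.loads('{"user": {"name": "ada", "langs": ["rust", "python"]}}')

name = get_path(payload, ["user", "name"])
second_lang = get_path(payload, ["user", "langs", 1])
missing = get_path(payload, ["user", "email"], default="n/a")
```

When the structure is fixed and known in advance, from_json with an explicit struct schema keeps the nested access inside the DataFrame instead.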

Catalog DDL

APIs	Status	Rationale
`CREATE TABLE`, `CREATE DATABASE`, `DROP TABLE`, `INSERT INTO`, `writeTo`	Not implemented or stub	No persistent catalog; use file-based storage.

Workaround: Use df.write().format("parquet"|"csv"|"json").mode("overwrite"|"append").save(path) to write to paths. Temp views (createOrReplaceTempView, spark.read.table) are supported for in-session tables.

Tracking: GitHub issue #144

Sparkless adapter "not implemented" codes (#764, #765)

When the Sparkless adapter reports "2 is not implemented" or "4 is not implemented for the Robin backend", the number is an adapter-specific enum or method id. Robin does not define or interpret these codes. Fix or extend the Sparkless adapter to map the requested operation to a Robin plan/API that is supported (e.g. window functions, group_by with expr). See PYSPARK_DIFFERENCES.md for supported operations.

JVM / runtime stubs: broadcast, spark_partition_id, input_file_name, monotonically_increasing_id, current_catalog, current_user — implemented as stubs for API compatibility; see PYSPARK_DIFFERENCES.md.
sentences / NLP: Deferred; could be implemented as a string split that returns a list of lists (#148).