Sparkless → Robin-Sparkless Fixture Converter

The script tests/convert_sparkless_fixtures.py converts Sparkless expected_outputs JSON files to the robin-sparkless fixture format. Use it when you have the Sparkless repo or a copy of its tests/expected_outputs/ directory.

Obtaining Sparkless expected_outputs

Clone Sparkless (or add it as a git submodule or CI checkout) so that tests/expected_outputs/ is available. Set the SPARKLESS_EXPECTED_OUTPUTS environment variable to the path of that directory:

export SPARKLESS_EXPECTED_OUTPUTS=/path/to/sparkless/tests/expected_outputs

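For example, if Sparkless is cloned next to robin-sparkless, a path relative to the robin-sparkless repository root (where the commands below are run) works:

export SPARKLESS_EXPECTED_OUTPUTS=../sparkless/tests/expected_outputs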
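
Before running a batch conversion it can help to confirm that the variable points at a real directory. The following is a minimal Python sketch of such a check, not code taken from the converter: the helper names resolve_expected_outputs and list_json_files are hypothetical, and it assumes the JSON files sit directly inside the directory rather than in subdirectories.

```python
import os
from pathlib import Path


def resolve_expected_outputs(env=None):
    """Return the directory named by SPARKLESS_EXPECTED_OUTPUTS, or raise.

    Hypothetical helper for illustration; the converter may validate differently.
    """
    env = os.environ if env is None else env
    raw = env.get("SPARKLESS_EXPECTED_OUTPUTS")
    if not raw:
        raise RuntimeError("SPARKLESS_EXPECTED_OUTPUTS is not set")
    # Relative paths are resolved against the current working directory.
    path = Path(raw).expanduser().resolve()
    if not path.is_dir():
        raise RuntimeError(f"{path} is not a directory")
    return path


def list_json_files(directory):
    """Sorted *.json files directly inside the directory (assumption: no recursion)."""
    return sorted(Path(directory).glob("*.json"))
```

Calling resolve_expected_outputs() with no argument reads the real environment; passing a dict makes the check easy to exercise in isolation.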

Usage

# Convert a single Sparkless JSON file
python tests/convert_sparkless_fixtures.py /path/to/sparkless/tests/expected_outputs/some_test.json tests/fixtures

# Convert all JSON files in a directory
python tests/convert_sparkless_fixtures.py --batch "$SPARKLESS_EXPECTED_OUTPUTS" tests/fixtures --output-subdir converted
# Same, but skip duplicates of existing tests/fixtures/*.json
python tests/convert_sparkless_fixtures.py --batch "$SPARKLESS_EXPECTED_OUTPUTS" tests/fixtures --output-subdir converted --dedupe

Run parity (convert → regenerate expected from PySpark → run Rust parity):

export SPARKLESS_EXPECTED_OUTPUTS=/path/to/sparkless/tests/expected_outputs
make sparkless-parity

The sparkless-parity target: (1) runs the converter with --dedupe, (2) runs tests/regenerate_expected_from_pyspark.py on tests/fixtures/converted/ to overwrite each fixture’s expected with PySpark’s result, (3) runs cargo test pyspark_parity_fixtures. Ground truth for parity is PySpark.

PySpark for regeneration: regeneration requires PySpark (pip install pyspark) and Java 17 or newer. Set JAVA_HOME to a JDK 17+ installation (e.g. export JAVA_HOME=/path/to/jdk-17). If Java is missing or too old, the script prints a clear error and exits. The Makefile skips regeneration when the script fails, so the rest of the pipeline still runs.
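
To illustrate the Java 17+ requirement, here is a small Python sketch that extracts the major version from the text printed by java -version. It is an assumption about how such a check could look, not the check that tests/regenerate_expected_from_pyspark.py performs; the function names java_major_version and java_is_supported are hypothetical.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as 1.x (for example "1.8.0_301").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_supported(version_output, minimum=17):
    """True when the reported major version is at least `minimum`."""
    major = java_major_version(version_output)
    return major is not None and major >= minimum
```

Note that java -version writes to standard error, so a caller would typically pass in the stderr of subprocess.run(["java", "-version"], capture_output=True, text=True).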

Or run parity only (hand-written + tests/fixtures/converted/*.json):

cargo test pyspark_parity_fixtures

Or run phase-specific parity: PARITY_PHASE=a cargo test pyspark_parity_fixtures or make test-parity-phase-a … make test-parity-phase-g. See PARITY_STATUS.md.

Format mapping

| Sparkless | Robin-sparkless |
| --- | --- |
| `input_data` (list of dicts) | `input.schema` + `input.rows` (column-order arrays) |
| `right_input_data` / `second_input_data` | `right_input` (schema + rows) for join/union |
| `expected_output.schema` (field_names/field_types or fields) | `expected.schema` (array of `{name, type}`) |
| `expected_output.data` (list of dicts) | `expected.rows` (column-order arrays) |
| `operation` (e.g. filter_operations, groupby, join) | `operations` (array of `{op, ...}`) |
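
The expected_output rows of the mapping can be sketched in Python as follows. This is an illustration under stated assumptions, not the converter's implementation: the function names are hypothetical, entries of `fields` are assumed to be dicts with `name` and `type` keys, type names are passed through unchanged, and keys missing from a record are assumed to become null.

```python
def rows_from_dicts(records, column_names):
    """Turn a list of dicts into column-order arrays; missing keys become None."""
    return [[record.get(name) for name in column_names] for record in records]


def convert_expected(expected_output):
    """Map Sparkless `expected_output` to a robin-sparkless style `expected` object."""
    schema_in = expected_output["schema"]
    if "fields" in schema_in:
        # Assumption: each field is a dict carrying "name" and "type".
        schema = [{"name": f["name"], "type": f["type"]} for f in schema_in["fields"]]
    else:
        pairs = zip(schema_in["field_names"], schema_in["field_types"])
        schema = [{"name": name, "type": type_} for name, type_ in pairs]
    names = [column["name"] for column in schema]
    return {"schema": schema, "rows": rows_from_dicts(expected_output["data"], names)}
```

The schema fixes the column order, so each row array lines up with `expected.schema` regardless of the key order in the source dicts.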

Operation mapping (supported)

filter_operations / filter → { "op": "filter", "expr": "..." }. Use filter_expr in Sparkless JSON if present.
select → { "op": "select", "columns": [...] }
groupby / group_by → { "op": "groupBy", "columns": [...] } + { "op": "agg", "aggregations": [...] }
orderBy / order_by → { "op": "orderBy", "columns": [...], "ascending": [...] }
join → right_input from right_input_data / second_input_data; { "op": "join", "on": [...], "how": "inner"|"left"|"right"|"outer" }. Use join_on / on, join_how / how in Sparkless JSON.
window → { "op": "window", "column": "...", "func": "row_number"|"rank"|"dense_rank"|"lag"|"lead", "partition_by": [...], "order_by": [...], "value_column": "..." }. Use partition_by/partition_cols, order_by/order_cols, window_func/func, value_column, output_column.
withColumn / transformations → { "op": "withColumn", "column": "...", "expr": "..." }. Use with_column_name/column_name, with_column_expr/expr.
union / union_all → { "op": "union" }. For union by name, use union_by_name → { "op": "unionByName" }. Second DataFrame from right_input_data when present.
distinct / drop_duplicates → { "op": "distinct", "subset": [...] }
drop → { "op": "drop", "columns": [...] }. Use columns / drop_columns.
dropna / drop_null → { "op": "dropna", "subset": [...] }
fillna / fill_null → { "op": "fillna", "value": ... }. Use value / fill_value.
limit / head → { "op": "limit", "n": N }. Use n / limit.
withColumnRenamed / rename → { "op": "withColumnRenamed", "existing": "...", "new": "..." }. Use existing/old_name, new/new_name.
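
Several of the mappings above accept alternative key names (for example n / limit or existing / old_name). A minimal Python sketch of that lookup pattern for a few of the simpler operations is shown below. The helper names first_present and convert_operation are hypothetical, the record layout (a flat dict with an `operation` key) is an assumption, and the function returns a list because some operations, such as groupby, expand into more than one step.

```python
def first_present(record, *keys):
    """Return the value of the first key found in the record, else None."""
    for key in keys:
        if key in record:
            return record[key]
    return None


def convert_operation(record):
    """Map one Sparkless record to a list of robin-sparkless operation dicts.

    Only a few operations are covered here, purely for illustration.
    """
    op = record["operation"]
    if op in ("limit", "head"):
        return [{"op": "limit", "n": first_present(record, "n", "limit")}]
    if op in ("drop",):
        return [{"op": "drop", "columns": first_present(record, "columns", "drop_columns")}]
    if op in ("fillna", "fill_null"):
        return [{"op": "fillna", "value": first_present(record, "value", "fill_value")}]
    if op in ("withColumnRenamed", "rename"):
        return [{
            "op": "withColumnRenamed",
            "existing": first_present(record, "existing", "old_name"),
            "new": first_present(record, "new", "new_name"),
        }]
    raise ValueError(f"unsupported operation: {op}")
```

Raising on unknown operations keeps unsupported fixtures visible instead of silently producing an empty operations array.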

Parity discovery and skip

Fixtures: The parity test runs all tests/fixtures/*.json and tests/fixtures/converted/*.json (if present). Converted fixtures are written with --output-subdir converted so they do not overwrite hand-written fixtures.
Skip: Add "skip": true and optional "skip_reason": "..." in a fixture JSON to skip it in the parity test (e.g. known unsupported function or semantic difference). See SPARKLESS_PARITY_STATUS.md for recording pass/fail and failure reasons.
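
The discovery and skip rules above can be mirrored in a short Python sketch, which is handy for listing what would be skipped and why before running the Rust test. The actual parity test runs through cargo test; the functions discover_fixtures and split_skipped below are hypothetical and only reproduce the behavior described here.

```python
import json
from pathlib import Path


def discover_fixtures(fixtures_dir):
    """Hand-written fixtures first, then converted ones if the subdirectory exists."""
    root = Path(fixtures_dir)
    paths = sorted(root.glob("*.json"))
    converted = root / "converted"
    if converted.is_dir():
        paths += sorted(converted.glob("*.json"))
    return paths


def split_skipped(paths):
    """Separate runnable fixtures from those marked with "skip": true."""
    runnable, skipped = [], []
    for path in paths:
        fixture = json.loads(Path(path).read_text(encoding="utf-8"))
        if fixture.get("skip") is True:
            # skip_reason is optional, so fall back to an empty string.
            skipped.append((path, fixture.get("skip_reason", "")))
        else:
            runnable.append(path)
    return runnable, skipped
```

For example, split_skipped(discover_fixtures("tests/fixtures")) returns the fixtures that would run and a list of (path, reason) pairs for the rest.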

Converter status

Script: Implemented in tests/convert_sparkless_fixtures.py. Schema and row conversion, right_input for join/union, and operation mapping for filter, select, groupBy, agg, orderBy, join, window, withColumn, union, unionByName, distinct, drop, dropna, fillna, limit, withColumnRenamed are in place.
Target: Convert Sparkless expected_outputs with --batch and --output-subdir converted, then run make sparkless-parity (set SPARKLESS_EXPECTED_OUTPUTS when the Sparkless repo is available). Goal: 50+ tests passing (hand-written + converted); the further target is 160+ total fixtures once converted outputs are included.
Current: 159 hand-written fixtures passing (Phase 19: aggregates, try_*, misc; Phase 18: array/map/struct; Phase 17: datetime/unix, pmod, factorial; Phase 16: regexp; Phase 15: aliases, string, math, array_distinct, with array_distinct skipped).
Tests that may fail after conversion: Unsupported expressions, missing functions, or semantic differences. Fix the converter where possible; otherwise add skip: true to the fixture and document the reason in SPARKLESS_PARITY_STATUS.md.

Port status

When the Sparkless repo is available: run make sparkless-parity, then update SPARKLESS_PARITY_STATUS.md with converted/passing/failing/skipped counts and failure reasons. Re-run after Sparkless updates to refresh the converted fixtures and their PySpark-regenerated expected output.

SPARKLESS_PARITY_STATUS.md – pass/fail counts and failure reasons for converted fixtures
SPARKLESS_INTEGRATION_ANALYSIS.md §4 – fixture format comparison and conversion strategy
TEST_CREATION_GUIDE.md – robin-sparkless fixture format and test flow