Sparkless Python Test Port Tracker
This doc tracks which Sparkless Python tests have been ported to run against robin_sparkless in this repo, which were skipped as duplicates of existing fixtures, and which are deferred (unsupported API or low priority).
Ground truth: Expected results for ported tests come from PySpark (not Sparkless). Use the spark fixture from tests/conftest.py and assert_rows_equal() from tests/utils.py in ported tests.
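As a rough illustration of what a row comparison with an order_matters switch could look like, here is a minimal sketch. It is not the actual helper in tests/utils.py: the name assert_rows_equal_sketch, the representation of rows as plain dicts, and the sample rows are all assumptions made for illustration, and the real assert_rows_equal() may handle floats, nulls, or Row objects differently.

```python
from collections import Counter
from typing import Any, Mapping, Sequence, Tuple


def _freeze(row: Mapping[str, Any]) -> Tuple:
    """Turn a row dict into a hashable key that ignores column order.

    Values are compared via repr(), so 1 and 1.0 count as different and
    there is no float tolerance (a simplification in this sketch).
    """
    return tuple(sorted((key, repr(value)) for key, value in row.items()))


def assert_rows_equal_sketch(
    actual: Sequence[Mapping[str, Any]],
    expected: Sequence[Mapping[str, Any]],
    order_matters: bool = False,
) -> None:
    """Hypothetical stand-in for assert_rows_equal(); raises AssertionError on mismatch."""
    frozen_actual = [_freeze(row) for row in actual]
    frozen_expected = [_freeze(row) for row in expected]
    if order_matters:
        # Row order is part of the contract (e.g. after an orderBy).
        assert frozen_actual == frozen_expected, (
            f"rows differ (ordered): {actual!r} != {expected!r}"
        )
    else:
        # Compare as multisets so duplicate rows are still counted.
        assert Counter(frozen_actual) == Counter(frozen_expected), (
            f"rows differ (unordered): {actual!r} != {expected!r}"
        )


# Illustrative rows only; real expectations should be taken from PySpark.
expected_rows = [
    {"name": "Alice", "salary": 70000},
    {"name": "Bob", "salary": 65000},
]
actual_rows = [
    {"salary": 65000, "name": "Bob"},
    {"salary": 70000, "name": "Alice"},
]
assert_rows_equal_sketch(actual_rows, expected_rows, order_matters=False)
```

Comparing as a multiset by default suits operations such as filter or groupBy, where row order is not guaranteed; passing order_matters=True is only appropriate when the test sorts its output explicitly.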
Why 45 Python tests (and fixture-driven parity)?
Sparkless has hundreds of tests (313+ test files, many test methods in parity/dataframe/, parity/functions/, parity/sql/, etc.). Python tests live under tests/ (e.g. tests/dataframe/, tests/functions/); many scenarios are covered by fixture-driven parity (Rust) instead. Rationale:
- No duplicates – We skip any scenario already covered by (a) hand-written fixtures in tests/fixtures/*.json, (b) 226 converted fixtures in tests/fixtures/converted/ (from Sparkless expected_outputs), or (c) test_robin_sparkless.py. So e.g. filter (age > 30) and group_by count were not ported again.
- Fixture conversion covers many scenarios – The 226 converted JSON fixtures are run as Rust parity tests (cargo test pyspark_parity_fixtures). Those are “ported” as fixture-driven parity, not as separate Python test functions. The 45 Python tests add coverage with inlined or predetermined expectations (no PySpark at test runtime).
- API gaps – Many Sparkless tests depend on PySpark APIs Robin does not fully implement: SQL (spark.sql, DDL/DML, subqueries, CTEs, HAVING), RDD (df.rdd, foreach, foreachPartition, mapInPandas, mapPartitions), advanced UDFs (full pandas_udf decorator surface and UDTF; Robin currently supports scalar and column-wise vectorized UDFs via spark.udf().register, plus grouped aggregation via pandas_udf(..., function_type="grouped_agg") only in groupBy().agg(...)), catalog (databases, tables, writeTo), streaming (withWatermark, isStreaming), and built-in functions (XML/XPath, sentences, sketch functions, etc.). All of these are standard PySpark APIs; see ROBIN_SPARKLESS_MISSING.md (scoped to PySpark parity). Those tests are deferred until Robin supports them.
- Initial batch – The port focused on a small set of DataFrame ops Robin fully supports (filter, select, join, groupBy+agg) that were not duplicate coverage. Expanding the port means adding more Python tests for supported ops and/or implementing more APIs and then porting the corresponding tests.
Ported
| Sparkless test / scenario | Robin test / location | Notes |
|---|---|---|
| test_filter_with_boolean (filter salary > 60000) | test_dataframe_parity.py::test_filter_salary_gt_60000 | |
| test_filter_with_and_operator | test_dataframe_parity.py::test_filter_and_operator | |
| test_filter_with_or_operator | test_dataframe_parity.py::test_filter_or_operator | |
| test_basic_select | test_dataframe_parity.py::test_basic_select | |
| test_select_with_alias | test_dataframe_parity.py::test_select_with_alias | via with_column + select |
| test_aggregation (groupBy + avg, count) | test_dataframe_parity.py::test_aggregation_avg_count | |
| test_inner_join | test_dataframe_parity.py::test_inner_join | |
Skipped (duplicate of existing coverage)
| Sparkless test / scenario | Existing coverage | Notes |
|---|---|---|
| test_filter_operations (age > 30) | fixture filter_age_gt_30, test_filter_and_select | Same scenario. |
| test_group_by (department count) | test_robin_sparkless.py::test_group_by_count, fixture groupby_count | Same scenario. |
Deferred
| Sparkless test / scenario | Reason |
|---|---|
| (unsupported API, RDD, streaming, etc.) | List here when a candidate is not ported due to missing API or low priority. |
How to port (no duplicates)
- Discover Sparkless tests under tests/ (e.g. parity/dataframe/, parity/functions/, unit/) that use only APIs Robin implements (see plan: filter, select, groupBy, join, window, etc.).
- Deduplicate: For each candidate, check (1) whether the same scenario is already covered by a parity fixture in tests/fixtures/*.json or tests/fixtures/converted/*.json, and (2) whether tests/python/test_robin_sparkless.py already has an equivalent test. If yes, add it to "Skipped" above and do not port it.
- Port: Add a new test under tests/ (e.g. tests/dataframe/, tests/functions/) using the spark fixture and expected values from PySpark. Use assert_rows_equal(actual, expected, order_matters=...) from tests/utils.py.
- Update this table: Add the ported test under "Ported" or the skipped one under "Skipped (duplicate)".
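The fixture half of the deduplicate step can be partly automated. The sketch below lists fixture files whose file name matches a scenario name. The function find_fixture_matches is hypothetical and not part of this repo, and it assumes that a fixture's file stem is its scenario name (as with filter_age_gt_30 and groupby_count in the "Skipped" table). A name match is only a hint: the fixture contents and test_robin_sparkless.py still need a manual look.

```python
from pathlib import Path
from typing import List

# Fixture locations named in this doc, relative to the repo root.
FIXTURE_PATTERNS = ("tests/fixtures/*.json", "tests/fixtures/converted/*.json")


def find_fixture_matches(scenario: str, repo_root: Path) -> List[Path]:
    """Return fixture files whose stem equals the scenario name.

    Hypothetical helper: assumes file stem == scenario name.
    """
    matches: List[Path] = []
    for pattern in FIXTURE_PATTERNS:
        # "*" does not cross directory separators, so the two patterns
        # cover the hand-written and converted fixtures separately.
        matches.extend(p for p in repo_root.glob(pattern) if p.stem == scenario)
    return sorted(matches)


if __name__ == "__main__":
    # Run from the repo root; prints nothing if no fixture matches.
    for hit in find_fixture_matches("filter_age_gt_30", Path(".")):
        print(hit)
```

An empty result does not prove the scenario is new, since an existing fixture may cover the same behavior under a different name.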
Related
- CONVERTER_STATUS.md – fixture conversion and dedupe
- SPARKLESS_PARITY_STATUS.md – fixture parity results
- TEST_CREATION_GUIDE.md – fixture format and PySpark as oracle