Test failure categories
Historical only (February 2026). The main suite is green as of May–June 2026 (3115 passed, 64 skipped, 0 failed). For current status use TEST_FAILURE_CHECKLIST.md.
Summary of ~605 failing tests (from pytest tests/ -n 10 with package installed via maturin in .venv), grouped by cause/area.
1. Missing native API / bindings (≈33+ failures)
Error pattern: AttributeError: module 'sparkless._native' has no attribute '...'
| Missing API | Affected area | Test files (examples) |
|---|---|---|
| arrays_overlap | Array contains / join | test_array_contains_join_parity.py, test_issue_331_array_contains_join.py |
Action: Implement and expose arrays_overlap (and any other missing array/join helpers) in the Rust extension and Python bindings.
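As a rough reference for what the binding has to return, the sketch below models Spark's three-valued `arrays_overlap` in pure Python. It is not the Rust implementation; the function is a hypothetical test oracle, and it assumes array elements are hashable.

```python
from typing import Optional, Sequence


def arrays_overlap(a: Optional[Sequence], b: Optional[Sequence]) -> Optional[bool]:
    """Hypothetical pure-Python model of Spark's arrays_overlap semantics."""
    if a is None or b is None:
        return None  # a null array yields null
    common = {x for x in a if x is not None} & {y for y in b if y is not None}
    if common:
        return True  # at least one shared non-null element
    # No shared element: the answer is unknown (null) when both arrays are
    # non-empty and either one contains a null, otherwise it is False.
    if a and b and (None in a or None in b):
        return None
    return False
```

A parity test could compare the native result against this model for a handful of inputs, including arrays that contain nulls.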
2. SQL: SHOW / DESCRIBE not supported (≈7 failures)
Error pattern: SQL: only SELECT, CREATE SCHEMA/DATABASE, DROP TABLE/VIEW/SCHEMA/DATABASE, and DESCRIBE are supported
| Unsupported | Tests |
|---|---|
| SHOW DATABASES | test_show_describe.py::test_show_databases |
| SHOW TABLES | test_show_tables, test_show_tables_in_database |
| DESCRIBE EXTENDED (parsed as table name) | test_describe_extended |
Action: Either extend the SQL parser and executor to support these SHOW/DESCRIBE forms, or mark these tests as skipped for the robin backend.
3. Schema / createDataFrame / DDL (≈80+ failures)
Error patterns:
TypeError: 'list' object is not callable, KeyError: 'name', KeyError: 'a',
AssertionError: expected TypeError, schema type mismatches (IntegerType vs LongType),
TypeError: row 0: expected dict, list, or tuple
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| DDL schema parsing | DDL string parsing or schema object construction differs | test_issue_372_create_data_frame.py, test_issue_418_nested_ddl.py |
| Infer schema parity | Infer schema result (column types/order) differs from expected | test_inferschema_parity.py (44 failures) |
| Single-type createDataFrame | Row type / schema handling for single-type DataFrames | test_issue_213_createDataFrame_with_single_type.py |
| Pandas/schema type | IntegerType vs LongType when creating from pandas with schema | test_issues_225_231.py::TestIssue229PandasDataFrameSupport |
Action: Align DDL parsing, createDataFrame (list/dict/Row, schema, pandas) and inferred schema with PySpark/upstream expectations; unify IntegerType vs LongType where required.
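The IntegerType vs LongType mismatches follow from how PySpark infers types from Python values: a Python `int` is inferred as LongType and a `float` as DoubleType. The sketch below shows that mapping for a few scalar types only; `infer_spark_type` and `infer_row_types` are hypothetical helper names, not part of sparkless or PySpark.

```python
def infer_spark_type(value) -> str:
    """Return the PySpark type name inferred for a Python scalar (sketch)."""
    if value is None:
        return "NullType"
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "BooleanType"
    if isinstance(value, int):
        return "LongType"  # Python ints are inferred as 64-bit, not IntegerType
    if isinstance(value, float):
        return "DoubleType"
    if isinstance(value, str):
        return "StringType"
    raise TypeError(f"type not covered by this sketch: {type(value).__name__}")


def infer_row_types(row: dict) -> dict:
    """Infer a type name per column for one dict-shaped row."""
    return {name: infer_spark_type(value) for name, value in row.items()}
```

Nested types, dates, and column ordering are left out on purpose; they would need to be checked against PySpark separately.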
4. Type / cast API and semantics (≈20+ failures)
Error patterns:
AttributeError: 'IntegerType' object has no attribute 'simpleString',
_native.SparklessError: conversion from str to i32/i64 failed,
assertions on cast result (e.g. null vs error for invalid string)
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| DataType.simpleString | Missing or different on IntegerType, LongType, StringType, etc. | test_issue_394_cast_data_type.py |
| String→int cast | Strict conversion (error instead of null for bad/invalid strings) | test_issue_217_string_to_int_cast.py, upstream cast tests |
| Cast/alias select | Result value or column name after cast | test_cast_alias_select_parity.py |
Action: Add or fix simpleString() on all DataType subclasses; decide and implement string→int semantics (null vs error) to match PySpark/upstream.
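The null-vs-error choice for string→int casts can be made concrete with a small model. With ANSI mode off, PySpark returns null for a string that cannot be parsed as an integer, while the current native behavior raises. The sketch below shows both modes behind a hypothetical `strict` flag; it only handles plain integer strings and a 32-bit target, and decimal strings such as '1.5' are out of scope.

```python
import re
from typing import Optional

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
_INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def cast_str_to_int(text: Optional[str], strict: bool = False) -> Optional[int]:
    """Cast a string to a 32-bit int; return None (or raise when strict) on failure."""
    if text is None:
        return None
    stripped = text.strip()
    if not _INT_PATTERN.fullmatch(stripped):
        if strict:
            raise ValueError(f"conversion from str to i32 failed: {text!r}")
        return None  # lenient mode: invalid input becomes null
    value = int(stripped)
    if not INT32_MIN <= value <= INT32_MAX:
        if strict:
            raise ValueError(f"value out of range for i32: {text!r}")
        return None  # lenient mode: overflow becomes null
    return value
```

Under these assumptions, `cast_str_to_int("abc")` returns `None` and `cast_str_to_int("abc", strict=True)` raises, which are the two behaviors the failing tests distinguish.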
5. Join semantics and API (≈30+ failures)
Error patterns:
AssertionError: DataFrames are not equivalent,
AssertionError: assert [(1, 10, 1, 20)] == [{'id': 1, 'v': 10, 'w': 20}] (row shape/keys),
column name / duplicate name errors
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| Outer / semi / anti / cross | Join type semantics or result schema/rows | test_join.py::test_outer_join, test_semi_join, test_anti_join, test_cross_join |
| Join on Column | Accept column expression in join condition; result columns | test_issue_353_join_on_accept_column.py |
| Join column names / aliases | Duplicate or mismatched names after join | test_issue_374_join_aliased_columns.py, test_issue_421_join_column_names.py |
| Left semi | LeftSemi join behavior | test_issue_438_leftsemi_join.py |
Action: Align join types (outer/semi/anti/cross) and “join on Column” behavior with PySpark; fix column naming/aliasing in join plans.
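The row-shape failure above (`(1, 10, 1, 20)` vs `{'id': 1, 'v': 10, 'w': 20}`) suggests the key column is being emitted twice when joining on a column name. The sketch below is a pure-Python reference for the expected shapes of inner, left semi, and left anti joins on a single named key. `join_on` is a hypothetical helper over lists of dicts, not the sparkless API.

```python
def join_on(left, right, key, how="inner"):
    """Join two lists of dict rows on one key column (reference sketch)."""
    # Null keys never match in SQL, so they are excluded from the lookup set.
    right_keys = {row[key] for row in right if row[key] is not None}
    if how == "left_semi":
        # Only left columns, only rows that have a match on the right.
        return [dict(l) for l in left if l[key] in right_keys]
    if how == "left_anti":
        # Only left columns, only rows without a match (null keys are kept).
        return [dict(l) for l in left if l[key] not in right_keys]
    if how == "inner":
        # The key column appears once; remaining right columns are appended.
        return [
            {**l, **{k: v for k, v in r.items() if k != key}}
            for l in left
            for r in right
            if l[key] is not None and l[key] == r[key]
        ]
    raise ValueError(f"join type not covered by this sketch: {how}")
```

For example, joining `[{"id": 1, "v": 10}]` with `[{"id": 1, "w": 20}]` on `"id"` produces one row with the columns `id`, `v`, `w`.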
6. NA / replace API and semantics (≈20+ failures)
Error patterns:
TypeError: replace() missing 1 required positional argument: 'value',
TypeError: PyColumn.replace() missing 1 required positional argument: 'replacement',
_native.SparklessError: cannot compare string with numeric type,
assertions on fill/replace result (e.g. assert '1' == 1)
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| DataFrameNaFunctions.replace | Signature or behavior (to_replace + value) | test_issue_360_na_replace.py, test_issue_287_na_replace.py, test_issue_379_column_replace_dict_list.py |
| eqNullSafe / null comparison | Type coercion or comparison with null/numeric | test_issue_248_column_eq_null_safe.py, test_issue_260_eq_null_safe.py |
| Fill/NA doc example | Default or fill value (e.g. assert None == 0) | test_doc_examples.py::test_user_guide_na_fill_drop |
Action: Implement or fix DataFrameNaFunctions.replace and Column replace signature/behavior; align eqNullSafe and NA fill with expected types and values.
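The "missing 1 required positional argument: 'value'" errors point at call forms where PySpark allows `value` to be omitted, namely when `to_replace` is a dict. One way to handle every form is to normalize the arguments into a single mapping first. The sketch below assumes that approach; `normalize_replace` and `apply_replace` are hypothetical names.

```python
_MISSING = object()  # sentinel so that value=None stays a legal replacement


def normalize_replace(to_replace, value=_MISSING) -> dict:
    """Turn the (to_replace, value) call forms into one old -> new mapping."""
    if isinstance(to_replace, dict):
        return dict(to_replace)  # the dict form carries its own values
    if value is _MISSING:
        raise TypeError("value is required when to_replace is not a dict")
    if isinstance(to_replace, (list, tuple)):
        if isinstance(value, (list, tuple)):
            if len(value) != len(to_replace):
                raise ValueError("to_replace and value must have the same length")
            return dict(zip(to_replace, value))
        return {old: value for old in to_replace}  # one value for every entry
    return {to_replace: value}  # scalar -> scalar


def apply_replace(values, mapping):
    """Apply the mapping to a column given as a plain list."""
    return [mapping.get(v, v) for v in values]
```

The native layer would then only need a single code path that takes the mapping, regardless of how the Python call was written.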
7. Order by / sort (≈15+ failures)
Error patterns:
assert [1, 2, 3] == [3, 2, 1],
TypeError: orderBy() expects column names as str or Column/SortOrder expressions
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| ascending=False | Single-column or list sort with ascending=False not applied | test_issue_378_order_by_ascending_bool.py, test_issue_327_orderby_ascending.py |
| orderBy(list) | orderBy with list of columns | test_issue_335_window_orderby_list.py, test_issue_415_orderby_list.py, parity window tests |
Action: Ensure orderBy accepts Column/SortOrder expressions and a boolean ascending argument; support a list of columns and match PySpark's sort order.
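The `[1, 2, 3] == [3, 2, 1]` failures are consistent with the `ascending` argument being dropped. In PySpark, `ascending` may be a single boolean or a list with one flag per column. The sketch below shows one way to normalize it and apply it to plain dict rows; `normalize_ascending` and `order_by` are hypothetical helpers, and null ordering is not modeled.

```python
def normalize_ascending(n_cols: int, ascending=True) -> list:
    """Expand ascending (bool or list of bools) to one flag per column."""
    if isinstance(ascending, (bool, int)):
        return [bool(ascending)] * n_cols
    flags = [bool(a) for a in ascending]
    if len(flags) != n_cols:
        raise ValueError("ascending must have one entry per sort column")
    return flags


def order_by(rows, cols, ascending=True):
    """Sort dict rows by one column name or a list of column names."""
    if isinstance(cols, str):
        cols = [cols]
    flags = normalize_ascending(len(cols), ascending)
    result = list(rows)
    # Python's sort is stable, so sorting from the least significant key to
    # the most significant one yields the combined multi-column order.
    for col, asc in reversed(list(zip(cols, flags))):
        result.sort(key=lambda r: r[col], reverse=not asc)
    return result
```

Accepting both a string and a list in `cols` mirrors the `orderBy(list)` case listed in the table above.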
8. Aggregates / groupby (≈25+ failures)
Error patterns:
KeyError: 'avg(Value)', KeyError: 'approx_count_distinct(value)',
_native.SparklessError: duplicate: column with name 'count' has more than one occurrence,
AssertionError: assert ('max(salary)' in ... or 'max_salary' in ...)
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| Aggregate column naming | Alias for agg (e.g. max(salary) vs max_salary) | test_first_method.py::test_first_after_groupby_agg, test_issue_397_groupby_alias.py |
| approx_count_distinct | Missing or different API/semantics | test_approx_count_distinct_rsd.py, parity |
| Duplicate column names | Multiple aggs producing same default name | test_column_case_variations.py, test_issue_286_aggregate_function_arithmetic.py |
| sum/mean on string column | Error vs null/behavior | test_issue_393_sum_string_column.py, test_issue_437_mean_string_column.py |
Action: Align aggregate expression naming and aliasing; implement or fix approx_count_distinct; avoid duplicate output names in groupby/agg.
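The KeyError patterns such as `'avg(Value)'` indicate that tests look up aggregate results by PySpark's default output name, which has the form `func(column)`, with `mean` reported as `avg`. The sketch below derives those names and flags collisions before they reach the native layer. All helper names are hypothetical, and only the `mean` to `avg` renaming is modeled.

```python
from collections import Counter

# PySpark reports mean() under the name avg(...); other functions keep their name.
_CANONICAL = {"mean": "avg"}


def default_agg_name(func: str, column: str) -> str:
    """Default output name for an un-aliased aggregate, e.g. max(salary)."""
    return f"{_CANONICAL.get(func, func)}({column})"


def output_names(aggs):
    """aggs is a list of (func, column, alias_or_None) tuples."""
    return [alias or default_agg_name(func, col) for func, col, alias in aggs]


def find_duplicates(names):
    """Return the output names that occur more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)
```

What to do with a reported duplicate (for example, renaming internally and restoring the name on output) is a design decision this sketch leaves open.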
9. String / array / JSON functions (≈50+ failures)
Error patterns:
AssertionError: DataFrames are not equivalent (string/array results),
AssertionError: assert 'a' == '', assert 7148569436472236994 == 8557436188178888239 (xxhash64),
_native.SparklessError: user error: list.eval operation not supported for dtype str
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| Levenshtein / xxhash64 / get_json_object / json_tuple | Result value or null handling | test_string.py (parity), test_issue_189_string_functions_robust.py |
| substring_index / regexp_extract | Edge cases or multiple matches | test_issue_189_string_functions_robust.py |
| split limit | Split with limit parameter | test_issue_328_split_limit.py, split_limit_parity |
| Array contains / explode / posexplode | Alias, lengths, or result schema | test_issue_293_explode_withcolumn.py, test_issue_366_alias_posexplode.py, test_issue_429_posexplode_no_alias.py, test_issue_430_posexplode_alias_execution.py |
| Array type / map type | Unsupported map/struct in array or JSON | test_array_type_robust.py, test_issue_339_column_subscript.py, test_issue_441_map_column_subscript.py |
Action: Align string/array/JSON function results and null handling with PySpark; fix split limit, explode/posexplode aliasing and length handling; extend support for map/struct in arrays where needed.
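For the split limit case, Spark documents that a positive `limit` caps the result at `limit` elements, with the last element holding the unsplit remainder, and that `limit <= 0` applies the pattern as often as possible. The sketch below models that rule with Python's `re` module. `split_with_limit` is a hypothetical helper; Python and Java regex dialects differ, so it is only a reference for simple patterns without capture groups.

```python
import re
from typing import List, Optional


def split_with_limit(text: Optional[str], pattern: str, limit: int = -1) -> Optional[List[str]]:
    """Model of split(str, pattern, limit) for simple patterns (sketch)."""
    if text is None:
        return None
    if limit == 1:
        # maxsplit=0 means "no limit" in re.split, so handle this case directly.
        return [text]
    if limit > 1:
        # At most `limit` elements means at most `limit - 1` splits.
        return re.split(pattern, text, maxsplit=limit - 1)
    # limit <= 0: split as many times as possible, keeping trailing empty strings.
    return re.split(pattern, text)
```

For example, splitting `"a,b,c,d"` on `","` with `limit=2` leaves the remainder `"b,c,d"` intact in the second element.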
10. Struct / withField / getField (≈25+ failures)
Error patterns:
_native.SparklessError: field not found: E1, field not found: k, field not found: Value,
AssertionError: assert None == 1, Expected struct for nested access 'StructValue'
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| Struct field alias | Alias after struct field selection | test_issue_330_struct_field_alias.py, test_struct_field_alias_parity.py |
| withField | Struct update / withField semantics | test_withfield.py, test_issue_398_withfield_window.py |
| getField / subscript | Nested field access on struct/column | test_issue_358_getfield.py, test_issue_339_column_subscript.py |
Action: Align struct field resolution (names/case), alias propagation, and withField/getField behavior with PySpark.
11. Window functions (≈25+ failures)
Error patterns:
assert 100 == 90, _native.SparklessError: duplicate: ... output name,
_native.SparklessError: expected leading integer in the duration string, found 'm' (date_trunc)
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| Window orderBy list | orderBy as list in window spec | test_window_orderby_list_parity.py, test_issue_335_window_orderby_list.py |
| date_trunc | Duration string format (e.g. '1 month' vs 'm') | test_date_trunc_robust.py, test_date_trunc_polars_backend.py |
| Row number / rank over descending | Order or frame | test_issue_414_row_number_over_descending.py |
| Window + arithmetic / withField | Combined window and struct/arithmetic | test_issue_398_withfield_window.py, test_window_arithmetic.py |
Action: Support window orderBy as list; align date_trunc duration parsing; fix window frame/ordering and duplicate column names in window plans.
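For the row number over descending order case, a small reference model can serve as a test oracle: rows are grouped by the partition key, each partition is sorted in descending order, and numbering restarts at 1 per partition. The helper name and the `row_number` output column are assumptions made for illustration, and ties and nulls are not modeled.

```python
from collections import defaultdict


def row_number_desc(rows, partition_key, order_key):
    """Add row_number per partition, ordered by order_key descending (sketch)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    numbered = []
    # Partitions are emitted in first-seen order; real engines give no such guarantee.
    for members in partitions.values():
        ordered = sorted(members, key=lambda r: r[order_key], reverse=True)
        for number, row in enumerate(ordered, start=1):
            numbered.append({**row, "row_number": number})
    return numbered
```

Because output order across partitions is not guaranteed by a real engine, a comparison against this model should sort both sides before asserting equality.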
12. Pow / coalesce / head / SparkContext / union(DataFrame-like) (≈15+ failures)
Error patterns:
TypeError: unsupported operand type(s) for ** or pow(): 'builtins.PyColumn' and 'int',
AttributeError: collect, AttributeError: 'list' object has no attribute 'collect',
TypeError: 'str' object is not callable,
TypeError: argument 'other': '...Wrapper' object cannot be converted to 'PyDataFrame'
| Subcategory | Cause | Test files (examples) |
|---|---|---|
| pow / ** | Column ** int not implemented or delegated to wrong path | test_issue_405_pow_bitwise.py |
| coalesce | Return type (e.g. int vs string) or variadic behavior | test_issue_407_coalesce_variadic.py |
| head() | DataFrame.head() returns list instead of DataFrame or missing .collect() | test_issue_413_head.py |
| SparkContext.version | API (property vs callable) | test_issue_387_spark_context.py, test_sparkcontext_validation.py |
| union(DataFrame-like) | Accept duck-typed “DataFrame-like” in union | test_issue_385_union_dataframe_like.py |
Action: Implement Column ** int (or route to correct native pow); fix coalesce result types; implement head() and SparkContext.version; consider accepting DataFrame-like in union or document limitation.
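The `unsupported operand type(s) for ** or pow()` error is what Python raises when a class defines neither `__pow__` nor a usable `__rpow__`. The sketch below shows the two dunder methods on a stand-in class. `SketchColumn` and its string expression form are hypothetical and stand in for whatever the native `PyColumn` routes to.

```python
class SketchColumn:
    """Stand-in for a column expression; stores the expression as text."""

    def __init__(self, expr: str):
        self.expr = expr

    @staticmethod
    def _operand(value) -> str:
        # Columns contribute their expression; plain Python values become literals.
        return value.expr if isinstance(value, SketchColumn) else repr(value)

    def __pow__(self, other):
        # Handles column ** 2 and column ** other_column.
        return SketchColumn(f"POWER({self.expr}, {self._operand(other)})")

    def __rpow__(self, other):
        # Handles 2 ** column, where the left operand is a plain Python value.
        return SketchColumn(f"POWER({self._operand(other)}, {self.expr})")
```

With both methods in place, `SketchColumn("x") ** 2` and `2 ** SketchColumn("x")` build an expression instead of raising TypeError.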
13. Error message wording (≈4 failures)
Error pattern: Tests expect substring 'cannot resolve' in error message; robin emits 'not found: column ...'.
| Tests | Expectation |
|---|---|
| test_issue_158_dropped_column_error.py | Message contains 'cannot resolve' |
Action: Either add 'cannot resolve' (or equivalent) to the error text for “column not found” cases, or relax the tests to accept the current message.
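If the tests are relaxed rather than the message changed, one option is a shared predicate that accepts either wording. The sketch below assumes the two substrings quoted above are the only ones in play; `is_unresolved_column_error` is a hypothetical helper name.

```python
# Substrings taken from the expectation above: PySpark-style and robin-style wording.
ACCEPTED_FRAGMENTS = ("cannot resolve", "not found: column")


def is_unresolved_column_error(message: str) -> bool:
    """True if the message reports an unknown column in either wording."""
    lowered = message.lower()
    return any(fragment in lowered for fragment in ACCEPTED_FRAGMENTS)
```

A test would then assert `is_unresolved_column_error(str(exc))` instead of checking for a single substring.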
14. Misc (single-file or few failures each)
- Format string / null: test_format_string_parity.py (null formatting).
- DML / INSERT: test_dml.py (if INSERT not supported).
- CTE / self-join: Duplicate output name in CTE (test_sql_cte_robust.py).
- Delta / schema evolution: test_delta_lake_schema_evolution.py.
- Notebooks: test_notebooks.py.
- Substr alias: test_issue_200_substr_alias.py (list not callable).
- Array literal/collect: test_issue_256_create_dataframe_array_column.py (e.g. '[1,2]' vs [1, 2]).
- unionByName diamond: test_issue_355.py (column type/doubled value).
- First/ignore nulls: test_first_ignorenulls.py.
- Pandas column order: test_issue_372_pandas_column_order.py.
- Select with list/tuple: Schema or column order after select([...]) (test_issue_202_select_with_list.py).
Counts by test location (top 15)
| Count | Path (prefix) |
|---|---|
| 44 | tests/upstream_sparkless/tests/unit/dataframe/test_inferschema_parity.py |
| 30 | tests/upstream_sparkless/tests/test_issue_331_array_contains_join.py |
| 29 | tests/upstream_sparkless/tests/test_issue_339_column_subscript.py |
| 28 | tests/upstream_sparkless/tests/test_issue_293_explode_withcolumn.py |
| 26 | tests/upstream_sparkless/tests/test_issue_295_withColumnRenamed_nonexistent.py |
| 17 | tests/upstream_sparkless/tests/test_issue_434_ltrim_rtrim_in_expr.py |
| 15 | tests/upstream_sparkless/tests/test_issue_330_struct_field_alias.py |
| 13 | tests/upstream_sparkless/tests/test_issue_413_union_createDataFrame.py |
| 13 | tests/upstream_sparkless/tests/test_issue_328_split_limit.py |
| 11 | tests/upstream_sparkless/tests/test_issue_441_map_column_subscript.py |
| 11 | tests/upstream_sparkless/tests/test_issue_437_mean_string_column.py |
| 10 | tests/upstream_sparkless/tests/test_issue_397_groupby_alias.py |
| 10 | tests/upstream_sparkless/tests/test_issue_393_sum_string_column.py |
| 10 | tests/upstream_sparkless/tests/test_issue_366_alias_posexplode.py |
| 10 | tests/upstream_sparkless/tests/parity/functions/test_array.py |
Suggested priority
- High impact / many failures: infer schema parity (44), array_contains_join / arrays_overlap (30+), column subscript (29), explode/withColumn (28), withColumnRenamed (26).
- API gaps: SHOW/DESCRIBE, replace/NA, orderBy(ascending=False), head(), pow(Column, int), approx_count_distinct, DataFrame-like union.
- Type/schema: DataType.simpleString, IntegerType vs LongType, createDataFrame DDL and row types.
- Join/aggregate naming: Join result shape/names, aggregate aliases, duplicate column names.
- String/array/struct: Levenshtein, xxhash64, split limit, struct field alias, withField, date_trunc.