Skip to content

Test failure categories

Historical only (February 2026). The main suite is green as of May–June 2026 (3115 passed, 64 skipped, 0 failed). For current status use TEST_FAILURE_CHECKLIST.md.

Summary of ~605 failing tests (from pytest tests/ -n 10 with package installed via maturin in .venv), grouped by cause/area.


1. Missing native API / bindings (≈33+ failures)

Error pattern: AttributeError: module 'sparkless._native' has no attribute '...'

Missing API Affected area Test files (examples)
arrays_overlap Array contains / join test_array_contains_join_parity.py, test_issue_331_array_contains_join.py

Action: Implement and expose arrays_overlap (and any other missing array/join helpers) in the Rust extension and Python bindings.


2. SQL: SHOW / DESCRIBE not supported (≈7 failures)

Error pattern: SQL: only SELECT, CREATE SCHEMA/DATABASE, DROP TABLE/VIEW/SCHEMA/DATABASE, and DESCRIBE are supported

Unsupported Tests
SHOW DATABASES test_show_describe.py::test_show_databases
SHOW TABLES test_show_tables, test_show_tables_in_database
DESCRIBE EXTENDED (parsed as table name) test_describe_extended

Action: Either extend SQL parser/execution to support these SHOW/DESCRIBE forms, or mark/skip these tests for robin backend.


3. Schema / createDataFrame / DDL (≈80+ failures)

Error patterns:
TypeError: 'list' object is not callable, KeyError: 'name', KeyError: 'a',
AssertionError: expected TypeError, schema type mismatches (IntegerType vs LongType),
TypeError: row 0: expected dict, list, or tuple

Subcategory Cause Test files (examples)
DDL schema parsing DDL string parsing or schema object construction differs test_issue_372_create_data_frame.py, test_issue_418_nested_ddl.py
Infer schema parity Infer schema result (column types/order) differs from expected test_inferschema_parity.py (44 failures)
Single-type createDataFrame Row type / schema handling for single-type DataFrames test_issue_213_createDataFrame_with_single_type.py
Pandas/schema type IntegerType vs LongType when creating from pandas with schema test_issues_225_231.py::TestIssue229PandasDataFrameSupport

Action: Align DDL parsing, createDataFrame (list/dict/Row, schema, pandas) and inferred schema with PySpark/upstream expectations; unify IntegerType vs LongType where required.


4. Type / cast API and semantics (≈20+ failures)

Error patterns:
AttributeError: 'IntegerType' object has no attribute 'simpleString',
_native.SparklessError: conversion from str to i32/i64 failed,
assertions on cast result (e.g. null vs error for invalid string)

Subcategory Cause Test files (examples)
DataType.simpleString Missing or different on IntegerType, LongType, StringType, etc. test_issue_394_cast_data_type.py
String→int cast Strict conversion (error instead of null for bad/invalid strings) test_issue_217_string_to_int_cast.py, upstream cast tests
Cast/alias select Result value or column name after cast test_cast_alias_select_parity.py

Action: Add or fix simpleString() on all DataType subclasses; decide and implement string→int semantics (null vs error) to match PySpark/upstream.


5. Join semantics and API (≈30+ failures)

Error patterns:
AssertionError: DataFrames are not equivalent,
AssertionError: assert [(1, 10, 1, 20)] == [{'id': 1, 'v': 10, 'w': 20}] (row shape/keys),
column name / duplicate name errors

Subcategory Cause Test files (examples)
Outer / semi / anti / cross Join type semantics or result schema/rows test_join.py::test_outer_join, test_semi_join, test_anti_join, test_cross_join
Join on Column Accept column expression in join condition; result columns test_issue_353_join_on_accept_column.py
Join column names / aliases Duplicate or mismatched names after join test_issue_374_join_aliased_columns.py, test_issue_421_join_column_names.py
Left semi LeftSemi join behavior test_issue_438_leftsemi_join.py

Action: Align join types (outer/semi/anti/cross) and “join on Column” behavior with PySpark; fix column naming/aliasing in join plans.


6. NA / replace API and semantics (≈20+ failures)

Error patterns:
TypeError: replace() missing 1 required positional argument: 'value',
TypeError: PyColumn.replace() missing 1 required positional argument: 'replacement',
_native.SparklessError: cannot compare string with numeric type,
assertions on fill/replace result (e.g. assert '1' == 1)

Subcategory Cause Test files (examples)
DataFrameNaFunctions.replace Signature or behavior (to_replace + value) test_issue_360_na_replace.py, test_issue_287_na_replace.py, test_issue_379_column_replace_dict_list.py
eqNullSafe / null comparison Type coercion or comparison with null/numeric test_issue_248_column_eq_null_safe.py, test_issue_260_eq_null_safe.py
Fill/NA doc example Default or fill value (e.g. assert None == 0) test_doc_examples.py::test_user_guide_na_fill_drop

Action: Implement or fix DataFrameNaFunctions.replace and Column replace signature/behavior; align eqNullSafe and NA fill with expected types and values.


7. Order by / sort (≈15+ failures)

Error patterns:
assert [1, 2, 3] == [3, 2, 1],
TypeError: orderBy() expects column names as str or Column/SortOrder expressions

Subcategory Cause Test files (examples)
ascending=False Single-column or list sort with ascending=False not applied test_issue_378_order_by_ascending_bool.py, test_issue_327_orderby_ascending.py
orderBy(list) orderBy with list of columns test_issue_335_window_orderby_list.py, test_issue_415_orderby_list.py, parity window tests

Action: Ensure orderBy accepts Column/SortOrder and boolean ascending; support list of columns and parity with PySpark order.


8. Aggregates / groupby (≈25+ failures)

Error patterns:
KeyError: 'avg(Value)', KeyError: 'approx_count_distinct(value)',
_native.SparklessError: duplicate: column with name 'count' has more than one occurrence,
AssertionError: assert ('max(salary)' in ... or 'max_salary' in ...)

Subcategory Cause Test files (examples)
Aggregate column naming Alias for agg (e.g. max(salary) vs max_salary) test_first_method.py::test_first_after_groupby_agg, test_issue_397_groupby_alias.py
approx_count_distinct Missing or different API/semantics test_approx_count_distinct_rsd.py, parity
Duplicate column names Multiple aggs producing same default name test_column_case_variations.py, test_issue_286_aggregate_function_arithmetic.py
sum/mean on string column Error vs null/behavior test_issue_393_sum_string_column.py, test_issue_437_mean_string_column.py

Action: Align aggregate expression naming and aliasing; implement or fix approx_count_distinct; avoid duplicate output names in groupby/agg.


9. String / array / JSON functions (≈50+ failures)

Error patterns:
AssertionError: DataFrames are not equivalent (string/array results),
AssertionError: assert 'a' == '', assert 7148569436472236994 == 8557436188178888239 (xxhash64),
_native.SparklessError: user error: list.eval operation not supported for dtype str

Subcategory Cause Test files (examples)
Levenshtein / xxhash64 / get_json_object / json_tuple Result value or null handling test_string.py (parity), test_issue_189_string_functions_robust.py
substring_index / regexp_extract Edge cases or multiple matches test_issue_189_string_functions_robust.py
split limit Split with limit parameter test_issue_328_split_limit.py, split_limit_parity
Array contains / explode / posexplode Alias, lengths, or result schema test_issue_293_explode_withcolumn.py, test_issue_366_alias_posexplode.py, test_issue_429_posexplode_no_alias.py, test_issue_430_posexplode_alias_execution.py
Array type / map type Unsupported map/struct in array or JSON test_array_type_robust.py, test_issue_339_column_subscript.py, test_issue_441_map_column_subscript.py

Action: Align string/array/JSON function results and null handling with PySpark; fix split limit, explode/posexplode aliasing and length handling; extend support for map/struct in arrays where needed.


10. Struct / withField / getField (≈25+ failures)

Error patterns:
_native.SparklessError: field not found: E1, field not found: k, field not found: Value,
AssertionError: assert None == 1, Expected struct for nested access 'StructValue'

Subcategory Cause Test files (examples)
Struct field alias Alias after struct field selection test_issue_330_struct_field_alias.py, test_struct_field_alias_parity.py
withField Struct update / withField semantics test_withfield.py, test_issue_398_withfield_window.py
getField / subscript Nested field access on struct/column test_issue_358_getfield.py, test_issue_339_column_subscript.py

Action: Align struct field resolution (names/case), alias propagation, and withField/getField behavior with PySpark.


11. Window functions (≈25+ failures)

Error patterns:
assert 100 == 90, _native.SparklessError: duplicate: ... output name,
_native.SparklessError: expected leading integer in the duration string, found 'm' (date_trunc)

Subcategory Cause Test files (examples)
Window orderBy list orderBy as list in window spec test_window_orderby_list_parity.py, test_issue_335_window_orderby_list.py
date_trunc Duration string format (e.g. '1 month' vs 'm') test_date_trunc_robust.py, test_date_trunc_polars_backend.py
Row number / rank over descending Order or frame test_issue_414_row_number_over_descending.py
Window + arithmetic / withField Combined window and struct/arithmetic test_issue_398_withfield_window.py, test_window_arithmetic.py

Action: Support window orderBy as list; align date_trunc duration parsing; fix window frame/ordering and duplicate column names in window plans.


12. Pow / coalesce / head / SparkContext / union(DataFrame-like) (≈15+ failures)

Error patterns:
TypeError: unsupported operand type(s) for ** or pow(): 'builtins.PyColumn' and 'int',
AttributeError: collect, AttributeError: 'list' object has no attribute 'collect',
TypeError: 'str' object is not callable,
TypeError: argument 'other': '...Wrapper' object cannot be converted to 'PyDataFrame'

Subcategory Cause Test files (examples)
pow / ** Column ** int not implemented or delegated to wrong path test_issue_405_pow_bitwise.py
coalesce Return type (e.g. int vs string) or variadic behavior test_issue_407_coalesce_variadic.py
head() DataFrame.head() returns list instead of DataFrame or missing .collect() test_issue_413_head.py
SparkContext.version API (property vs callable) test_issue_387_spark_context.py, test_sparkcontext_validation.py
union(DataFrame-like) Accept duck-typed “DataFrame-like” in union test_issue_385_union_dataframe_like.py

Action: Implement Column ** int (or route to correct native pow); fix coalesce result types; implement head() and SparkContext.version; consider accepting DataFrame-like in union or document limitation.


13. Error message wording (≈4 failures)

Error pattern: Tests expect substring 'cannot resolve' in error message; robin emits 'not found: column ...'.

Tests Expectation
test_issue_158_dropped_column_error.py Message contains 'cannot resolve'

Action: Either add 'cannot resolve' (or equivalent) to error text for “column not found” cases, or relax test to accept current message.


14. Misc (single-file or few failures each)

  • Format string / null: test_format_string_parity.py (null formatting).
  • DML / INSERT: test_dml.py (if INSERT not supported).
  • CTE / self-join: Duplicate output name in CTE (test_sql_cte_robust.py).
  • Delta / schema evolution: test_delta_lake_schema_evolution.py.
  • Notebooks: test_notebooks.py.
  • Substr alias: test_issue_200_substr_alias.py (list not callable).
  • Array literal/collect: test_issue_256_create_dataframe_array_column.py (e.g. '[1,2]' vs [1, 2]).
  • unionByName diamond: test_issue_355.py (column type/doubled value).
  • First/ignore nulls: test_first_ignorenulls.py.
  • Pandas column order: test_issue_372_pandas_column_order.py.
  • Select with list/tuple: Schema or column order after select([...]) (test_issue_202_select_with_list.py).

Counts by test location (top 15)

Count Path (prefix)
44 tests/upstream_sparkless/tests/unit/dataframe/test_inferschema_parity.py
30 tests/upstream_sparkless/tests/test_issue_331_array_contains_join.py
29 tests/upstream_sparkless/tests/test_issue_339_column_subscript.py
28 tests/upstream_sparkless/tests/test_issue_293_explode_withcolumn.py
26 tests/upstream_sparkless/tests/test_issue_295_withColumnRenamed_nonexistent.py
17 tests/upstream_sparkless/tests/test_issue_434_ltrim_rtrim_in_expr.py
15 tests/upstream_sparkless/tests/test_issue_330_struct_field_alias.py
13 tests/upstream_sparkless/tests/test_issue_413_union_createDataFrame.py
13 tests/upstream_sparkless/tests/test_issue_328_split_limit.py
11 tests/upstream_sparkless/tests/test_issue_441_map_column_subscript.py
11 tests/upstream_sparkless/tests/test_issue_437_mean_string_column.py
10 tests/upstream_sparkless/tests/test_issue_397_groupby_alias.py
10 tests/upstream_sparkless/tests/test_issue_393_sum_string_column.py
10 tests/upstream_sparkless/tests/test_issue_366_alias_posexplode.py
10 tests/upstream_sparkless/tests/parity/functions/test_array.py

Suggested priority

  1. High impact / many failures: infer schema parity (44), array_contains_join / arrays_overlap (30+), column subscript (29), explode/withColumn (28), withColumnRenamed (26).
  2. API gaps: SHOW/DESCRIBE, replace/NA, orderBy(ascending=False), head(), pow(Column, int), approx_count_distinct, DataFrame-like union.
  3. Type/schema: DataType.simpleString, IntegerType vs LongType, createDataFrame DDL and row types.
  4. Join/aggregate naming: Join result shape/names, aggregate aliases, duplicate column names.
  5. String/array/struct: Levenshtein, xxhash64, split limit, struct field alias, withField, date_trunc.