Ported Test Expectations¶
Expected outputs for the ported Sparkless parity tests in tests/dataframe/test_dataframe_parity.py. Ground truth is PySpark; these were derived from Sparkless expected_outputs / PySpark behavior.
test_filter_salary_gt_60000¶
Input: INPUT_EMPLOYEES (4 rows: id, name, age, salary, department).
Op: filter(col("salary").gt(lit(60000))).
Expected rows (order irrelevant):
[
{"age": 35, "department": "IT", "id": 3, "name": "Charlie", "salary": 70000},
{"age": 40, "department": "Finance", "id": 4, "name": "David", "salary": 80000}
]
test_filter_and_operator¶
Input: [{"a":1,"b":2}, {"a":2,"b":3}, {"a":3,"b":1}].
Op: filter((col("a") > 1) & (col("b") > 1)).
Expected: 1 row: {"a": 2, "b": 3}.
test_filter_or_operator¶
Input: [{"a":1,"b":2}, {"a":2,"b":3}, {"a":3,"b":1}].
Op: filter((col("a") > 1) | (col("b") > 1)).
Expected: 3 rows (all rows).
test_basic_select¶
Input: INPUT_EMPLOYEES.
Op: select(["id", "name", "age"]).
Expected rows (order matters):
[
{"id": 1, "name": "Alice", "age": 25},
{"id": 2, "name": "Bob", "age": 30},
{"id": 3, "name": "Charlie", "age": 35},
{"id": 4, "name": "David", "age": 40}
]
test_select_with_alias¶
Input: INPUT_EMPLOYEES.
Op: with_column("user_id", col("id")).with_column("full_name", col("name")).select(["user_id", "full_name"]).
Expected rows (order matters):
[
{"user_id": 1, "full_name": "Alice"},
{"user_id": 2, "full_name": "Bob"},
{"user_id": 3, "full_name": "Charlie"},
{"user_id": 4, "full_name": "David"}
]
test_aggregation_avg_count¶
Input: INPUT_EMPLOYEES.
Op: group_by(["department"]).agg([avg(salary).alias("avg_salary"), count(id).alias("count")]).
Expected rows (order irrelevant):
[
{"department": "Finance", "avg_salary": 80000.0, "count": 1},
{"department": "HR", "avg_salary": 60000.0, "count": 1},
{"department": "IT", "avg_salary": 60000.0, "count": 2}
]
test_inner_join¶
Input: Employees (id, name, dept_id, salary) and Departments (dept_id, name, location).
Op: emp_df.join(dept_df, ["dept_id"], "inner").
Expected: 3 rows (dept_id 10 × 2, dept_id 20 × 1). Key checks:
- id 1: dept_id 10, salary 50000
- id 2: dept_id 20
- id 3: dept_id 10, salary 70000
Shared input: INPUT_EMPLOYEES¶
[
{"id": 1, "name": "Alice", "age": 25, "salary": 50000, "department": "IT"},
{"id": 2, "name": "Bob", "age": 30, "salary": 60000, "department": "HR"},
{"id": 3, "name": "Charlie", "age": 35, "salary": 70000, "department": "IT"},
{"id": 4, "name": "David", "age": 40, "salary": 80000, "department": "Finance"}
]
Schema: [("id","bigint"), ("name","string"), ("age","bigint"), ("salary","double"), ("department","string")].
See SPARKLESS_PYTHON_TEST_PORT.md for port tracker and test_dataframe_parity.py for the actual tests.