Skip to content

DataFrame test guide

DataFrame Test Guide (Python)

This guide describes how to write and maintain Python dataframe tests under tests/dataframe so they:

  • Use the shared test harness (fixtures + backend abstraction).
  • Treat PySpark behavior as the source of truth.
  • Work against both the PySpark backend and the Robin backend without changing test code.

1. Use the shared harness, not direct PySpark/sparkless

  • Do NOT import PySpark or sparkless directly in tests.
  • Always import through the shared helpers:
from tests.fixtures.spark_imports import get_spark_imports
  • Let tests/conftest.py provide the spark fixture and backend selection based on:
  • SPARKLESS_TEST_BACKEND / MOCK_SPARK_TEST_BACKEND
  • @pytest.mark.backend("pyspark" | "robin" | "both") markers

2. Canonical test pattern

For new or refactored tests in tests/dataframe, follow this structure:

from tests.fixtures.spark_imports import get_spark_imports


class TestSomeBehavior:
    def test_example(self, spark):
        imports = get_spark_imports()
        F = imports.F

        df = spark.createDataFrame(
            [
                {"id": 1, "value": 10},
                {"id": 2, "value": 20},
            ]
        )

        result = df.withColumn("double", F.col("value") * 2)
        rows = result.collect()

        assert len(rows) == 2
        assert rows[0]["double"] == 20

Key rules:

  • Take spark as a fixture argument; never construct SparkSession manually.
  • Get all functions/types (F, StructType, StringType, Window, …) from get_spark_imports().
  • Never call spark.stop() in tests; the fixture handles lifecycle.

3. Backend‑aware / PySpark‑only tests

Some tests are specifically about PySpark behavior (parity or API shape). For these:

  • Still use spark + get_spark_imports(); do not import PySpark directly.
  • Mark tests that must only run on PySpark with a backend marker, e.g.:
import pytest
from tests.fixtures.spark_backend import BackendType, get_backend_type


@pytest.mark.backend("pyspark")
def test_pyspark_only_behavior(spark):
    imports = get_spark_imports()
    F = imports.F
    ...

or use a helper such as get_backend_type() in a skip condition only when necessary. Prefer markers when possible so selection is obvious in -m / CI filters.


4. Expectations must match real PySpark 3.x

When writing or updating expectations:

  • Run the test in PySpark mode to see the real behavior:
SPARKLESS_TEST_BACKEND=pyspark pytest tests/dataframe/test_some_file.py -q
  • Assert on whatever PySpark actually does:
  • Datetime parsing and time zone behavior (e.g. spark.sql.legacy.timeParserPolicy).
  • Window function semantics (no window expressions allowed directly in WHERE; use withColumn + filter, or assert the specific AnalysisException).
  • Exact error types/messages only when they are stable and important (e.g. API misuse).

Examples from existing tests:

  • test_issue_170_to_date_timestamp_type.py:
  • Uses spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED") so timestamp parsing matches Spark 3.x expectations for ISO strings with microseconds.
  • test_to_timestamp_compatibility.py:
  • Asserts the accepted input types for to_timestamp and adapts tests where PySpark no longer raises a TypeError for certain implicit casts.

5. Common patterns to avoid

  • Avoid creating sessions manually:
# Avoid
from tests.fixtures.spark_imports import get_spark_imports
SparkSession = get_spark_imports().SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()

Always use the spark fixture instead.

  • Avoid reading environment variables directly in tests to choose backends:
  • Do not re‑implement _is_pyspark_mode() by inspecting SPARKLESS_TEST_BACKEND / MOCK_SPARK_TEST_BACKEND.
  • Let tests.fixtures.spark_backend handle that and use markers or get_backend_type() if you truly need conditional logic.

6. Running dataframe tests

  • PySpark backend (parity / expectation updates):
SPARKLESS_TEST_BACKEND=pyspark pytest tests/dataframe -n 10
  • Robin backend (default):
pytest tests/dataframe -n 10

For focused work on a single module:

SPARKLESS_TEST_BACKEND=pyspark pytest tests/dataframe/test_issue_170_to_date_timestamp_type.py

7. Harness verification (sparkless vs robin mode)

Tests use the proper harness when they:

  • Session: Use the spark fixture (e.g. def test_foo(self, spark):) or, when the fixture cannot be used, get_spark(...) / get_session(...) from tests.utils. Never construct a session with SparkSession.builder.appName(...).getOrCreate().
  • Imports: Use get_spark_imports() from tests.fixtures.spark_imports for F, types, Window, etc., or the legacy get_functions() / get_window_cls() from tests.utils. Never import pyspark or import sparkless in test files.

To find tests that still create sessions manually:

rg 'SparkSession\.builder\.appName|SparkSession\.builder\.getOrCreate' tests --glob 'test_*.py' -l

Fix them by taking spark as a fixture argument and removing manual session creation (and any spark.stop() calls).


8. Where to look for examples

  • Harness & fixtures:
  • tests/conftest.py
  • tests/fixtures/spark_backend.py
  • tests/fixtures/spark_imports.py
  • Canonical unit‑style tests:
  • tests/unit/test_issue_270_tuple_dataframe.py
  • tests/unit/dataframe/test_na_fill_robust.py
  • Dataframe tests using the new pattern:
  • tests/dataframe/test_issue_170_to_date_timestamp_type.py
  • tests/dataframe/test_function_api_compatibility.py
  • tests/dataframe/test_to_timestamp_compatibility.py