Skip to content

Plan: Bugs and Improvements

A repeatable plan for finding bugs and needed improvements in robin-sparkless. Use this for periodic audits, before releases, or when onboarding.

Related docs: PARITY_STATUS.md, PYSPARK_DIFFERENCES.md, GAP_ANALYSIS_SPARKLESS_3.28.md, DEFERRED_SCOPE.md.


Part A: Finding bugs

A1. Run the full test suite

Step Command What it catches
Rust unit + integration cargo test Regressions, doc tests
Parity (all phases) make test-parity-phases or pytest tests/parity/ Behavioral drift vs PySpark fixtures
Python Out-of-tree (e.g. Sparkless); no in-repo make test-python Bindings, expression semantics
Full CI make check-full Format, clippy, audit, deny, Rust tests, Python lint, Python tests

Fix any failing tests first; add a test for every bug you fix.

A2. Static analysis (zero warnings)

Tool Command Focus
Format cargo fmt --check Consistency
Clippy cargo clippy -- -D warnings Panics, unnecessary clones, wrong patterns
Audit cargo audit Known vulnerabilities
Deny cargo deny check advisories bans sources Policy (e.g. git sources)
Python make lint-python (ruff + mypy) Unused imports, types, style

Treat new warnings as potential bugs or tech debt.

A3. Panic- and crash-prone patterns

Search library code (exclude tests/, examples/, benches/):

  • .unwrap() — Prefer ?, unwrap_or, or return Result/Option.
  • .expect("...") — Only for documented invariants; avoid in PyO3 paths (poisoned lock → abort).
  • panic! / unreachable!() — Must match documented API contracts (e.g. in lib.rs).
  • Unchecked indexing — e.g. row[idx], args[n] in plan/udf code; validate length/arity first.

Past fixes include PyDataFrameWriter lock handling in src/python/dataframe.rs (poisoned lock fallback; Python bindings are now out-of-tree) and type-conversion functions in src/functions.rs (to_char, to_number, etc.) now returning Result instead of panicking.

A4. Error handling at boundaries

  • Public Rust APIs: Can fail → return Result<_, PolarsError> (or appropriate type); use ? internally.
  • Python: Rust ResultPyResult; raise clear Python exceptions; no panics across FFI.
  • Hot paths: create_dataframe, create_dataframe_from_rows, read_*, filter, select, joins, execute_plan, SQL.

Check that invalid input (wrong schema, missing columns, bad plan JSON) produces clear errors, not panics.

A5. Python bindings (out-of-tree)

  • Python bindings are maintained outside this repo (e.g. in Sparkless). When adding or changing Rust API used by bindings: document optional/Result types and JSON-friendly row shapes; keep error messages clear for FFI callers.
  • Rust surface: create_dataframe_from_rows, create_dataframe_from_single_column, verify_schema, collect_as_json_rows, execute_plan are the main entry points for bindings. (Historical: extraction order #182, locks in code called from Python.)

A6. Parity and known gaps

  • Skipped fixtures: In tests/fixtures/*.json, any "skip": true — confirm reason still valid and not a latent bug we should fix.
  • Expected-error fixtures: "expect_error": true — ensure the harness still expects failure and the error type/message is still correct.
  • Plan fixtures: tests/fixtures/plans/*.json — run plan_parity_fixtures; add tests for new plan ops when you add them.

A7. Fuzz and edge cases (optional)

  • Empty inputs: Empty DataFrame, empty schema, zero-row reads, limit(0).
  • Nulls: Null in partition keys, in join keys, in aggregates, in expressions (e.g. when(null)).
  • Large inputs: Very wide schema, many partition keys, huge row count (if you care about OOM or perf).

Part B: Finding needed improvements

B1. API and parity gaps (in scope)

Source Doc / Command What to do
Gap vs Sparkless 3.28 GAP_ANALYSIS_SPARKLESS_3.28.md List of functions/methods not yet in robin-sparkless; prioritize ones in scope (not in DEFERRED_SCOPE.md).
Gap vs PySpark repo make gap-analysis / make gap-analysis-quick Produces/updates gap list; implement or document.
Missing vs PySpark ROBIN_SPARKLESS_MISSING.md Canonical “missing vs PySpark” list; track closure.
Signature alignment SIGNATURE_ALIGNMENT_TASKS.md, SIGNATURE_GAP_ANALYSIS.md Optional params, overloads, aliases (e.g. two-arg when).

Improvement types: Implement missing function, add optional parameter, add alias, add fixture for existing behavior.

B2. Behavioral differences (document or fix)

  • Doc: PYSPARK_DIFFERENCES.md — every intentional or known divergence (stubs, JVM, crypto, createDataFrame, etc.).
  • Improvements: Where we can match PySpark without breaking existing users, consider changing behavior and documenting; else keep doc up to date.

B3. Performance

Area How to check Improvement ideas
Rust vs Polars cargo bench (e.g. filter_select_groupby) Keep within ~2x of raw Polars; reduce clones, unnecessary collects.
Python overhead PyO3 call overhead, repeated collect/show Batch operations, avoid round-trips where possible.
IO Large CSV/Parquet/JSON read/write Reader/writer options, streaming if needed.

B4. Code quality and maintainability

  • Duplication: Same logic in Rust and Python (e.g. option handling); consider helpers or codegen.
  • Complexity: Long functions in plan/expr.rs, functions.rs, or Python bindings; split or document.
  • Tests: New behavior should have a unit, integration, or parity test; document in TEST_CREATION_GUIDE.md.
  • Docs: Public API docs, lib.rs panic/behavior notes, and user-facing docs (QUICKSTART.md, EMBEDDING.md) up to date.

B5. Sparkless integration (Phase 27)

  • Fixture converter: Sparkless expected_outputs → robin-sparkless fixture format (SPARKLESS_INTEGRATION_ANALYSIS.md §4).
  • Run Sparkless test suite with Robin backend: find failing tests; fix bugs or add missing API so more tests pass.
  • Reported upstream: SPARKLESS_PARITY_ISSUES_REPORTED.md — when Sparkless fixes issues, re-run and adjust Robin if needed.

B6. Roadmap and deferred scope

  • Roadmap: ROADMAP.md, FULL_BACKEND_ROADMAP.md — Phase 26 (publish), Phase 27 (Sparkless integration).
  • Deferred: DEFERRED_SCOPE.md — RDD, UDF, streaming, XML, sketch, etc. Improvements here are “out of scope” unless the project decides otherwise; document workarounds.

Part C: Prioritization

  1. Bugs first: Failing tests, panics in library code, wrong results in parity tests, PyO3 crashes.
  2. Then: Error handling (clear errors instead of panics), static analysis clean (clippy, mypy, ruff).
  3. Then: High-value API gaps (in-scope functions used by Sparkless or commonly requested).
  4. Then: Performance, code quality, docs, and remaining gap closure.

Use GitHub issues to track bugs and improvements; tag with bug, improvement, parity, sparkless, etc.


Quick checklist (run periodically)

  • [ ] make check-full passes.
  • [ ] No new .unwrap() / .expect() in library code without a comment or doc.
  • [ ] Parity: no new skipped fixtures without a reason; expect_error fixtures still valid.
  • [ ] PYSPARK_DIFFERENCES.md and PARITY_STATUS.md updated if behavior or coverage changed.
  • [ ] Open bugs from GitHub triaged or fixed.
  • [ ] Gap analysis (quick or full) run if you touched the API surface.