Robin Sparkless User Guide¶
This guide shows you how to use Robin Sparkless for everyday data work. It assumes basic familiarity with DataFrame concepts (like PySpark or Pandas).
What is Robin Sparkless?¶
Robin Sparkless is a PySpark-style DataFrame library that runs in Rust with Polars as the engine—no JVM. You get:
- Familiar APIs:
SparkSession,DataFrame,filter,select,group_by, etc. - Lazy by default: transformations extend the plan; only actions (
collect,show,count,write) trigger execution—aligns with PySpark and enables Polars query optimization. - Fast execution on Polars
Two expression APIs¶
- ExprIr (engine-agnostic): Use
col,lit_i64,gt,when, etc. from the crate root; they build anExprIrtree. Usefilter_expr_ir,select_expr_ir,with_column_expr_ir,collect_rows, andGroupedData::agg_expr_ir. Prefer this for new code and embeddings; errors areEngineError, and the public API does not expose Polars types. - Column / Expr (Polars): Use the prelude or
robin_sparkless::functionsforcol,lit_i64, etc., which returnColumn. Usefilter,with_column,select_exprs, and the full set of string/window/aggregate functions. Use this when you need the full PySpark-like API or are porting existing Column-based code.
The rest of this guide shows the Column API (prelude) for maximum familiarity; you can substitute the ExprIr equivalents where noted.
Installation¶
Rust¶
Add to Cargo.toml:
Optional features:
robin-sparkless = { version = "4", features = ["sql"] } # spark.sql(), temp views
robin-sparkless = { version = "4", features = ["delta"] } # Delta Lake read/write
robin-sparkless = { version = "4", features = ["jdbc"] } # PostgreSQL read/write
robin-sparkless = { version = "4", features = ["sqlite"] } # SQLite read/write
robin-sparkless = { version = "4", features = ["jdbc_mysql"] } # MySQL/MariaDB
Getting Started¶
Your First Session¶
use robin_sparkless::SparkSession;
let spark = SparkSession::builder()
.app_name("my_app")
.get_or_create();
Creating a DataFrame¶
let df = spark.create_dataframe(
vec![
(1, 25, "Alice".to_string()),
(2, 30, "Bob".to_string()),
(3, 35, "Charlie".to_string()),
],
vec!["id", "age", "name"],
)?;
From files
let df = spark.read_csv("data.csv")?;
let df = spark.read_parquet("data.parquet")?;
let df = spark.read_json("data.json")?;
Core Operations¶
Filter¶
Keep rows that satisfy a condition.
use robin_sparkless::{col, lit_i64};
let adults = df.filter(col("age").gt(lit_i64(25).into_expr()).into_expr())?;
Select¶
Choose columns (and optionally transform them) using select and expressions.
With Column¶
Add or replace a column with computed values using with_column or with_column_expr.
Order By and Limit¶
Sort by one or more columns and take the first N rows using order_by and limit.
Joins¶
Join two DataFrames on common columns using DataFrame::join with JoinType (Inner, Left, Right, Outer).
Aggregations¶
Group and aggregate with group_by and GroupedData methods such as count, sum, avg, min, max, and more.
Reading and Writing Data¶
Use SparkSession::read_csv, read_parquet, and read_json to read data, and DataFrame::write (writer API) to write Parquet/CSV/JSON.
JDBC / External Databases¶
Read from and write to external databases using a PySpark-compatible JDBC API. Supported backends:
| Backend | Feature | URL Example |
|---|---|---|
| PostgreSQL | jdbc |
jdbc:postgresql://localhost:5432/mydb |
| SQLite | sqlite |
jdbc:sqlite:/path/to/db.db |
| MySQL | jdbc_mysql |
jdbc:mysql://localhost:3306/mydb |
| MariaDB | jdbc_mariadb |
jdbc:mariadb://localhost:3307/mydb |
| SQL Server | jdbc_mssql |
jdbc:sqlserver://localhost:1433;databaseName=mydb |
| Oracle | jdbc_oracle |
jdbc:oracle:thin:@//localhost:1521/ORCL |
| DB2 | jdbc_db2 |
jdbc:db2://localhost:50000/mydb |
Add the feature in your Cargo.toml:
robin-sparkless = { version = "4", features = ["jdbc"] } # PostgreSQL
robin-sparkless = { version = "4", features = ["sqlite"] } # SQLite (file-based)
robin-sparkless = { version = "4", features = ["jdbc_mysql"] } # MySQL
Python Example¶
url = "jdbc:postgresql://localhost:5432/mydb"
props = {"user": "admin", "password": "secret"}
# Basic read
df = spark.read.jdbc(url=url, table="users", properties=props)
# Read with options (PySpark-compatible)
df = (spark.read
.format("jdbc")
.option("url", url)
.option("dbtable", "users")
.option("sessionInitStatement", "SET timezone='UTC'")
.option("queryTimeout", "30")
.options(props)
.load("."))
# Write with batching and truncate
df.write.jdbc(
url=url,
table="users_backup",
properties={"batchsize": "5000", "truncate": "true", **props},
mode="overwrite"
)
# SQLite (no server required)
df = spark.read.jdbc(url="jdbc:sqlite:/tmp/test.db", table="my_table", properties={})
df.write.jdbc(url="jdbc:sqlite:/tmp/test.db", table="results", properties={}, mode="append")
Supported Options¶
| Option | Type | Description |
|---|---|---|
sessionInitStatement |
String | SQL to execute after connection (e.g., SET timezone='UTC') |
queryTimeout |
Integer | Query timeout in seconds |
prepareQuery |
String | SQL to execute before main query (for CTEs, temp tables) |
fetchsize |
Integer | Rows per fetch |
batchsize |
Integer | Rows per batch/transaction on write (default: 1000) |
truncate |
Boolean | Use TRUNCATE vs DELETE for Overwrite mode |
cascadeTruncate |
Boolean | Add CASCADE to TRUNCATE (PostgreSQL/Oracle) |
Save Modes¶
| Mode | Behavior |
|---|---|
append |
Insert rows into existing table |
overwrite |
Truncate/delete existing data, then insert |
error |
Error if table has any existing rows |
ignore |
Do nothing if table has existing rows |
See JDBC_TESTING.md for setup, Docker Compose files, and CI configuration.
SQL (Optional)¶
With the sql feature, you can run SQL against temp views.
spark.create_or_replace_temp_view("people", df.clone());
let result = spark.sql("SELECT name, age FROM people WHERE age > 25 ORDER BY age")?;
Supports: SELECT, FROM, JOIN, WHERE, GROUP BY, ORDER BY, LIMIT. Built-in functions (e.g. UPPER, LOWER) and registered UDFs work in SQL.
User-Defined Functions¶
Register custom functions and use them in DataFrames or SQL.
Python UDF
def double(x):
return x * 2 if x is not None else None
my_udf = spark.udf().register("double", double, return_type="int")
df2 = df.with_column("doubled", my_udf(rs.col("id")))
SQL
spark.udf().register("double", double, return_type="int")
result = spark.sql("SELECT id, double(id) AS doubled FROM people")
See UDF Guide for full details.
Persistence and Tables¶
- Temp views:
df.createOrReplaceTempView("my_table")— in-session only - Global temp views:
df.createOrReplaceGlobalTempView("global_table")— visible across sessions - Saved tables:
df.write().saveAsTable("my_table", mode="overwrite")— disk-backed whenspark.sql.warehouse.diris set
See Persistence Guide for more.
Common Patterns¶
Chaining Operations¶
result = (
df.filter(rs.col("age") > 18)
.select([rs.col("name"), rs.col("age")])
.order_by(["age"], ascending=[False])
.limit(10)
)
Conditional Logic (when/then/otherwise)¶
# Nested when/then/otherwise for multiple conditions
df2 = df.with_column(
"category",
rs.when(rs.col("age") >= 65)
.then(rs.lit("senior"))
.otherwise(
rs.when(rs.col("age") >= 18).then(rs.lit("adult")).otherwise(rs.lit("minor"))
),
)
Handling Nulls¶
df2 = df.with_column("age_filled", rs.coalesce(rs.col("age"), rs.lit(0)))
df2 = df.na().fill(rs.lit(0)) # Fill nulls in all columns with 0
df2 = df.na().drop(subset=["name"]) # Drop rows with null in "name"
Collecting Results¶
Rust
let rows = df.collect_as_json_rows()?; // Vec<HashMap<String, JsonValue>>
df.show(Some(20))?; // Print to stdout
Python
rows = df.collect() # List of dicts
df.show(20) # Print to stdout
# to_pandas() returns list of dicts; for a pandas DataFrame use:
# pandas.DataFrame.from_records(df.to_pandas())
Example collect() output for the quick-start DataFrame (id, age, name):
[{'id': 1, 'age': 25, 'name': 'Alice'}, {'id': 2, 'age': 30, 'name': 'Bob'}, {'id': 3, 'age': 35, 'name': 'Charlie'}]
Troubleshooting¶
| Error | Cause | Fix |
|---|---|---|
| Column 'X' not found | Typo or wrong case | Check column names with df.columns() |
| create_dataframe: expected 3 column names | Rust create_dataframe needs exactly 3 columns |
In Python use createDataFrame(data, schema) for any schema |
| call_udf: no session | UDF used before session created | Use SparkSession.builder().get_or_create() first |
| SQL: unknown function | Function not built-in or UDF | Register with spark.udf().register() or use a built-in |
Next Steps¶
- Quickstart — Build from source, more examples For end-to-end API details, see the Rust docs on docs.rs.
- UDF Guide — Custom functions in detail
- Persistence Guide — Temp views, tables, warehouse
- PySpark Differences — How Robin differs from PySpark