Persistence Between Sessions: Design¶
Status: Implemented (Feb 2026). Both Option A (global temp views) and Option B (disk-backed saveAsTable) are implemented.
This document explores persistence between sessions as an optional feature that aligns with PySpark's interface.
PySpark Reference Model¶
| Mechanism | Scope | Lifetime | Access |
|---|---|---|---|
| createOrReplaceTempView | Session | Until session closes | spark.table("name") |
| createOrReplaceGlobalTempView | Application | Until process ends | spark.table("global_temp.name") |
| saveAsTable (Hive metastore) | Application / cluster | Disk-backed | spark.table("db.table") or spark.table("table") |
Robin-sparkless currently implements temp views and saveAsTable as in-memory, session-scoped only.
Proposed Options¶
Option A: Global Temp Views (PySpark-aligned, in-memory)¶
Goal: createOrReplaceGlobalTempView persists across sessions within the same process.
PySpark behavior:
- df.createOrReplaceGlobalTempView("people") registers in global_temp database
- Any SparkSession in the same application can access via spark.table("global_temp.people")
- Dropped when the Spark application (process) ends
Implementation:
1. Add a process-wide GlobalTempViewCatalog (e.g. Arc<Mutex<HashMap<String, DataFrame>>>)
2. SparkSession holds an Arc reference to this shared catalog (injected at construction or via OnceLock)
3. create_or_replace_global_temp_view / create_global_temp_view → write to global catalog (not session catalog)
4. drop_global_temp_view → remove from global catalog
5. table("global_temp.xyz") → resolve from global catalog (separate from temp view / saved table lookup)
6. SQL translator: support FROM global_temp.xyz in addition to plain table names
Resolution order for table(name):
- If name is global_temp.xyz: look up in global catalog only
- Else: (1) temp view, (2) saved table (unchanged)
Rust: Need a way to share the global catalog across SparkSession instances. Options:
- lazy_static! / OnceLock holding the catalog; all sessions reference it
- SparkSessionBuilder accepts optional SharedCatalog; when using default builder, use global OnceLock
Python: The PyO3 extension lives in one process. All SparkSession instances can share the same global catalog. No extra config needed.
Effort: ~1–2 days. Low risk, backward compatible (global temp views are currently stubs that delegate to temp views).
Option B: Disk-Backed saveAsTable (Warehouse)¶
Goal: saveAsTable(name) can optionally persist to disk so new sessions (or restarted processes) can load the table.
PySpark behavior:
- With Hive metastore, saveAsTable writes to spark.sql.warehouse.dir (or table location)
- spark.table("name") resolves from metastore + warehouse
Implementation:
1. Config: spark.sql.warehouse.dir (default: None = in-memory only, current behavior)
2. When spark.sql.warehouse.dir is set (e.g. "/tmp/robin_warehouse"):
- saveAsTable(name, mode) writes DataFrame to {warehouse}/{name}/ as Parquet
- Mode semantics: overwrite = replace dir; append = read existing + concat + write; error/ignore = current
3. table(name) resolution when not in session catalogs:
- If warehouse configured and {warehouse}/{name}/ exists: read_parquet that path
4. dropTable(name): if warehouse-backed, delete the directory (or just remove from session catalog if we don't track persistence explicitly)
Considerations:
- Schema evolution: append mode needs compatible schemas
- We don't have a "metastore" — just a directory layout. Table = directory.
- Optional feature flag? e.g. --features persistence or always-on when warehouse is set
Effort: ~3–5 days. Medium complexity (IO, path handling, schema on read).
Option C: Both (Recommended)¶
- Phase 1: Implement Option A (global temp views) — quick win, full PySpark parity for global temp
- Phase 2: Implement Option B (disk-backed saveAsTable) — enables true cross-process persistence
API Surface Changes¶
New / Updated Methods¶
| Method | Current | After Option A | After Option B |
|---|---|---|---|
create_or_replace_global_temp_view |
Stub → temp view | Writes to global catalog | Same |
create_global_temp_view |
Stub → temp view | Writes to global catalog | Same |
drop_global_temp_view |
Stub → drop temp view | Drops from global catalog | Same |
table(name) |
temp → saved | + global_temp.x → global catalog |
+ fallback to warehouse |
saveAsTable(name, mode) |
In-memory only | Same | + optional disk when warehouse set |
listTables(dbName) |
Session only | + list global when dbName=global_temp? |
+ warehouse tables? |
Config¶
| Config Key | Default | Description |
|---|---|---|
spark.sql.warehouse.dir |
(unset) | When set, saveAsTable persists to this directory as Parquet |
| (none for global temp) | — | Global temp is always on when sql feature enabled |
SQL¶
FROM global_temp.people→ resolve from global catalogFROM people→ existing resolution (temp view, saved table, then warehouse if Option B)
Backward Compatibility¶
- Option A: Current
create_global_temp_viewis a stub. Real implementation is additive; no breaking changes. - Option B: When
spark.sql.warehouse.diris unset, behavior unchanged (in-memory only).
Open Questions¶
- Rust singleton: Should Rust have a global catalog at process level, or should it be explicit (e.g.
SparkSessionBuilder::with_global_catalog())? PySpark implicitly shares; we could useOnceLockfor zero-config. - Python multi-session: Currently each
get_or_create()creates a new innerSparkSession(with fresh catalogs) but overwrites the default. Do we need true multi-session support, or is "one default session" sufficient for most use cases? - Catalog listing: Should
listTables(dbName="global_temp")return global temp view names? PySpark's catalog supports this. - Warehouse layout: Flat
{warehouse}/{table_name}/or{warehouse}/default/{table_name}/to mirrordefaultdatabase?
Implementation Summary¶
Both options are implemented:
- Option A (Global Temp Views): Process-wide catalog via
OnceLock.createOrReplaceGlobalTempView/createGlobalTempViewwrite to shared catalog.table("global_temp.xyz")resolves from it.listTables(dbName="global_temp")returns global temp view names. SQL supportsFROM global_temp.xyz. - Option B (Warehouse): When
spark.sql.warehouse.diris set,saveAsTablewrites to{warehouse}/{name}/data.parquet.table(name)falls back to warehouse when not in session catalogs. Supports all modes: error, overwrite, append, ignore.