This report details the performance benchmarks of the RisingWave embedded Iceberg compaction engine compared to a standard Apache Spark setup. The tests evaluate execution time, resource efficiency, and stability under high-complexity delete scenarios.

Test environment

Both engines were executed on identical hardware to ensure a fair comparison.

Hardware specifications

  • Cloud provider: AWS
  • Instance type: m5.4xlarge
  • vCPU: 16 Cores
  • Memory: 64 GB
  • Storage: EBS (General Purpose SSD)

Software configuration

Apache Spark
The Spark job was tuned for the instance size to maximize resource utilization without causing immediate out-of-memory errors on startup; an illustrative session setup is sketched after this configuration section.
  • spark.executor.memory = 40g
  • spark.executor.memoryOverhead = 8g
  • spark.driver.memory = 12g
  • spark.memory.fraction = 0.8
  • spark.memory.storageFraction = 0.2
  • spark.sql.shuffle.partitions = 1000
RisingWave
  • Compaction mode: Embedded (Rust/DataFusion)
  • Parallelism: Default configuration (Execution Parallelism: 64, Output Parallelism: 64)
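For reference, the Spark settings above correspond roughly to the following PySpark session driving Iceberg's built-in rewrite_data_files maintenance procedure. This is a minimal sketch rather than the benchmark harness itself: the catalog and table names (my_catalog, db.events) are illustrative placeholders, and the Iceberg catalog wiring is omitted.

```python
from pyspark.sql import SparkSession

# Session tuned with the settings listed above; the Iceberg catalog wiring
# (extensions, catalog implementation, warehouse) is deployment-specific and omitted.
spark = (
    SparkSession.builder
    .appName("iceberg-binpack-benchmark")
    .config("spark.executor.memory", "40g")
    .config("spark.executor.memoryOverhead", "8g")
    .config("spark.driver.memory", "12g")
    .config("spark.memory.fraction", "0.8")
    .config("spark.memory.storageFraction", "0.2")
    .config("spark.sql.shuffle.partitions", "1000")
    .getOrCreate()
)

# Iceberg's rewrite_data_files procedure performs the bin-packing compaction;
# 536870912 bytes corresponds to the 512 MB target file size used in Scenario 1.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```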

Scenario 1: Bin-packing (small file compaction)

This scenario tests the “Small File” problem. The objective is to merge thousands of small, fragmented files into a few large, optimized files. No delete files are involved in this test.
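Conceptually, bin-packing groups many small files into write tasks whose combined size approaches the target output size. The sketch below is a deliberately simplified greedy planner with a hypothetical plan_bins helper, not the algorithm used by either engine:

```python
# Simplified greedy bin-packing sketch: group small files into "bins" whose
# combined size approaches the target output size. Illustrative only; neither
# engine's actual planner is this simple.
TARGET_BYTES = 512 * 1024 * 1024  # 512 MB target file size

def plan_bins(file_sizes: list[int]) -> list[list[int]]:
    bins: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sorted(file_sizes, reverse=True):  # place large fragments first
        if current and current_size + size > TARGET_BYTES:
            bins.append(current)            # current bin is full; start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# ~17,000 files of ~11 MB each pack into bins of roughly 46 files apiece.
print(len(plan_bins([11 * 1024 * 1024] * 17_358)))
```

Real planners additionally respect partition boundaries, and the final output file count depends on on-disk compression and encoding rather than raw data volume alone.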

Dataset parameters

  • Total data volume: ~193 GB
  • File count (input): ~17,358 files
  • Average file size: ~11 MB
  • Target file size: 512 MB

Results

Metric                   | RisingWave | Apache Spark | Difference
Duration (Uncompressed)  | 277 sec    | 1,533 sec    | ~5.5x Faster
Duration (ZSTD Level 5)  | 369 sec    | 1,923 sec    | ~5.2x Faster
Input Files              | 17,358     | 17,358       | -
Output Files             | 215        | 215          | -

RisingWave demonstrated a consistent ~5x speedup across both uncompressed and ZSTD-compressed datasets, indicating that the performance differential is driven by framework overhead (JVM startup, task scheduling) rather than I/O or compression bottlenecks.

Scenario 2: High-complexity compaction (deletes)

This scenario tests Copy-on-Write (CoW) capabilities. The engine must read data files, load “delete files” (equality and position deletes) into memory, filter out deleted rows, and rewrite the data. This workload is memory-intensive due to the metadata overhead required for equality lookups.
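The sketch below is purely conceptual: the row and field names such as _file and _pos are hypothetical, and real readers operate on Parquet/Arrow batches rather than Python dicts. It illustrates why equality deletes dominate memory usage: the full set of equality-delete keys must stay resident while the data files are streamed and rewritten.

```python
# Conceptual copy-on-write rewrite with delete files applied. Illustrative
# only: real Iceberg readers work on Parquet/Arrow batches, not Python dicts,
# and the _file/_pos fields here are hypothetical row metadata.

def rewrite_with_deletes(data_rows, position_deletes, equality_deletes, key_cols):
    # Position deletes identify rows by (data file path, row position).
    deleted_positions = {(d["file_path"], d["pos"]) for d in position_deletes}
    # Equality deletes identify rows by the values of one or more key columns;
    # this whole key set must be held in memory for the duration of the scan.
    deleted_keys = {tuple(d[c] for c in key_cols) for d in equality_deletes}

    for row in data_rows:
        if (row["_file"], row["_pos"]) in deleted_positions:
            continue  # removed by a position delete
        if tuple(row[c] for c in key_cols) in deleted_keys:
            continue  # removed by an equality delete
        yield row     # survivor: written to the new, compacted data file
```

Position deletes can be pruned down to the data files they reference, whereas the equality-delete key set in this scheme grows with the total volume of equality deletes, which matches the "Massive Metadata" results below.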

Dataset parameters

  • Data files: 20,000
  • Position delete files: 20,000
  • Equality delete files: 20,000
  • Total input files: 60,000

Results

Workload Description                                      | RisingWave Result       | Apache Spark Result
Standard Entropy (10k equality + 10k position deletes)    | SUCCESS (time: 518 s)   | FAILED (Out of Memory, OOM)
High Entropy (20k equality + 20k position deletes)        | SUCCESS (time: 490 s)   | FAILED (Out of Memory, OOM)
Massive Metadata (high-volume equality deletes, ~20 GB)   | FAILED (OOM)            | FAILED (Out of Memory, OOM)

Observation: On the tested m5.4xlarge instance (64GB RAM), the Apache Spark job failed to complete the delete-heavy workloads, terminating repeatedly due to memory exhaustion. RisingWave successfully completed the High Entropy workload in 490 seconds, demonstrating higher memory efficiency for complex metadata operations on single-node architectures.

Resource utilization analysis

During the execution of the Bin-packing (uncompressed) test, we monitored the resource usage of the RisingWave compaction worker.
  • CPU utilization: The engine made heavy, sustained use of the available compute, averaging ~10 of the 16 available cores during the merge phase.
  • Memory footprint: Memory usage remained stable around ~22 GB (approx. 35% of system RAM), leaving ample headroom for OS operations and preventing OOM kills.

Conclusion

The benchmarks indicate that RisingWave’s embedded compaction engine significantly outperforms a standard single-node Spark deployment for Iceberg maintenance tasks:
  1. Speed: Achieved a ~5.5x speedup on uncompressed bin-packing and ~5.2x with ZSTD compression.
  2. Efficiency: Eliminated the heavy startup and coordination overhead associated with distributed JVM frameworks.
  3. Stability: Successfully handled complex Delete/CoW workloads that caused OOM failures on Spark within the same hardware constraints.