Overview

When you create a materialized view (MV) on top of existing tables or sources, RisingWave must first compute the initial state by processing all existing data. This process is called backfill. After backfill completes, the MV transitions to incremental processing, where it continuously updates as new data arrives.
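As a concrete illustration (the table and column names here are hypothetical), creating an MV such as the following triggers a backfill over every existing row before incremental processing takes over:

```sql
-- Hypothetical base table; assume it already holds historical rows.
CREATE TABLE orders (order_id INT, amount DECIMAL, created_at TIMESTAMP);

-- Creating the MV triggers backfill: RisingWave first aggregates all
-- existing rows in orders, then keeps the result updated incrementally
-- as new rows arrive.
CREATE MATERIALIZED VIEW order_totals AS
SELECT COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM orders;
```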

Backfill strategies

RisingWave automatically selects a backfill strategy based on the MV definition and upstream data sources:

Snapshot backfill

Used when creating MVs on tables with a consistent snapshot. The executor:
  1. Takes a snapshot of the upstream table at a specific epoch
  2. Scans the snapshot data in batches
  3. Applies the query transformation
  4. Writes results to the MV state
  5. Transitions to incremental mode once the snapshot is fully processed

Arrangement backfill

Used for complex queries requiring joins or aggregations during backfill. This strategy:
  • Maintains intermediate state during backfill
  • Handles updates to upstream tables during backfill
  • Ensures consistency through barrier alignment

No-shuffle backfill

An optimized strategy that avoids data redistribution when upstream and downstream parallelism match and the data distributions are compatible.

Monitoring backfill progress

Backfill progress is tracked per fragment and reported to the meta service. You can monitor progress through:
  • RisingWave Dashboard: View fragment-level progress
  • System tables: Query rw_catalog.rw_ddl_progress (if available)
  • Logs: Search for “backfill” in compute node logs
Progress is reported as a percentage of rows processed relative to the total snapshot size.
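For example, an in-progress backfill can be observed with a query like the following (column names may vary slightly by RisingWave version):

```sql
-- Each in-progress CREATE MATERIALIZED VIEW appears as one row; the
-- progress column reports how much of the snapshot has been processed.
SELECT ddl_id, ddl_statement, progress
FROM rw_catalog.rw_ddl_progress;
```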

Performance considerations

Resource allocation

Backfill can be resource-intensive, especially for large upstream tables:
  • CPU: Backfill executors compete with regular streaming executors for CPU
  • Memory: Snapshot data and intermediate state consume memory
  • Storage I/O: Reading snapshot data and writing MV state generates I/O load

Concurrent backfills

The max_concurrent_creating_streaming_jobs system parameter (default: 1) limits how many backfills can run simultaneously. This prevents resource exhaustion when creating multiple MVs at once.
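If you deliberately want several MVs to backfill in parallel and have the resources to spare, the limit can be raised as a system parameter (a sketch; verify the parameter against your RisingWave version before applying):

```sql
-- Allow two backfills to run at once instead of the default of one.
ALTER SYSTEM SET max_concurrent_creating_streaming_jobs = 2;
```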

Backfill and recovery

If a compute node fails during backfill, RisingWave resumes the backfill from the last completed checkpoint rather than starting over. This is possible because:
  • Backfill progress is persisted in the meta store
  • Barriers coordinate progress across all fragments
  • State is checkpointed regularly during backfill
New to barriers/checkpoints? A barrier is a periodic sync marker; a checkpoint is a global consistent snapshot created from barriers. By default, RisingWave generates one barrier every 1 second (barrier_interval_ms = 1000). See Data persistence.
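To inspect or tune the barrier interval, something like the following can be used (a sketch; `SHOW PARAMETERS` and the `barrier_interval_ms` parameter should be confirmed against your deployment):

```sql
-- List current system parameters, including barrier_interval_ms.
SHOW PARAMETERS;

-- Widen the barrier interval to 2 seconds, trading checkpoint
-- freshness for lower coordination overhead.
ALTER SYSTEM SET barrier_interval_ms = 2000;
```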

Best practices

  • Create MVs during low-traffic periods to minimize resource contention
  • Monitor barrier latency during backfill; high latency may indicate resource constraints
  • For very large tables, consider creating the MV incrementally using filtered views
  • Use background DDL (SET BACKGROUND_DDL = true) for non-blocking MV creation
  • For large backfills that must not affect existing streaming jobs, enable serverless backfill
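Putting the background-DDL practice into a minimal example (the MV, table, and column names are hypothetical):

```sql
-- CREATE MATERIALIZED VIEW returns immediately; the backfill continues
-- in the background and can be watched via rw_catalog.rw_ddl_progress.
SET BACKGROUND_DDL = true;

CREATE MATERIALIZED VIEW user_event_counts AS
SELECT user_id, COUNT(*) AS event_count
FROM events
GROUP BY user_id;
```

This keeps the client session free while a long backfill runs, which is especially useful when combined with the concurrency limit above.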