This section covers the core concepts of workload analysis in RisingWave: the key performance metrics of latency and throughput, and the critical role of backpressure in maintaining system stability and efficiency. This knowledge is valuable for both proactive performance tuning and troubleshooting.

Key concepts

Let’s begin with two fundamental metrics that users care about: latency and throughput. In performance optimization, the goal is always to achieve high throughput and low latency.

End-to-end latency vs. processing time

When discussing latency, it’s important to distinguish between end-to-end latency and processing time:

  • End-to-end latency: The total time it takes from when a piece of data is generated by the upstream system to when the corresponding result is available to the downstream consumer. This includes all processing steps, network delays, and any queuing within RisingWave.
  • Processing time: The duration a message spends being actively processed within RisingWave. This is a component of end-to-end latency.

The relationship between these two types of latency is illustrated below:

Figure 1: Relationship between end-to-end latency and processing time.
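
As a minimal illustration of the distinction (this is not RisingWave code; it simply assumes each event carries the timestamp at which the upstream system generated it):

```rust
use std::time::{Duration, Instant};

struct Event {
    generated_at: Instant, // timestamp set by the upstream producer
}

fn main() {
    let event = Event { generated_at: Instant::now() };

    // ... the event may wait in upstream queues and on the network
    //     before RisingWave ever sees it ...

    let processing_start = Instant::now();
    // ... RisingWave operators process the event here ...
    let processing_end = Instant::now();

    let processing_time: Duration = processing_end - processing_start;
    // End-to-end latency additionally includes upstream queuing, network
    // transfer, and any buffering inside RisingWave around the processing step.
    let end_to_end_latency: Duration = processing_end - event.generated_at;

    assert!(end_to_end_latency >= processing_time);
    println!("processing: {processing_time:?}, end-to-end: {end_to_end_latency:?}");
}
```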

Barrier latency

A barrier is a special type of message that RisingWave injects periodically into the data stream of all sources. Barriers flow along with regular data through the entire stream processing graph (all operators and actors). Barrier latency refers to the time it takes for a batch of barriers (one from each source) to travel from the meta node (where they are emitted) to all compute nodes (where they are collected). This metric provides a statistical measure of the current processing time within the cluster. See Monitoring and metrics - Barrier monitoring for details on viewing barrier latency.

Figure 2: Barrier flow in RisingWave.

Barrier latency is particularly important because all operations in RisingWave that involve changes to the computation graph (e.g., DDL statements, scaling operations) are propagated via barriers. This ensures consistency. Therefore, the time taken for these operations is directly related to barrier latency.
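
Conceptually, the latency of a single barrier is the time from its injection at the meta node until the last compute node reports it as collected. The sketch below is purely illustrative (the type and function names are hypothetical, not RisingWave's internals):

```rust
use std::time::{Duration, Instant};

struct Barrier {
    injected_at: Instant, // recorded when the meta node emits the barrier
}

// Barrier latency for one barrier: the time until every compute node has
// collected it, i.e. the latency is governed by the slowest path through the graph.
fn barrier_latency(barrier: &Barrier, collected_at_each_node: &[Instant]) -> Duration {
    collected_at_each_node
        .iter()
        .map(|t| t.duration_since(barrier.injected_at))
        .max()
        .unwrap_or(Duration::ZERO)
}

fn main() {
    let barrier = Barrier { injected_at: Instant::now() };
    // Pretend two compute nodes collected the barrier at slightly different times.
    let collected = vec![Instant::now(), Instant::now()];
    println!("barrier latency: {:?}", barrier_latency(&barrier, &collected));
}
```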

Throughput

Throughput is the number of events (data records) that the system can process in a given amount of time (e.g., rows per second, events per second). For any given RisingWave cluster, the maximum processing throughput is limited by the available resources (CPU, memory, network bandwidth, etc.). See Monitoring and metrics - Data ingestion to check the source throughput.

It’s important to understand that the rate at which RisingWave consumes data from the upstream system doesn’t always represent its processing capacity. If the upstream system generates data faster than RisingWave can process it, the data will start to backlog.

Figure 3: Data backlog due to exceeding processing capacity.

In this scenario, consuming data from the upstream without limit is counterproductive. Excessive queuing consumes resources and can lead to instability. This is where backpressure becomes essential.

Understanding backpressure

How backpressure works in RisingWave

Backpressure occurs when RisingWave’s maximum processing throughput cannot keep up with the upstream system’s data generation rate. In such cases, RisingWave automatically reduces its consumption of upstream data to match its processing capabilities. This prevents data from accumulating excessively within RisingWave, avoiding wasted resources and potential out-of-memory (OOM) errors. Backpressure also helps maintain low and stable processing time and barrier latency.

Figure 4: Backpressure mechanism in RisingWave.

For example, if a streaming pipeline has a maximum processing capacity of 200,000 records per second, and the upstream system generates 1 million records in a single second (due to a traffic spike), an ideal backpressure mechanism would:

  1. Continue to consume data at the maximum rate (200,000 records/second).
  2. Allow the remaining 800,000 records to backlog in the upstream system.
  3. Consume and process the backlogged data later, after the traffic spike subsides.

This might increase the end-to-end latency temporarily (due to the backlog), but it’s the optimal approach given limited resources.
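
The arithmetic behind this example is worth spelling out; the small program below simply restates the numbers from the text:

```rust
fn main() {
    let capacity_per_sec: u64 = 200_000; // maximum processing throughput of the pipeline
    let spike: u64 = 1_000_000;          // records produced by the upstream in one second

    // With ideal backpressure, RisingWave keeps consuming at full capacity and
    // leaves the remainder queued in the upstream system.
    let backlog = spike - capacity_per_sec; // 800,000 records stay upstream

    // Draining that backlog takes backlog / capacity extra seconds, assuming the
    // upstream drops back to (or below) the processing capacity after the spike.
    let drain_seconds = backlog as f64 / capacity_per_sec as f64; // 4.0 seconds

    println!("backlog: {backlog} records, extra end-to-end latency: ~{drain_seconds} s");
}
```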

Backpressure propagation between actors

RisingWave’s parallel execution model splits a streaming job into multiple fragments. Each fragment is horizontally partitioned into multiple actors, enabling parallel computation and scalability.

Figure 5: Parallel execution model in RisingWave.

Upstream and downstream actors are connected by bounded channels that buffer data. Each actor continuously:

  1. Consumes data from its upstream channel.
  2. Processes the data.
  3. Sends the results to the downstream actor’s channel.

If a downstream actor cannot consume data as quickly as the upstream actor produces it, the channel will fill up. When the channel is full, the upstream actor will pause its consumption and processing, waiting for the downstream actor to catch up. This creates backpressure.

Figure 6: Backpressure propagation between actors.

This mechanism ensures that backpressure propagates upstream from a slow downstream actor, ultimately reaching the source operator, which then reduces its consumption from the external system.
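
The following standalone Rust sketch mimics this behavior with a bounded channel from the standard library. It is not RisingWave's executor code, but it shows how a full channel stalls the producer and thereby creates backpressure:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn main() {
    // A bounded channel: the sender blocks once 4 messages are buffered.
    let (tx, rx) = sync_channel::<u64>(4);

    // "Upstream actor": produces as fast as it can.
    let producer = thread::spawn(move || {
        for i in 0..16u64 {
            // This call blocks whenever the buffer is full -- that is backpressure.
            tx.send(i).unwrap();
            println!("produced {i}");
        }
    });

    // "Downstream actor": a deliberately slow consumer.
    for msg in rx {
        thread::sleep(Duration::from_millis(50)); // simulate expensive processing
        println!("consumed {msg}");
    }

    producer.join().unwrap();
}
```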

Backpressure details in the source executor

Barriers, as mentioned earlier, are periodically emitted by the meta node and sent to a dedicated barrier channel for each source executor. The source executor consumes both barriers (from the barrier channel) and data (from the external system), merging them into a single stream.

The source executor always prioritizes the consumption of barriers. It only consumes data from the external system when the barrier channel is empty.

Figure 7: Barrier and data stream merging in the source executor.
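
A simplified sketch of this merging logic is shown below. The types and function names are illustrative rather than the actual source executor API; the key point is that data is fetched only when the barrier channel is empty:

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};

enum Message {
    Barrier(u64),  // a barrier identified by its epoch
    Data(String),  // a regular record from the external system
}

/// Pull the next message, always draining the barrier channel before fetching data.
fn next_message(barrier_rx: &Receiver<u64>, fetch_data: &mut dyn FnMut() -> String) -> Message {
    match barrier_rx.try_recv() {
        Ok(epoch) => Message::Barrier(epoch),
        // Only when no barrier is pending do we pull data from the external system.
        Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => Message::Data(fetch_data()),
    }
}

fn main() {
    let (barrier_tx, barrier_rx) = channel();
    barrier_tx.send(1u64).unwrap();

    let mut fetch = || "some upstream row".to_string();
    // The pending barrier is returned before any data is fetched.
    match next_message(&barrier_rx, &mut fetch) {
        Message::Barrier(epoch) => println!("barrier at epoch {epoch}"),
        Message::Data(row) => println!("data: {row}"),
    }
}
```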

Symptoms of backpressure

  • High barrier latency: As discussed earlier, high barrier latency is a key indicator of backpressure and slow stream processing.
  • Sawtooth-like barrier metrics: See the “Backpressure’s challenges” section below for a detailed explanation of this phenomenon.
  • High blocking time ratio between actors: An upstream actor that spends a large share of its time blocked on a full output channel (see the “Actor Output Blocking Time Ratio” metric below) is a direct sign of backpressure.

Identifying bottlenecks

To identify the specific location of a bottleneck causing backpressure:

  1. Use Grafana: Navigate to the “Streaming - Backpressure” panel in the Grafana dashboard.
  2. Identify High-Backpressure Channels: Look for channels with high “Actor Output Blocking Time Ratio.”
  3. Find the Bottleneck Fragment: Because backpressure propagates upstream, every channel between the source and the bottleneck may show high backpressure at once. The root cause is the receiver of the most downstream high-backpressure channel: that fragment cannot keep up, while channels further downstream remain unblocked.
  4. Correlate with SQL: Use the RisingWave Dashboard’s “Fragment” panel and the EXPLAIN CREATE MATERIALIZED VIEW ... command to identify the corresponding part of your SQL query.

Backpressure’s challenges

While backpressure is essential for stability, certain factors can make it less effective or lead to undesirable behavior:

The impact of sluggish backpressure on barriers (sawtooth metrics)

Consider a simplified scenario:

  • No parallelism (single source executor).
  • Maximum pipeline throughput: 10,000 rows/second.
  • Source executor consumes 100,000 rows before experiencing backpressure.
  • Barrier interval: 1 second.

In this case, the downstream will take 10 seconds (100,000 rows / 10,000 rows/second) to process the data. During these 10 seconds, the source executor is backpressured and won’t consume data or barriers. The 10 barriers issued by the meta node will accumulate in the barrier channel.

Once the downstream finishes processing, the source executor will prioritize the 10 backlogged barriers (which are processed very quickly). Then, it will consume another 100,000 rows from the upstream, and the cycle repeats.

This creates a “sawtooth” pattern in the barrier number metric (the count of uncollected barriers):

Figure 8: Sawtooth pattern in barrier number.

The barrier latency will also fluctuate. The first barrier in the channel might wait for 10 seconds, the next for 9 seconds, and so on. However, metrics like p90 or pmax barrier latency will tend to hover around the maximum value (10 seconds in this example).

Figure 9: Barrier latency during backpressure.

This “sawtooth” pattern is often observed in real-world scenarios when backpressure is not applied quickly enough.
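
A toy simulation of the scenario above (10,000 rows/second of downstream throughput, 100,000 rows buffered before backpressure, one barrier per second) reproduces this sawtooth in the count of uncollected barriers. It is a deliberately crude model of the cycle, not RisingWave code:

```rust
fn main() {
    let throughput = 10_000u64;   // rows the downstream can process per second
    let buffer_rows = 100_000u64; // rows the source ingests before backpressure kicks in
    let mut pending_rows = 0u64;  // rows still queued inside the pipeline
    let mut uncollected_barriers = 0u64;

    for second in 0..30 {
        // The meta node injects one barrier every second.
        uncollected_barriers += 1;

        if pending_rows == 0 {
            // The source is not backpressured: the prioritized barrier channel is
            // drained almost instantly, then a full buffer of data is ingested.
            uncollected_barriers = 0;
            pending_rows = buffer_rows;
        }

        // The downstream drains buffered rows at its maximum throughput.
        pending_rows = pending_rows.saturating_sub(throughput);

        println!("t={second:>2}s  uncollected barriers: {uncollected_barriers}");
    }
}
```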

Impact of buffering and solution: limit concurrent barriers in buffer

Buffers between upstream and downstream actors are necessary to smooth out network latency and prevent pipeline stalls. However, excessively large buffers can exacerbate the problems described above.

If a downstream channel can buffer 100,000 records, the source might send 100,000 records before experiencing backpressure. In practice, the total buffer capacity across the entire pipeline can be much larger (the per-channel buffer size multiplied by the parallelism and the number of pipeline layers). This can lead to a situation resembling “batch processing,” with the system alternating between:

  1. Ingesting data at the maximum rate to fill the buffer.
  2. Entering backpressure and processing the buffered data.

This results in high barrier latency and processing time spikes.

To address this, RisingWave limits the number of concurrent barrier messages within the same channel. This makes the backpressure mechanism more responsive. Even if the data buffer isn’t full, the presence of a barrier will trigger backpressure on the upstream.
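
The sketch below illustrates the idea of such a limit. The channel type, field names, and the limit of one in-flight barrier are hypothetical; the point is that a pending barrier alone is enough to reject further sends, even with room left in the data buffer:

```rust
use std::collections::VecDeque;

enum Message {
    Barrier(u64),  // barrier carrying its epoch
    Data(Vec<u8>), // an encoded data chunk
}

struct ExchangeChannel {
    buffer: VecDeque<Message>,
    capacity: usize,               // maximum buffered messages
    max_in_flight_barriers: usize, // e.g. 1: at most one uncollected barrier per channel
}

impl ExchangeChannel {
    /// Returns Err(msg) when the upstream must wait: that is the backpressure signal.
    fn try_send(&mut self, msg: Message) -> Result<(), Message> {
        let barriers_in_buffer = self
            .buffer
            .iter()
            .filter(|m| matches!(m, Message::Barrier(_)))
            .count();

        // Backpressure if the buffer is full OR a barrier is already waiting in it,
        // even though there may still be room for more data.
        if self.buffer.len() >= self.capacity
            || barriers_in_buffer >= self.max_in_flight_barriers
        {
            return Err(msg);
        }
        self.buffer.push_back(msg);
        Ok(())
    }
}

fn main() {
    let mut channel = ExchangeChannel {
        buffer: VecDeque::new(),
        capacity: 1024,
        max_in_flight_barriers: 1,
    };

    assert!(channel.try_send(Message::Barrier(42)).is_ok());
    // The buffer is nearly empty, yet further sends are rejected until the
    // downstream actor has consumed the pending barrier.
    assert!(channel.try_send(Message::Data(vec![1, 2, 3])).is_err());
    println!("a single pending barrier is enough to backpressure the upstream");
}
```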

Uncertain consumption per record

Another challenge is that the resource consumption (CPU, memory, I/O) for processing each record can vary significantly. This variability is influenced by factors like:

  • Data amplification: Operators like joins can produce multiple output rows for a single input row. The degree of amplification depends on the data itself.
  • Cache thrashing: The performance of stateful operators relies heavily on caching. If data locality is poor (records don’t share common state), cache misses increase, leading to more expensive storage access.

This variability makes it difficult to predict the resources needed for subsequent records, reducing the effectiveness of simple backpressure mechanisms based solely on the number of records. A small number of records with very high processing costs can still cause significant delays.
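
As a contrived illustration, the two batches below contain the same number of records yet represent vastly different amounts of work once join amplification is considered, which is why record counts alone are a weak backpressure signal:

```rust
// Two batches with the same record count but very different processing cost.
fn pending_output_rows(amplification_per_record: &[u64]) -> u64 {
    // Each input record may expand into a different number of output rows,
    // e.g. when it joins against a hot key.
    amplification_per_record.iter().sum()
}

fn main() {
    let calm_batch = vec![1u64; 1_000]; // 1,000 records, roughly one output row each

    let mut skewed_batch = vec![1u64; 999]; // the same record count...
    skewed_batch.push(500_000);             // ...but one record matches 500,000 rows in a join

    println!("calm batch:   {} output rows", pending_output_rows(&calm_batch));
    println!("skewed batch: {} output rows", pending_output_rows(&skewed_batch));
    // A backpressure signal that only counts "1,000 buffered records" cannot
    // distinguish these two situations.
}
```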