Understand and monitor key performance indicators (KPIs) in RisingWave using Grafana dashboards.
Performance metrics provide crucial insights into the health and efficiency of your RisingWave deployment. Monitoring these metrics allows you to identify bottlenecks, optimize resource allocation, and ensure smooth operation. This section describes key metrics available on RisingWave’s Grafana dashboard and explains how to interpret them.
Alternative: RisingWave Console
You can also monitor your RisingWave cluster using the RisingWave Console, which provides built-in monitoring capabilities and diagnostic collection for comprehensive cluster observability.
The details of the dashboard can be found here. In particular, there are two files:
Monitoring resource utilization helps you understand how efficiently your cluster is using its CPU, memory, and other resources. High utilization can indicate bottlenecks, while low utilization might suggest opportunities for cost savings.
Among all of the components, we primarily focus on the CPU usage of compute nodes and compactor nodes. In the setting of the figure above, we have allocated 8 CPUs for the compute node and 8 CPUs for the compactor node.
Takeaway
A simplified description of RisingWave’s memory control mechanism:
In the figure above, we have allocated 12 GB of memory to the compute node. The actual memory usage keeps fluctuating around 8.64 GB (over 90% of usable memory), which suggests that eviction is triggered constantly as RisingWave tries to use more memory.
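As a rough illustration of how to read this panel, the sketch below compares observed usage against usable memory. The 20% reservation used here is purely illustrative and not RisingWave's exact reservation policy; substitute the values from your own deployment.

```python
# Sketch: how close is memory usage to the eviction threshold?
# The reserved fraction below is an illustrative assumption, not the exact
# amount RisingWave reserves.
total_memory_gb = 12.0
reserved_fraction = 0.2                       # assumed, for illustration only
usable_memory_gb = total_memory_gb * (1 - reserved_fraction)

observed_usage_gb = 8.64                      # value read from the Grafana panel

usage_ratio = observed_usage_gb / usable_memory_gb
print(f"Usage is {usage_ratio:.0%} of usable memory "
      f"(sustained usage near or above 90% suggests constant eviction)")
```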
Takeaway
Operators such as join and aggregation are stateful. They maintain intermediate states in their operator cache to facilitate incremental computation. Efficient caching is crucial for minimizing latency and maximizing throughput.
For example, the following are the cache miss ratio metrics of the join operator, shown at the actor level. Each operator is parallelized across multiple actors, whose number equals the streaming_parallelism setting. By default, the parallelism is the same as the number of CPUs on compute nodes.
The total lookups metric denotes how many lookups a join operator performs per second, and the cache miss metric denotes how many times a key does not exist in memory and RisingWave has to fetch it from the storage engine.
In the case above, the cache miss rate is 707 / 10.8K ≈ 6%, which is quite low. Increasing memory is unlikely to help much.
Below is the same metric but for the aggregation operator.
The cache miss rate of actor 25 is 658 / 2.45K ≈ 27%, which is relatively high. It indicates that increasing memory is likely to improve performance.
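The arithmetic behind both readings is the same; a minimal sketch using the figures quoted above follows. The 10% cut-off used to decide whether more memory is worthwhile is an illustrative rule of thumb, not an official RisingWave threshold.

```python
# Sketch: compute operator-cache miss ratios from the Grafana readings above.

def miss_ratio(cache_misses_per_sec: float, total_lookups_per_sec: float) -> float:
    """Fraction of lookups that miss the operator cache."""
    return cache_misses_per_sec / total_lookups_per_sec

# Join operator (actor-level readings quoted above): 707 misses vs. 10.8K lookups.
join_ratio = miss_ratio(707, 10_800)           # ~6%

# Aggregation operator, actor 25: 658 misses vs. 2.45K lookups.
agg_ratio = miss_ratio(658, 2_450)             # ~27%

for name, ratio in [("join", join_ratio), ("aggregation (actor 25)", agg_ratio)]:
    # 10% is an illustrative cut-off, not an official threshold.
    verdict = "more memory likely helps" if ratio > 0.10 else "more memory unlikely to help much"
    print(f"{name}: miss ratio {ratio:.1%} -> {verdict}")
```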
Besides the operator cache, the storage engine on each compute node, named Hummock, maintains a block (data) cache and a meta cache. Unlike the operator cache, the block (data) cache stores data in its binary/serialized format, and all operators share the same block cache. The meta cache stores the metadata Hummock needs to locate the data files it reads from S3.
We also track the cache miss ratio of these two caches:
We calculate the cache miss rate of the block (data) cache to be 9.52 / 401 ≈ 2%, and the cache miss rate of the meta cache to be 0.203 / 90.2K ≈ 0.0002%.
We notice that the number of total lookups to the meta cache is much higher than the number of total lookups to the data cache. This is because every lookup into the storage goes through the meta cache, but it does not necessarily access the data cache or remote object storage every time. The meta cache has a bloom filter to check whether the data actually exists, which reduces the number of remote fetches.
This means that, because the total number of meta cache lookups is so large, even a small miss ratio translates into a large absolute number of misses, and therefore significant performance overhead.
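A quick back-of-the-envelope comparison makes this concrete. The lookup rates below are the readings quoted above; the 1% miss ratio is hypothetical, used only to show how the same ratio costs far more on the meta cache.

```python
# Sketch: why a small meta-cache miss ratio still matters. Because the meta
# cache serves far more lookups per second than the block cache, the same
# miss ratio translates into many more remote fetches.

block_cache_lookups_per_sec = 401       # reading quoted above
meta_cache_lookups_per_sec = 90_200     # reading quoted above ("90.2K")

hypothetical_miss_ratio = 0.01          # 1%, for illustration only

for name, lookups in [("block cache", block_cache_lookups_per_sec),
                      ("meta cache", meta_cache_lookups_per_sec)]:
    misses_per_sec = lookups * hypothetical_miss_ratio
    print(f"{name}: a {hypothetical_miss_ratio:.0%} miss ratio "
          f"is roughly {misses_per_sec:.0f} remote fetches per second")

# block cache: ~4 fetches/s; meta cache: ~900 fetches/s -- the same ratio is
# far more costly for the meta cache.
```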
Takeaway
Compaction is a background process that merges and reorganizes data in RisingWave’s storage engine (Hummock) to improve query performance and reclaim storage space.
As described in the CPU usage section, we can estimate the ideal CPU resources allocated for compactors by considering the LSM Tree Compact Pending Bytes.
This metric illustrates the amount of pending workload from the compactor's perspective. Because the compactor's workload is bursty, we consider a change necessary only if the pending bytes have remained above a certain threshold for more than 10 minutes.
Takeaway
Since the total pending bytes keep changing, we first calculate the average over a time period of more than 10 minutes. As a general rule of thumb, we then divide the average by 4 GB to estimate the ideal number of CPUs. If compaction is consistently lagging, it can lead to performance degradation. See Troubleshooting - Specific Bottlenecks - Compaction Bottleneck for more details.
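For example, a quick estimate under the rule of thumb above might look like the sketch below. The sampled pending-bytes values are illustrative, not taken from a real deployment.

```python
# Sketch: estimate the ideal number of compactor CPUs from the
# "LSM Tree Compact Pending Bytes" panel, by averaging over a window of
# more than 10 minutes and dividing by 4 GB per CPU (rule of thumb above).

samples_gb = [30, 42, 55, 38, 47, 51]   # pending bytes sampled over >10 minutes, in GB

average_pending_gb = sum(samples_gb) / len(samples_gb)
ideal_compactor_cpus = average_pending_gb / 4    # 4 GB of pending work per CPU

print(f"Average pending: {average_pending_gb:.1f} GB "
      f"-> roughly {ideal_compactor_cpus:.0f} compactor CPUs")
```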
Barriers are a fundamental mechanism for synchronization and consistency in RisingWave. Monitoring barrier metrics is crucial for identifying potential performance issues.
RisingWave by default generates a barrier every second and ingests it into the source operators (i.e., the operators that read data from upstream) alongside regular input data. As the barrier flows through each operator, it serves multiple purposes, e.g., triggering the computation of the delta between the current barrier and the last barrier, flushing new states into the storage engine, and determining the completion of a checkpoint.
In a perfect world, barrier latency should stay at 1 second. In reality, we generally observe two phenomena:
Takeaway
We typically check these metrics first when we log into Grafana to diagnose performance issues or even bugs. Once we run into phenomenon (1), we investigate further to determine which resources need to be increased. Consistently high barrier latency is a strong indicator of performance problems.
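As a rough illustration of what "consistently high" means in practice, the sketch below flags barrier latency that stays above the 1-second target for a sustained stretch. The sample series and window length are illustrative; in a real deployment you would read the barrier latency panel in Grafana instead.

```python
# Sketch: flag barrier latency that stays above the 1-second target for a
# sustained stretch of consecutive samples.

barrier_latency_secs = [1.0, 1.1, 0.9, 3.2, 4.8, 5.5, 6.1, 7.0, 6.4, 5.9]  # one value per scrape
target_secs = 1.0
sustained_window = 5    # consecutive samples that must exceed the target

consecutive = 0
for latency in barrier_latency_secs:
    consecutive = consecutive + 1 if latency > target_secs else 0
    if consecutive >= sustained_window:
        print("Barrier latency has been consistently high -- investigate which "
              "resource (CPU, memory, compaction, upstream) is the bottleneck.")
        break
```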
Monitoring data ingestion metrics helps you understand the rate at which RisingWave is receiving data from upstream sources.
Particularly for stateless queries (e.g., simple ETL queries that transform data but do not involve stateful computation), we often find that RisingWave is bottlenecked by the limitations of its upstream system.
For example, RisingWave may ingest data from an upstream message queue. If either the disk bandwidth of the message queue or the network bandwidth between RisingWave and the message queue is too low, the source throughput may not fully leverage RisingWave's resources.
Takeaway
We suggest that users also monitor the CPU utilization, disk I/O, and network I/O of RisingWave's upstream systems, e.g., message queues or databases, to determine the end-to-end bottleneck. If source throughput is significantly lower than expected, investigate the upstream system for potential issues.
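One rough way to sanity-check this is to compare the observed source throughput against the upstream system's available bandwidth, as sketched below. All figures here are illustrative assumptions; substitute the readings from your own source-throughput panel and your upstream's actual limits.

```python
# Sketch: a rough check for whether source throughput is capped by the
# upstream system rather than by RisingWave. All figures are illustrative.

observed_source_throughput_mb_s = 118      # from the source-throughput panel
upstream_network_bandwidth_mb_s = 125      # e.g., the message queue's NIC or broker limit

utilization = observed_source_throughput_mb_s / upstream_network_bandwidth_mb_s
print(f"Upstream bandwidth utilization: {utilization:.0%}")

# If ingestion already runs close to the upstream's bandwidth while
# RisingWave's CPU sits well below capacity, the bottleneck is likely upstream.
if utilization > 0.9:
    print("Source throughput appears limited by the upstream system.")
```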
For any other questions or tips regarding performance tuning, feel free to join our Slack community and become part of our growing network of users. Engage in discussions, seek assistance, and share your experiences with fellow users and our engineers who are eager to provide insights and solutions.