Understand and monitor key performance indicators (KPIs) in RisingWave using Grafana dashboards.
The parallelism of streaming jobs can be configured through the session variable `streaming_parallelism`. By default, the parallelism is the same as the number of CPUs on the compute nodes.
The `total lookups` metric denotes how many lookups a join operator performs per second, and the `cache miss` metric denotes how many times per second a key is not found in memory, so RisingWave has to fetch it from the storage engine.
In the case above, the cache miss rate is 707/10.8K ≈ 6.5%, which is quite low. Increasing memory is unlikely to help much here.
Below is the same metric for the aggregation operator. Here the cache miss rate is 658/2.45K ≈ 27%, which is relatively high. It indicates that increasing memory is likely to improve performance.
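In both cases, the cache miss rate is simply misses per second divided by total lookups per second. A minimal sketch of the arithmetic, using the raw numbers read off the dashboard panels above:

```python
def cache_miss_rate(misses_per_sec: float, lookups_per_sec: float) -> float:
    """Return the cache miss rate as a percentage."""
    return misses_per_sec / lookups_per_sec * 100

# Join operator: 707 misses/s out of 10.8K lookups/s -> roughly 6.5%
join_rate = cache_miss_rate(707, 10_800)

# Aggregation operator: 658 misses/s out of 2.45K lookups/s -> roughly 27%
agg_rate = cache_miss_rate(658, 2_450)

print(f"join: {join_rate:.1f}%, agg: {agg_rate:.1f}%")
```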
Besides the operator caches, the storage engine on each compute node, named Hummock, maintains two caches of its own: the block (data) cache and the meta cache. The block cache stores data; unlike the operator cache, it keeps everything in binary/serialized format, and all operators share the same block cache. The meta cache stores metadata: Hummock requires metadata to locate the data files it needs to read from S3.
We also track the cache miss rates of these two caches. In this example, the cache miss rate of the block cache is 9.52/401 ≈ 2%, and the cache miss rate of the meta cache is 0.203/90.2K ≈ 0.0002%.
We notice that the number of total lookups to the meta cache is much higher than the number of total lookups to the data cache. This is because every lookup into the storage requires going through the meta cache, but it does not necessarily access the data cache or remote object storage every time. The meta cache has a bloom filter to check whether the data actually exists, reducing the number of remote fetches.
This implies that even a small cache miss ratio in the meta cache can induce significant performance overhead, because the total number of lookups is so large that it translates into many absolute misses.
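To see why, compare the absolute miss counts that the same miss ratio produces on each cache, using the lookup rates from the example above. The 1% figure is a hypothetical chosen for illustration, not a measurement:

```python
data_cache_lookups = 401      # lookups/s, from the example above
meta_cache_lookups = 90_200   # lookups/s, from the example above

hypothetical_miss_ratio = 0.01  # 1%, purely for illustration

# The same ratio yields very different absolute miss counts per second.
data_misses = data_cache_lookups * hypothetical_miss_ratio  # ~4 misses/s
meta_misses = meta_cache_lookups * hypothetical_miss_ratio  # ~902 misses/s

print(f"data cache: {data_misses:.0f} misses/s, meta cache: {meta_misses:.0f} misses/s")
```

Each of those meta-cache misses can trigger a remote fetch, which is why the meta cache's miss ratio deserves attention even when it looks tiny as a percentage.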
Takeaway