Overview

While RisingWave normally recovers from failures automatically, some issues can cause recovery itself to fail, leaving the cluster in a constant recovering state and unavailable. This topic describes common causes of recovery failure in a RisingWave cluster and how to resolve them.

Possible causes

It’s important to identify the root cause of the issue. Some common reasons for recovery failures include:

  • Continuous OOM of compute nodes.
  • Unconventional compute node scale-down that leaves insufficient schedulable resources.
  • Network connectivity problems.

Continuous OOM of compute nodes

How to identify:

  1. The meta node repeatedly enters the recovery state, or actors keep exiting during the recovery process.
  2. Check whether the compute nodes keep restarting due to OOM (see the example check below). Refer to: Out-of-memory.
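
For Kubernetes deployments, one quick way to confirm OOM restarts is to check the pod restart counts and the last termination reason; the pod names and labels below are placeholders and depend on your deployment.

    # List compute-node pods and their restart counts (pod naming is an assumption).
    kubectl get pods | grep compute

    # Inspect the last termination state; "OOMKilled" means the container exceeded its memory limit.
    kubectl describe pod <compute-node-pod> | grep -A 5 "Last State"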

Two solutions:

  1. Increase the allocated memory resources for compute nodes.
  2. Decrease the parallelism of the running streaming jobs or drop the problematic streaming jobs (see the command sketch after this list):
    1. alter system set pause_on_next_bootstrap to true;
    2. Reboot the meta service; the cluster will then enter safe mode after recovery.
    3. Drop the problematic streaming jobs, or scale them in using risectl. Refer to: Cluster scaling.
    4. Restart the meta node, or resume the cluster with risectl meta resume.
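
A minimal sketch of this workflow, assuming the frontend listens on the default localhost:4566 with database dev and user root, and that the problematic materialized view is named mv_with_oom (all placeholders):

    # 1. Pause all streaming jobs on the next bootstrap (safe mode).
    psql -h localhost -p 4566 -d dev -U root \
      -c "ALTER SYSTEM SET pause_on_next_bootstrap TO true;"

    # 2. Restart the meta service; after recovery the cluster stays paused.

    # 3. Drop the problematic streaming job (name is a placeholder), or scale it in with risectl.
    psql -h localhost -p 4566 -d dev -U root -c "DROP MATERIALIZED VIEW mv_with_oom;"

    # 4. Resume the cluster.
    risectl meta resume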

Unconventional CN scaling down

How to identify:

  1. The cluster is waiting for new nodes to join during the recovery process, for example when the meta node logs show: waiting for new workers to join, elapsed xxxs (see the check commands after this list).
  2. Check whether the available parallel units of the compute nodes in the cluster have decreased.
    1. Check whether some compute nodes have been actively taken offline.
    2. Check whether the CPU resource allocation of some compute nodes has been reduced and whether they have been offline for over 5 minutes.
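
On Kubernetes, you can check the meta node logs for the message above and list the registered workers; the pod name is a placeholder, and the cluster-info subcommand is assumed to be available in your risectl build.

    # Look for the tell-tale message in the meta node logs (pod name is a placeholder).
    kubectl logs <meta-node-pod> | grep "waiting for new workers to join"

    # List the registered compute nodes and their parallel units
    # (assumes your risectl build provides this subcommand).
    risectl meta cluster-info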

Solution:

  1. Manually specify the parallelism of the compute nodes with the -parallelism parameter and restart them, or temporarily launch additional compute nodes to meet the parallelism requirement.
  2. After that, to avoid OOM issues, rescale: decrease the parallelism of all running streaming jobs, or scale them out onto the temporary nodes.
  3. Change back to the new parallelism configuration (see the command sketch after this list).
    1. If you specified the parallelism of the compute nodes manually and do not want to keep specifying it manually every time you create streaming jobs, you can:
      1. Stop the compute nodes.
      2. Unregister them from the cluster: risectl meta unregister-workers --workers <worker_id or worker_host, ...>.
      3. Remove the -parallelism parameter and start the compute nodes again.
    2. Take the temporary nodes offline.
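
A sketch of the unregister step, assuming two compute nodes identified by host:port (the addresses are placeholders):

    # Stop the compute nodes first, then unregister them from the meta node.
    # Workers can be identified by worker ID or host address (values below are placeholders).
    risectl meta unregister-workers --workers 192.168.1.10:5688,192.168.1.11:5688

    # Restart the compute nodes without the -parallelism parameter so the default is used.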

Note: all of these cases will be covered by the upcoming auto-scaling feature.

Network connection issues

How to identify:

When there is a network connection problem between the meta node and the compute nodes, or between the meta node and the etcd nodes, cluster recovery will also continue to fail. You may find logs like connection refused or error trying to connect: dns error: failed to lookup address information: Name or service not known. A quick connectivity check is shown below.
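
On Kubernetes, one way to verify connectivity is to resolve the relevant services from inside the meta node pod; the pod and service names are placeholders, and this assumes the image ships basic networking tools such as nslookup.

    # Check DNS resolution of the etcd and compute-node services from the meta node pod
    # (pod and service names are placeholders).
    kubectl exec -it <meta-node-pod> -- nslookup <etcd-service>
    kubectl exec -it <meta-node-pod> -- nslookup <compute-node-service>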

Solution:

Check the network configuration of the deployment environment and verify that the Kubernetes operators are working properly, for example as shown below.
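
For example, on Kubernetes you can check that the operator and RisingWave pods are healthy and look for recent scheduling or networking errors; the namespaces below are placeholders.

    # Verify that the RisingWave operator and cluster pods are running (namespaces are placeholders).
    kubectl get pods -n <risingwave-operator-namespace>
    kubectl get pods -n <risingwave-namespace>

    # Inspect recent events for scheduling or networking errors.
    kubectl get events -n <risingwave-namespace> --sort-by=.lastTimestamp | tail -n 20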