MongoDB Feedback Portal

121 votes
Status: Submitted
Category: Atlas
Created by: Guest
Created on: Jul 22, 2020

Force restart of unhealthy node

Right now there is a "test failover" option, which shuts down the primary and forces an election. However, the option is only available when the cluster is in a healthy state. If, for whatever reason, the cluster is unhealthy, there is no way to manually restart the primary. It should be possible to force an election in an unhealthy state; often this is all that is required to get back to a healthy state (e.g., if the primary is stuck in a CPU-burning loop caused by an unexpected write pattern that has since stopped).
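For context, on a self-managed replica set this kind of recovery is a one-line operation from mongosh, and the request is essentially to expose an equivalent self-service action in Atlas even when the cluster is flagged unhealthy. A minimal sketch, assuming direct shell access to the affected primary (which Atlas does not provide in this situation):

    // Ask the current primary to relinquish its role and trigger an election.
    // The argument is how many seconds it will refuse to be re-elected.
    rs.stepDown(60);

    // If the mongod process itself is wedged (e.g. the CPU-burning loop
    // described above), cleanly shutting that member down also forces the
    // remaining electable members to hold an election.
    db.getSiblingDB("admin").shutdownServer();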
  • Guest
    Jul 10, 2025
    We've been waiting 18 hours for support (which we are paying good money for) to reboot our cluster after a large operation caused the dirty cache fill ratio to jump to over 20%, which completely CPU-locked a secondary node evicting pages (a way to check this ratio yourself is sketched at the end of this thread). We can't scale it up ourselves; we tried, and it is simply stuck and the operation failed. It's completely ridiculous that the best they can suggest for self-service is "Test Resilience", which causes a primary failover that their UI blocks you from triggering in many cases (such as the one we currently find ourselves in). I will never recommend Atlas again until this is resolved. This suggestion has been open for nearly 5 years, so I wouldn't hold my breath.
  • Guest
    Dec 27, 2024
    We have faced this problem many times on our production servers; there is no way to recover from a 100% CPU-burn situation unless someone from the support team restarts the server for us, causing almost 2-3 hours of downtime.
  • Guest
    Oct 25, 2023
    We also have this issue: if the primary gets loaded, auto-scaling doesn't have the capacity to respond, it just gets stuck, and you have to call support while your application is effectively down or struggling. We've simply scaled up our instances to overcome this, but it's a massive waste of resources to keep them scaled up so much just because scaling can't respond under load and isn't configurable. We've started to look at other DB solutions with more effective scaling. It's unbelievable that 'dark mode' for the UI is being worked on while critical scaling issues that cause outages are not.
  • Guest
    May 12, 2023
    We have faced this issue multiple times: when the primary got loaded and we tried to scale up the instance, it became unresponsive, and we had to get help from the support team, which again is a process-driven task. If we had the ability to restart the node ourselves, it would be much faster than what we are facing right now.
  • Guest
    Oct 29, 2020
    This is a badly needed feature. The only way to force an election is to contact MongoDB support and wait upwards of 2 hours for them to force a restart of the process on the unhealthy node. This has happened several times since we started using this service, and it's getting to the point where we may need to start looking at alternatives, because we might lose customers due to a lack of confidence in the system's availability.
  • Guest
    Oct 15, 2020
    This is definitely a useful feature that should be implemented. Having to wait for support to restart a faulty node increases MTTR, which matters when you are trying to avert a disaster, or at least mitigate it more quickly. Please consider implementing this.
  • 21 more comments
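Regarding the dirty cache fill ratio mentioned in the Jul 10, 2025 comment above: a rough way to watch that ratio yourself is to read the WiredTiger statistics from serverStatus() in mongosh. The statistic names below are the ones reported by recent MongoDB versions, and 20% is the usual default at which application threads start being drafted into eviction; treat this as an illustrative sketch, not official Atlas guidance.

    // Compute the WiredTiger dirty-cache fill ratio from serverStatus().
    const cache = db.serverStatus().wiredTiger.cache;
    const dirtyBytes = cache["tracked dirty bytes in the cache"];
    const maxBytes = cache["maximum bytes configured"];
    print(`dirty cache fill ratio: ${(100 * dirtyBytes / maxBytes).toFixed(1)}%`);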