Status: Submitted
Categories: Atlas
Created by: Guest
Created on: Oct 9, 2020

Add a grace period between a failure to obtain a certificate and the cluster shutting down.

If you set up encryption at rest on Atlas using an external key provider, such as Azure Key Vault, then the status of the MongoDB cluster becomes tied to Atlas being able to access the key vault. For example, on 28 September there was an issue at Azure (SM79-F88), and for over 2 hours around 19% of all authentication processes in Europe failed. This included access to the Key Vault used by our MongoDB cluster. Because Atlas checks the keys roughly every 15 minutes, it failed to obtain the key at the beginning of this period and our cluster was stopped. It was only restarted once Atlas could successfully authenticate and contact the Key Vault, which was over 2 hours later.

I completely understand the need for this type of linkage and for shutting down the cluster when encryption is no longer valid. However, it is very fragile during the type of incident described above. What I propose is a configurable grace period between failing to obtain a key and shutting down the cluster. If this had been available to us, we could have weathered this Azure issue without any downtime, as all our other infrastructure was running and accessible.
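
To make the proposal concrete, here is a minimal sketch of the grace-period logic. It is not how Atlas works internally; the class name `KeyCheckMonitor`, the `grace_period_s` setting, and the `"running"`/`"shutdown"` states are all hypothetical and only illustrate the idea of tolerating consecutive key-check failures for a configurable time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeyCheckMonitor:
    """Hypothetical tracker: decides when failed key checks should stop the cluster."""

    grace_period_s: float                     # how long consecutive failures are tolerated
    first_failure_at: Optional[float] = None  # time of the first failure in the current streak

    def record(self, now: float, key_ok: bool) -> str:
        """Record one key check at time `now` (seconds) and return the cluster state."""
        if key_ok:
            # Any successful check ends the failure streak.
            self.first_failure_at = None
            return "running"
        if self.first_failure_at is None:
            self.first_failure_at = now
        # Shut down only once failures have persisted for the whole grace period.
        if now - self.first_failure_at >= self.grace_period_s:
            return "shutdown"
        return "running"


# A 2-hour outage with checks every 15 minutes and a 3-hour grace period.
monitor = KeyCheckMonitor(grace_period_s=3 * 3600)
states = [monitor.record(t, key_ok=False) for t in range(0, 2 * 3600 + 1, 900)]
```

With a 3-hour grace period, every check during the 2-hour outage leaves the cluster running; with a grace period of 0, the first failed check stops it, which matches the behaviour described above. A real implementation would presumably also need to tell an unreachable key provider apart from a key that was deliberately revoked, since only the former should get a grace period.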
  • Guest
    Oct 15, 2020
    Hi Glenn, as communicated to others in
    https://feedback.mongodb.com/forums/924145-atlas/suggestions/41578642-allow-customer-encryption-key-validation-time-inte
    please accept our apologies for the availability consequences of the Azure outage you mentioned. You have my commitment that we are making changes on our side so that an outage like the one you mentioned does *not* in future lead to an Atlas cluster shutdown. We will instead treat transient errors like this differently. -Andrew (VP Cloud Products)