Everything Fails

Everything is horrible!

Wait, that’s not the message I want to send at all.

I wonder if these guys have an HA/DR plan...

Planning for Failure Should Be Comprehensive

Think about the last time you thought about high availability and disaster recovery…

You’re lying, nobody ever thinks about HA and DR. Not until something is already on fire, at least.

Now, pretending you did think about HA and DR at some point in the distant past, how far down the rabbit hole did you go? Were there two servers? Did each server have redundant NICs? Power supplies? Were you using RAID? Did you think about the UPS?

Every component in the system needs to be considered when you’re looking into HA and DR. Using an AlwaysOn Availability Group, clustering, or database mirroring isn’t enough – there’s more to it.

Failure Has Consequences

Let’s use a specific example instead of talking in the abstract.

We’ll assume that you’ve decided those super-fast consumer-grade SSDs are the way to go. You’ve planned the rest of your deployment. You’ve got an AlwaysOn Availability Group. You’re ready to go. Right?

There’s still one more thing to talk about – power. See, most of those consumer-grade SSDs don’t have any kind of battery or capacitor to protect writes that are still in flight. And, as you might know, disks lie: a drive can report a write as complete while the data is still sitting in its volatile cache. So we can’t really be sure that our writes are permanently stored unless we safely shut down the computer. Which always happens when the power goes out, right?
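
To make the “disks lie” problem a little more concrete, here is a minimal Python sketch of what an application can do on its side. The function name `durable_write` and the file name are hypothetical, made up for this illustration. `os.fsync` asks the operating system to push the file’s data down to the storage device; whether the device then actually persists it, or just says it did, is up to the drive and its cache.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path, then ask the OS to flush it to the device.

    Hypothetical helper for illustration. fsync only covers the OS side:
    a drive with a volatile write cache and no power-loss protection can
    still lose the data if the power is cut.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Flush the OS buffers for this file down to the storage device.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Write into a throwaway directory so the demo touches nothing real.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "important.dat")
        durable_write(target, b"please actually store this")
        print(os.path.getsize(target), "bytes written and fsynced")
```

Even with the flush in place, this is only as trustworthy as the hardware underneath it, which is exactly why the power questions matter.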

In this particular case, we need to keep worrying about power – what happens if the power fails? Is this server connected to a UPS? What happens when the UPS kicks in? Is there a backup generator? Will the server stay on? Can the server be automatically shut down, and if so, what does that shutdown look like?

Ask Awful Questions

Being prepared has everything to do with asking yourself terrible questions. Work through the entire stack and come up with as many ways for things to fail as you can. Explore how you’d prevent these scenarios. You won’t be able to mitigate everything you come up with, but it’s still worth knowing which risks you’re living with.

Once you’ve got your List of Awfulness, work the feasible things into your HA and DR plans. Make sure that you’re covered as best as you can. Sometimes it makes sense to sweat the small stuff.
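
If it helps to see the List of Awfulness as something more than a mental exercise, here is one rough way to write it down in Python. Everything in it is an assumption for illustration: the `FailureScenario` class, the `split_by_feasibility` function, and the sample entries (loosely based on the examples in this post) are not from any real tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureScenario:
    component: str                    # which part of the stack fails
    failure: str                      # the awful question, phrased as an event
    mitigation: Optional[str] = None  # None means nothing feasible (yet)


def split_by_feasibility(scenarios):
    """Return (planned, accepted): scenarios with a mitigation, and those without."""
    planned = [s for s in scenarios if s.mitigation is not None]
    accepted = [s for s in scenarios if s.mitigation is None]
    return planned, accepted


# Illustrative entries only; the mitigations are examples, not recommendations.
LIST_OF_AWFULNESS = [
    FailureScenario("network", "a NIC dies", "redundant NICs"),
    FailureScenario("power", "utility power fails", "UPS with automatic shutdown"),
    FailureScenario("storage", "SSD loses cached writes on power loss"),
]

if __name__ == "__main__":
    planned, accepted = split_by_feasibility(LIST_OF_AWFULNESS)
    for s in planned:
        print(f"PLAN:   {s.component}: {s.failure} -> {s.mitigation}")
    for s in accepted:
        print(f"ACCEPT: {s.component}: {s.failure} (no mitigation yet)")
```

The point of keeping the unmitigated items in the same list is that they stay visible: they are risks you have chosen to accept, not risks you forgot about.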


 

“Explosión” by kinojam is licensed under CC BY-NC-SA 2.0

1 Comment

  • Does the backup generator have enough fuel? Will anyone get alerted if it kicks in and the fuel starts running out? Who is responsible for topping it up – you/facilities manager/security guard?

    We had that one at our DR site; localised power failure, backup generator kicked in, ran out of fuel, nobody notified because DR didn’t have same level of alerting as Live, took days to get our Oracle db back in usable shape.

    Dread to think what would have happened had we had an outage at our Production DC with no DR site to failover to…
