Cory's Blog

September 23
Adventures in Disk Failure

For as often as I talk about data protection and backup, it's worth noting that I still haven't actually bought or deployed anything. In fact, more data is moving onto TECT all the time, and there's still nothing protecting it once it's there.

Fortunately, I have a certain level of fault tolerance, but I often feel that it's important to talk about the fact that fault tolerance is not a backup. Backups can have fault tolerance, and having fault tolerance on a server is a very good thing, but it's not the same as having a backup.

Recently, I (sort of) found this out with TECT. Fortunately no data was lost, but I got home one day this past week to find out that one of the disks in my VM data array had fallen offline two weeks earlier. For two fairly active weeks on the server, the array had been running in a degraded mode with no remaining fault tolerance.

I consider it a very big and important stroke of luck that I happened to look in on the server's base OS and open the hardware administration application. This isn't something I do on a regular basis, in part because most of the time, when I do look, nothing exciting is happening.

I went out to the server immediately to have a look, and sure enough, it was flashing angrily at me. I pulled the disk and, having nothing better to do with it at the moment, put it right back in. While checking online for current disk prices, I decided to check the administration tool again. What luck I have – the disk had been recognized and automatically started rebuilding itself.

I went ahead and put away everything I was doing and hopped on the bus over to Best Buy to grab another one of the newer Seagate disks I've been buying. I have that new disk still in its box hanging out in my room, just in case something should happen to one of the other disks.

This has been an interesting reminder of a few things I honestly already knew – learnt at least in part (and possibly most creatively) from The Tao of Backup. Backups aren't all that's at play here, however.

Before anything I write gets interpreted as "you don't need backups": this couldn't be further from the truth, and I'm still very much looking to implement some kind of backup system for TECT that relies on more than luck. However, I have definitely been focusing almost exclusively on backups lately. That's great from a data preservation perspective, but it can create a situation where all of my data is protected and I still end up needing those backups prematurely, because my RAID data store has gone down, or something else has happened to the server, while I wasn't paying attention to any other aspect of computer management.

The question now is what kind of monitoring I should be doing, and how that information should be delivered to me. To be honest, I'm not that crazy about monitoring. Because I would like to set up my own Exchange server at some point in the near future anyway, what I may end up doing is setting OpenManage Server Administrator (and the alerting function of whatever backup utility I end up using) to send messages to my inbox on the Exchange server. This will get me the information I need often enough, without taking undue effort or spending a lot on dedicated server monitoring hardware.
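For the curious, here's roughly what "a disk is unhappy, so email me" looks like if I ever script it myself instead of relying on the built-in alerting. This is only a minimal sketch in Python, not how OpenManage actually does it: the disk names, the state strings, the set of "healthy" states, and the email addresses are all assumptions made up for illustration, and the disk states would have to come from whatever tool reports them.

```python
from email.message import EmailMessage
import smtplib

# Assumption: states treated as healthy. Real tools may use other labels.
HEALTHY_STATES = {"Online", "Ready"}


def find_unhealthy(disks):
    """Return {name: state} for every disk whose state is not healthy."""
    return {name: state for name, state in disks.items()
            if state not in HEALTHY_STATES}


def build_alert(host, unhealthy, sender, recipient):
    """Build (but do not send) an email summarizing the unhealthy disks."""
    msg = EmailMessage()
    msg["Subject"] = f"[{host}] {len(unhealthy)} disk(s) need attention"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"{name}: {state}" for name, state in sorted(unhealthy.items())]
    msg.set_content("\n".join(lines))
    return msg


def send_alert(msg, smtp_host, port=25):
    """Hand the message to an SMTP server (hypothetical host and port)."""
    with smtplib.SMTP(smtp_host, port) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Hypothetical snapshot of the array; one disk has dropped out.
    disks = {"Disk 0:0": "Online", "Disk 0:1": "Failed", "Disk 0:2": "Online"}
    bad = find_unhealthy(disks)
    if bad:
        alert = build_alert("TECT", bad, "tect@example.com", "me@example.com")
        print(alert)  # swap for send_alert(alert, "mail.example.com") to send
```

Run on a schedule, something like this would have told me about the failed disk within hours instead of two weeks later, which is really the whole point.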

Having a machine like TECT is always an interesting adventure. When I first bought it, I definitely saw it as a very big PC, which was going to be able to run virtual machines easily. It would have been possible to use it this way, but what I ended up bringing up is a miniaturized server virtualization infrastructure. It works very well for what I need to do, but I scaled up storage (and therefore backup requirements) and located it "remotely" (in the garage), both of which force server-type monitoring and backup solutions.

One of the take-aways for me is that when you're running a large system, you need to be prepared to treat it like a large system. It's just unfortunate that most of this stuff is designed for corporations or larger organizations with at least a dozen or so servers. The environment I've created is fairly unique, and honestly, if I'd started one year later, when Sandy Bridge desktops were released, I might have gone in that direction instead, for several reasons. Now that both Sandy Bridge and Ivy Bridge are out, that's definitely the direction I tend to recommend to people looking for powerful computers.
