I started writing this last weekend, but the time I spent on fixing the server and building shelves meant I didn't have an awful lot of time left over for writing.
This weekend, my housemate was out of town for a day or so, so I figured I would take advantage of the quiet time by patching some virtual machines and then the host machine. I approved an update to the backup software and started patching one of the VMs, and then the entire system fell over. It stayed down for around an hour and a half, until I went and looked.
My house is still very much in a state of flux. A few months ago, I moved the desk that used to be in the office space at the front of the house into my bedroom, and then I started getting sick, so I had to stop making changes. I had to clear some stuff off a table I put there, which had been filled with stuff my housemate put there so she could use the other table I had stuff on, and then I started fiddling with the machine. As I moved some more stuff around it became clear that I needed to address the problem of the arrangement of the area.
I shuffled some stuff for the night and set about researching the problem. It turns out the error I was seeing happens for basically any number of totally random reasons, but my closest hint was a result I found suggesting it happened to somebody under high IOPS load. One of the disks on the machine had totally died and the cache battery is in questionable state, so there was my answer.
I couldn't get that disk to re-appear so I went to bed and headed out the next day to get a disk. I bought two disks and some other computing sundries I've been meaning to get at the local Staples store. I'm happy they had what I needed, and what they had is one of the few disks that matches my needs almost exactly. It's a newer manufacturing revision of the same disks I have been putting in my server since 2011 or 2012.
It's interesting to think about because, while these disks aren't the worst value in terms of storing a lot of data, they're not very good either. I received a suggestion to just upgrade the RAID card and get newer, bigger disks. Not a bad recommendation entirely, but not viable right then since I was trying to solve a problem getting the machine to even run.
I got home with the new disks and started to get ready to unpack them and then noticed it.
I already had a disk labeled as having been bought just around a year ago, in August 2017, probably from the previous time a disk dropped entirely. I tend to buy them in twos. So I labeled the two new disks, installed the existing one, and started the rebuild. I left it on the PERC screen to do that.
Just in thinking about the trouble I've had getting disks over the past year or so, I know this isn't the first time I've come to this conclusion but "small" internal spinning hard disks are basically a legacy technology at this point. On the other hand, my controller can't go above 2TB disks, otherwise I would be looking at putting a few bigger disks in RAID 1 or 10 and perhaps adding an SSD or two.
And really at this point other than a few weak points such as only having one power supply, the biggest problem with the server is IOPS.
I replaced the disk and was able to get the machine running, but because I spent most of Sunday waiting for the rebuild and my roommate was extremely excited to get back online, I just brought the machine back up and haven't since had time to do the patches I originally wanted to do. Fortunately, my roommate stepped out last night and I was able to get a few things patched back up. I've left more of the virtual machines turned off for the past week in hopes that running fewer virtual machines will have things be more stable, at least until I can get a few of my heaviest VMs onto some solid-state disks, or I can reconfigure things to be a little more performant.
Every now and again I think about splitting my virtualization workload back onto smaller machines, and making things a little more manageable in terms of disk IOPS is another reason to do it. Even ideas like entirely re-physicalizing servers come up from time to time.
This gets at a deeper issue with the TECT setup. When I started working on it in 2010, it was essentially a miniaturized version of what we now refer to as hyperconverged. I built it that way for convenience and to save a little bit compared to buying two or three servers and either using one as an iSCSI target or just distributing VMs between them randomly. Having a single big system reveals weaknesses in the way I chose to set up the system. The big RAID 6 array is particularly badly suited to a virtualization environment, especially when Windows is involved, and doubly so when any of the virtual machines get desktop-style usage.
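To put a rough number on why RAID 6 hurts here, this is a back-of-the-envelope sketch using the commonly cited write-penalty rule of thumb (about 6 back-end I/Os per random write on RAID 6, versus 2 on RAID 10). The function name, the eight-disk count, the 75 IOPS per spinning disk, and the 50% write mix are all illustrative assumptions, not measurements from this server.

```python
# Rough RAID IOPS estimate using the usual write-penalty rule of thumb.
# Every number below is an illustrative assumption, not a measurement.

# Back-end I/Os generated by one random front-end write.
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def effective_iops(disks, iops_per_disk, write_fraction, level):
    """Estimate the front-end IOPS an array can sustain for a given write mix."""
    raw = disks * iops_per_disk                  # total back-end IOPS available
    penalty = WRITE_PENALTY[level]
    # Each read costs 1 back-end I/O; each write costs `penalty` of them.
    cost_per_io = (1 - write_fraction) + write_fraction * penalty
    return raw / cost_per_io


if __name__ == "__main__":
    # Assumed: 8 spinning disks at ~75 IOPS each, half of all I/O being writes.
    for level in ("raid6", "raid10"):
        estimate = effective_iops(8, 75, 0.5, level)
        print(f"{level}: roughly {estimate:.0f} IOPS")
```

Under those assumptions the same eight disks deliver less than half the random I/O in RAID 6 that they would in RAID 10, which is the kind of gap that a handful of busy Windows VMs can eat through quickly. The trade-off is capacity: RAID 10 gives up half the raw space, where RAID 6 only gives up two disks' worth.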
The machine was purchased at a time when I knew I'd have increasing data storage needs over the years and 2TB disks had just started to exist. I chose the machine I have because it was the best way to get eight disks into a single machine. To move forward with the configuration, I need a new RAID controller, new disks, and ideally some number of solid state disks to create a mix of storage areas suitable to different needs. How it will all be set up is a detail for later, but as always, it's important to remember that backups are a concern. The software I'm using today doesn't support different backup sets, so I couldn't, say, run a backup of the VMs on one disk to a first external disk and then a backup of the VMs on another disk to a second external drive.
Ideally, I would decide what I wanted to do and then build out the backup system for it first. That almost certainly means choosing and installing a USB 3.0 or better controller and then picking up a big disk, or perhaps a NAS or something along the lines of a Drobo.
It's an issue I've known about for a little while, and I know the solutions to it essentially involve dropping a lot of money on a new controller and new disks. I just need to, you know, do that and get it over with.