So, I run My own server. It hosts this blog. It also, amongst other things, hosts my e-mail, and a local network share. (I know, I should use separate servers, but I do use containerisation to keep a modicum of separation)

To ensure the integrity of the data I use a software raid array, 4 disks in a raid 6 set, it's not a backup, and I should know better, but it has served me well enough. There have been a few disk failures, and I've not lost any data (at least none I care about enough to look at regularly enough to know it's gone) through any of them. One of those disk failures I put in a new disk, but it was slow, and had occasional read errors. Annoyingly these prevented me from installing the grub boot loader on that disk. But that's ok, there were three more disks, it's not a major problem. Critically however, it also prevented me installing grub on any other disks. And so begins our tale of fail. Since that disk has gone into the array there have been disk failures, I can't be certain of the number (disks don't fail in easy to identify patterns) but I can be certain that it is more than two. At least one more than two. Because the server I have doesn't support hot swapping drives, rebuilding the array requiires a restart of the system. Restating the system requires a working boot loader. The last of the disks with a working boot loader failed recently. This left me with a system that wouldn't boot, and installing grub wouldn't work with the slightly faulty disk in the system. I was left with a system that I couldn't repair without putting the integrity of my data at risk, or a long wait for the array to rebuild using a boot disk (knoppix as it happens. I highly recommend having a copy available to anyone who does any sort of computer support). I chose the latter. So it takes a long time to rebuild 3TBs of data onto a shiny new disk. And so my website, my blog, my emails too, have been offline for a long time. I have now replaced the slightly faulty drive, as well as the failed drive. The array is rebuilding (again) onto the newest drive. I have ordered enough disks to have a spare on hand. And I have learnt a lot about the grub-install command's modules flag. I have also now got the motivation to not only fix the technical debt that caused me to not have a server at home for three days, but also the technical debt that means I'm hosting a server at home, and not on a hosting service (I know what I'm doing this weekend).

posted at 7:54 pm on 5 Aug 2016 by Craig Stewart

Tags:sysadmin fail embarrassing mistakes breaking oops