Backup Infrastructure Redesign – Part 1: The Current Solution
While remote access is priority number one during the Covid-19 pandemic, that doesn’t mean all other systems can be neglected. This week I’ll be taking a look at updating my company’s backup infrastructure and design, which is currently in a very poor state, to say the least.
I’ll preface this by saying backups are something I have always shied away from in previous positions, and are therefore something I lack both experience and knowledge in. Yes, backups are absolutely essential, a system administrator’s bread and butter, but every company I’ve worked for previously has had a solid backup system already in place, along with “that guy” who took care of everything backup related.
Also…backups are boring! There, I’ve said it…feels good to get it out there! There’s nothing glamorous or particularly interesting, in my mind anyway, about taking a regular copy of something for safe keeping. By their very nature, backups are repetitive, steady, unchanging. If you get your kicks out of backups, I salute you!
So, this will very much be a journey of learning for me, which is why I thought it would be worth blogging about. Hopefully we’ll go from what is currently a loose and haphazard solution to something solid, reliable, and well designed. And that’s not to belittle whoever put the current backup solution in place. I’ve no doubt funding, time and resources were severely limited. We’ve also seen some rapid expansion of our company over the past 2 years, which would have been hard to predict. We’ve simply outgrown a decade-old backup design.
So just how bad is the current backup situation? Well, after a year in this company I’ve never been in a situation where I wasn’t able to restore a requested file, mailbox or server. However, I feel this has been down to good fortune as much as anything, and that it’s only a matter of time before we get bitten in the ass due to failed, sparse, or just non-existent backups.
So what is the current design like? It’s very simple:
- A hospital environment with approximately 24 ESXi hosts/150 VMs (our vSphere design is a whole other story)
- About half of these are backed up with Veeam
- The remainder are backed up with Synology Backup for Business
- Our on-premises Exchange environment has a separate backup solution, Backup Exec
- All three solutions write to a single Synology NAS.
- Our critical medical system has its own isolated and dedicated backup, which won’t be a part of the wider business backups, and therefore won’t be touched in the redesign.
So let’s break down the many problems here, not in any particular order of severity:
- We run a very old version of Veeam, 8.0, which is out of support. We are currently unable to upgrade due to later versions not supporting the backup of ESXi 4.1 hosts (I did say our vSphere environment was a whole other story…).
- Synology Backup for Business is a product that comes bundled with Synology NAS products. It’s “free” in terms of not needing a license per socket/host like Veeam, but I have some serious reservations about its reliability as an Enterprise backup solution.
- We have three backup solutions in our environment (four including our main clinical system), which causes confusion at restore time, and makes backups difficult to manage on an environment-wide level. I want to unify this into a single solution.
- We are writing to a single RAID 5 NAS. There is no secondary backup, and no off-site backup.
- We are at critical capacity, regularly hitting 1% remaining space on the NAS, with no slots available to expand.
- There is no thought put into when backups run. As tasks are added or modified, schedules are chosen on a whim.
- There is no uniformity or thought put into the retention policy of each backup task. It is practically random, and again chosen on a whim. Some tasks will retain 3 restore points, some 14, for no apparent reason.
- There is no archive copy taken of any VM. Forward incremental backups only. We currently cannot restore any VM to a point more than a couple of days back.
- The NAS backup repository is on the same campus, on a separate network with 10G connectivity from the main server room. However, the majority of tasks do not use this 10G network, instead going over the main server VLANs.
- There is no official agreement from management about the expected RPO or RTO for restores.
- There are no backup tiers; our critical business servers are treated in exactly the same way as test servers when it comes to backups.
- All backups target individual ESXi hosts rather than vCenter. This overlaps with another issue we have of getting older ESXi hosts migrated to versions that will support integration into vCenter, but that’s for another day!
- There is no verification of backups or test restores done to confirm backup integrity.
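To make a few of these points more concrete (tiers, retention, RPO and schedules), here is a minimal Python sketch of what a uniform, tiered policy could look like when written down in one place. It is not tied to Veeam or any other product, and everything in it is an assumption for illustration: the tier names, the RPO figures, the restore point counts and the window start times are placeholders, not numbers agreed with management.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class BackupTier:
    name: str
    rpo_hours: int        # maximum tolerable data loss, in hours
    restore_points: int   # number of restore points to keep
    window_start: time    # earliest time jobs in this tier may start


# Hypothetical tiers: every value below is a placeholder, not an agreed policy.
TIERS = {
    "critical": BackupTier("critical", rpo_hours=4, restore_points=84, window_start=time(20, 0)),
    "standard": BackupTier("standard", rpo_hours=24, restore_points=14, window_start=time(22, 0)),
    "test": BackupTier("test", rpo_hours=24, restore_points=3, window_start=time(1, 0)),
}


def retention_days(tier: BackupTier) -> float:
    """Roughly how far back a restore can reach, assuming one restore point per RPO interval."""
    return tier.restore_points * tier.rpo_hours / 24


def policy_for(vm_tiers: dict[str, str]) -> dict[str, BackupTier]:
    """Map each VM name to its tier's policy, rejecting tier names that are not defined."""
    unknown = {tier for tier in vm_tiers.values() if tier not in TIERS}
    if unknown:
        raise ValueError(f"unknown tiers: {sorted(unknown)}")
    return {vm: TIERS[tier] for vm, tier in vm_tiers.items()}
```

With made-up VM names, `policy_for({"exchange01": "critical", "lab-vm": "test"})` returns the policy for each server, and `retention_days` shows at a glance how far back each tier could restore. The point is simply that retention and schedule become a property of the tier rather than something chosen on a whim per task.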
I’m sure for anybody reading the above, many of these points will come as an absolute shock. But that’s the situation I’ve inherited, and now finally have some time to address, ironically in the middle of a pandemic.
So, barring any serious issues related to remote access during Covid-19 that will need to be addressed immediately, I hope to dedicate much of next week to researching a suitable backup design, coming up with a plan to put it in place, and executing it over the coming weeks.