VM Restore Testing: How Thorough Should We Be?
We all know regular testing of backups is critical, but how can we verify the integrity of our backups, beyond simply trusting the output of our backup software or booting up into the OS? How far do we need to go to prove that our backups are reliable and trustworthy?
This is relatively straightforward for file and email restore; with regular scheduled test restores we can verify that previously deleted items are present when a restore is complete. You might take this a step further, by including colleagues from different departments to suggest test files/shares/mailboxes to restore and confirm content is as expected. This also has the added benefit of getting sign-off on backup integrity outside of yourself and the IT department.
But things become a little trickier with Virtual Machines.
Recently, I had to do a restore of our PBX management VM. The restore appeared to go well; it was complete by 7am with no errors, and throughout the entire day there were no issues reported. Problems only began the following day, when the MySQL service failed early in the morning and continued to fail a few seconds after each restart, causing severe disruption to our call centre. On the suggestion of our vendor support, we restored from an older backup. Again, the VM worked fine throughout the day, but again MySQL failed the following morning, and we were back in the same position. We managed to resolve the issue with an offline cold migration of the VM, but it was clear that our backups of this VM were corrupt and unusable. While we are still investigating the root cause, it’s most likely that the backups were not backing up the MySQL database in a consistent manner, or some key process was running during the backup window.
We do monthly testing of our backups, and this particular PBX server was one we had tested a few months prior. But our testing only involves review of restore logs, bringing the VM online on an isolated network, and connecting in to confirm everything looks ok at an OS level. Is this enough? In this case it proved not to be. How could we have identified a problem with a backup that only manifested itself 12-24 hours after a restore?
Backup Software Verification Features
Backup solutions will generally include features to automatically verify the integrity of a backup. Using Veeam as an example, because it is the solution I’m most familiar with, it includes an option for a health check to be performed on a backup once it is complete.
This feature compares CRC values of backup metadata and hash values of data blocks on the VM’s disk to verify that the last restore point taken is consistent.
This is great to confirm that there has been no corruption at a block level during backup transfer or storage, and should be enabled where possible, but it is unlikely to help with application level issues, such as a backup taken while a service was running that should have been stopped first.
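To make the limitation concrete, here is a minimal sketch of block-level verification in general. It is not Veeam's implementation; the tiny block size, the function names and the idea of a stored "manifest" of hashes are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # illustrative only; tiny so the example is easy to follow


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return a SHA-256 hex digest for each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def corrupt_blocks(data: bytes, manifest: list[str],
                   block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of blocks whose hash differs from the manifest."""
    current = block_hashes(data, block_size)
    if len(current) != len(manifest):
        raise ValueError("block count differs from manifest")
    return [i for i, (now, then) in enumerate(zip(current, manifest)) if now != then]


# Hashes recorded when the (hypothetical) backup was written.
original = b"abcdefghijkl"
manifest = block_hashes(original)

# One byte flipped in storage: the second block (index 1) no longer matches.
damaged = b"abcdXfghijkl"
print(corrupt_blocks(damaged, manifest))
```

A check like this catches bytes that changed after the backup was written. It cannot catch a database that was already in an inconsistent state when it was read, because those blocks hash exactly as they were captured.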
More Advanced Testing
Veeam also includes SureBackup, an integrated feature that automates the restore, mounting and testing of a VM in an isolated sandbox. The tests include:
- VMware Tools heartbeat testing to confirm OS has booted
- Ping test to confirm network connectivity
- Application testing; predefined scripts are provided for domain controllers, web servers and mail servers. This generally takes the form of confirming a network port is open, or in the case of a SQL server, confirming connectivity to instances and databases.
- A generated report confirming the above
While the above is useful, the tests are very basic and will only highlight immediate and obvious issues with a restore. They are again unlikely to help us identify an issue that only appears hours afterwards.
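For a sense of how little a port test proves, this is roughly what one looks like. It is a sketch in Python using only the standard library, not the script Veeam ships; the function name and the default timeout are my own choices.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling `port_is_open(vm_address, 3306)` against a restored VM (3306 being MySQL's default port, `vm_address` a placeholder) only tells you that something is listening. Our PBX VM would have passed this on the day of the restore.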
Expanding on the above, SureBackup can also run custom scripts against the sandboxed restored VM. If you are familiar with the application you are restoring and handy with PowerShell, this could be an excellent way to directly test the functionality of the application, such as creating a test user in a restored domain controller or creating a test mailbox on a restored Exchange server.
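As an illustration of what a functional test might do beyond checking a port, the sketch below writes a probe row to a database, reads it back and cleans up. It is written in Python rather than PowerShell, and it uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; against the real VM you would pass in a connection from a MySQL driver instead. The table name `restore_probe` is hypothetical, and returning 0 for a pass assumes your backup tool treats a zero exit code as success, which is worth confirming in its documentation.

```python
import sqlite3
import sys


def round_trip_check(conn) -> bool:
    """Write a probe row, read it back, then clean up. True on success.

    conn is any DB-API 2.0 style connection.
    """
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE restore_probe (id INTEGER PRIMARY KEY, note VARCHAR(64))")
        cur.execute("INSERT INTO restore_probe (id, note) VALUES (1, 'probe')")
        conn.commit()
        cur.execute("SELECT note FROM restore_probe WHERE id = 1")
        row = cur.fetchone()
        cur.execute("DROP TABLE restore_probe")
        conn.commit()
        return row is not None and row[0] == "probe"
    except Exception as exc:
        print(f"round trip failed: {exc}", file=sys.stderr)
        return False


def main() -> int:
    """Return 0 if the round trip passed, 1 otherwise (assumed exit-code convention)."""
    conn = sqlite3.connect(":memory:")  # stand-in for the restored database
    try:
        return 0 if round_trip_check(conn) else 1
    finally:
        conn.close()
```

A wrapper script would finish with `sys.exit(main())`. A write-and-read test exercises more of the database engine than a connection test does, though it still only proves the database works at the moment the script runs.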
But still…would this have helped me last week identify a MySQL service that failed 12 hours later in the small hours of the morning? With hindsight, we did discover a hacky test that triggered the MySQL crash manually, and with further investigation from vendor support there may have been a MySQL test application we could have automated as part of the above custom scripts. But without this hindsight, no obvious test comes to mind, particularly knowing that the application/database worked fine for 12-24 hours after restore.
Restoring to a Test Environment
The final solution, which is both the most time consuming to perform and most expensive in terms of setup, is to restore to a fully functioning test environment, confirming functionality over the course of a few days with as much “real-life” application testing as possible, while still keeping the VM isolated from your production network.
This would 100% have highlighted the issue for us; even with light testing of the PBX VM over the course of a few days, it would have become apparent the day after the test restore that there were serious issues.
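Part of that multi-day testing can be automated by repeating a health check at intervals and recording when it first fails. The following is a minimal sketch of that idea; `soak_test` and its parameters are hypothetical names, and `check` can be any function that returns True while the application is healthy.

```python
import time
from typing import Callable, Optional


def soak_test(
    check: Callable[[], bool],
    duration_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Run check repeatedly for up to duration_s seconds.

    Returns the seconds elapsed at the first failed check,
    or None if every check passed.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if elapsed > duration_s:
            return None
        if not check():
            return elapsed
        sleep(interval_s)
```

The `clock` and `sleep` parameters exist so the loop can be tried out with a fake clock instead of waiting two real days. In practice, something like `soak_test(my_check, 48 * 3600, 300)` would poll every five minutes for 48 hours, which is the kind of window needed to catch a failure that only shows up the morning after a restore.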
It’s rarely possible to do this kind of verification with a single isolated VM; for our PBX, there are dependencies on associated VMs, soft switches, connected phones, and a DC for authentication. As nice as the Veeam sandbox is, I don’t believe a test to this scale would be possible (happy to be corrected!). A dedicated test environment is the only way to go if the system is to be tested regularly.
Of the above, for the majority of VMs, simply confirming that a VM can be restored and booted up is enough. Backup products that provide basic automated connectivity testing are very useful, and if they don’t impact backup times too badly, should be considered for any moderately important VM. For critical VMs, a dedicated test environment should be considered, particularly one that can be left permanently running with some underlying infrastructure (such as Active Directory), which VMs can be restored to and left for testing for as long as required. Of course, all this is heavily dependent on the size of your organisation and the backup SLAs you will be held to.