The Final Nail in the Coffin of a Fully Virtualised Layer 2-7 Home Lab
Every year or two I get the urge to build my ultimate virtualised home lab. A lab that would let me run enterprise applications, on Windows and Linux VMs, hosted on multiple virtualised ESXi hosts, shared across vSAN storage, interconnected with emulated multi-vendor routers, switches and firewalls, with DMZ, internet and cloud connectivity…basically, everything from Layer 2 upward, emulating a full enterprise environment, all running on a single box…
…ambitious, and probably a bit naïve! Though I feel like this time around I got very close. In doing so though, as much as an environment like this would be incredibly valuable to me, I realised it’s just not worth the time and effort to get everything playing ball with each other.
I thought I’d document how far I got before I burn it all to the ground, detail a few of the technical aspects, and explain why I won’t be bothering to try this ever again.
The basics of the lab are:
- A single “core” physical ESXi server hosting everything
- GNS3 VM, providing switching and routing for the ESXi hosts/VMs, as well as the bridged interface out to the physical home network/internet/cloud
- 3 x Nested ESXi hosts for VMware lab, and hosting server/desktop VMs
All of the above are perfectly stable and useable in isolation; GNS3 is fantastic for networking labs, nested ESXi hosts have been a great way to lab VMware products for a long time now, and building out a virtualised Windows/Linux lab environment is standard stuff.
The problems begin when trying to get all these elements to network through GNS3, and that’s what this post is going to detail.
So it didn’t work then?
TL;DR, how much of my dream home lab did I achieve? I got to the point where I had multiple VMs on different VLANs, hosted across multiple ESXi hosts, successfully able to communicate (RDP, ping, file browse) with each other across a Cisco IOSvL2 switch hosted on the GNS3 VM, with independent links on each ESXi host that could be disconnected to simulate and test failover. All running on a single server.
It was a struggle to get to that point, but once I saw those successful pings going out and back through GNS3, I thought I’d finally cracked it. From there, it would just be a case of adding more VMs and GNS3 appliances, and most likely giving the core server some extra juice…
That was until I tried to transfer files from VM to VM as a bandwidth test. The moment I dragged and dropped a file into Windows File Explorer on a remote VM, the whole lab network instantly dropped.
In short, some applications and protocols such as RDP, ping and file browsing worked fine. Others, such as file transfer and domain joining a computer, broke the lab’s entire network, and required the GNS3 client to be restarted to bring the network back online. This essentially made the whole lab environment unusable for its purpose.
And that’s where I am with it today. Close, but no cigar. Below are some technical details on the setup to get to that point.
The core server is an old IBM System x3650 M3 that was going to the scrap heap at work. 128GB of RAM, Xeon E5645 processors (12 cores/24 threads total), 2.2TB of 10K SAS disks. Not the most powerful of servers, but enough to let me test this lab configuration out, and somewhat expandable if need be. Running ESXi 6.7, with two physical network connections out to a firewall for future routing to the physical network and outside world. Any reference to “core” in this post is referring to this server.
The core server is running 3 x nested ESXi 6.7 VMs, named “ESXI-A/B/C”, identically configured for 12 cores, 32GB RAM, 500GB of storage. Again, not particularly powerful, but enough for testing the feasibility of the lab. Each of these hosts has a Windows server or two running on it for testing purposes, and a single interface onto a vSwitch which connected to my home lab network, for direct access to their management interfaces from my PC.
So far so good, the hosts are manageable from my home network and VMs can talk to each other without issue. This is a perfectly fine VMware and Windows/Linux VM lab.
Introducing GNS3 to the Equation
The next step was to get the ESXi hosts routing all traffic to GNS3, where I could switch/route using vendor specific networking devices. As in the previous screenshot, I am using the GNS3 client connecting to a GNS3 VM hosted on the core ESXi host to create the networking topology.
I created a simple GNS3 native Ethernet Switch, which connected to a GNS3 Cloud connector. The Cloud connector essentially provides connectivity to all interfaces configured on the GNS3 VM. There are 9 additional interfaces added to the GNS3 VM, as below:
Why 9 additional interfaces on the GNS3 VM? I wanted each nested ESXi host to have independent uplinks that mimicked a real cable, for the purpose of experimenting with ESXi NIC Teaming and general failover scenarios. Just like with a real hardware lab, I wanted to be able to disconnect an uplink and observe the results. Almost like a virtual Layer 1!
To achieve this, none of the nested ESXi hosts can share a vSwitch on the core ESXi host, instead linking each interface independently to the GNS3 VM. Each “cable” is simulated by an independent Port Group and vSwitch on the core ESXi host, creating a bridge between the nested ESXi’s interface and the GNS3 VM’s interface (hence why an additional 9 interfaces are required).
Below for example shows a single uplink between an interface on ESXI-A and an interface on the GNS3 VM:
There are three such links per host. Below shows all vSwitches and Port Groups on the core ESXi host.
As a naming convention, PGx signifies the Port Group uplinks (3 on each host), and the letter A-C signifies the host, for a total of 3 hosts each with 3 independent uplinks.
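For reference, each of these virtual “cables” can be created from the core host’s ESXi shell with esxcli; this is a sketch for the first uplink to ESXI-A only, and the vSwitch/Port Group names follow the convention above (adjust to your own naming):

```shell
# On the core ESXi host: create one isolated "cable" bridging ESXI-A's
# first uplink to a GNS3 VM adapter. No physical uplink is attached,
# so traffic stays internal to the core host.
esxcli network vswitch standard add --vswitch-name=vSwitch-PG1A

# Port group that both the nested host's adapter and the GNS3 VM's
# adapter will attach to.
esxcli network vswitch standard portgroup add --portgroup-name=PG1A --vswitch-name=vSwitch-PG1A
```

Repeat per uplink (9 times in total for this layout), then attach the matching network adapters of the nested host and the GNS3 VM to each Port Group.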
Originally I believed the GNS3 VM was limited by vSphere to 10 Network Adapters, which therefore limited each host to 3 interfaces (plus 1 for connectivity to my home network for management). However, looking at it again now, I believe you can add an additional virtual controller to the GNS3 VM if you wanted more than 10 connections.
The only other active vSwitch on the core host, “VM Network”, takes the last interface on each ESXi host, to link the host onto my home LAN for management purposes.
Below shows how the Network Adapters and Port Groups appear on the GNS3 VM:
With below demonstrating the matching Port Groups on an ESXi host (ESXI-A as an example):
Below shows my final chart for tracking how each vmnic/Port Group/vSwitch connects between the nested ESXI hosts/core ESXi host/GNS3 VM, which will hopefully make it clearer how everything connects. Keeping a chart like this is essential, otherwise you’ll drive yourself absolutely batshit crazy trying to keep track:
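The naming convention is regular enough that you can generate the skeleton of the chart rather than type it out. A minimal sketch (the vmnic numbering on the nested hosts is my assumption; adjust to match your own mapping):

```shell
#!/bin/sh
# Generate the uplink tracking chart from the naming convention:
# hosts A-C, three Port Groups each (PG1A..PG3C), one vSwitch per "cable".
chart=""
for host in A B C; do
  for pg in 1 2 3; do
    # vmnic numbering on the nested host assumed to start at vmnic0
    chart="${chart}ESXI-${host} vmnic$((pg-1)) -> PG${pg}${host} -> vSwitch-PG${pg}${host} -> GNS3 VM
"
  done
done
printf '%s' "$chart"
```

That prints the 9 host-side rows; fill in the matching GNS3 VM adapter numbers by hand from the VM’s hardware list.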
This setup worked perfectly. I could have a VM on ESXI-A ping a VM on ESXI-C. Disconnecting one of the uplinks on ESXI-A would realistically result in the traffic being moved over to another NIC in the NIC Team. Disconnecting all the links would drop the ping completely. Bringing any of the links back online would immediately start the ping again. Latency was a solid 1ms. Perfect!
A small caveat with disconnecting uplinks was that the vSwitch topology screen of the nested host will still show each physical adapter as up, even when disconnected:
Setting Load Balancing to “Route based on originating port ID” allows the nested host to detect when a link has failed, as below:
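That policy can also be applied from the nested host’s ESXi shell; vSwitch0 here is an assumption for whichever vSwitch carries the teamed uplinks:

```shell
# On the nested host (e.g. ESXI-A): load-balance by originating virtual
# port ID, which lets the host fail traffic over when a link drops.
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=portid
```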
You’ll also notice from the above that “Allow promiscuous mode” and “Allow forged transmits” are set to Yes on each of the core host’s vSwitches that bridge the GNS3 VM and nested ESXi hosts, which is a requirement to allow traffic to flow between them.
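Those security settings can be applied per bridge vSwitch from the core host’s shell; the vSwitch name below follows my convention and would need repeating for each of the 9 bridges:

```shell
# On the core ESXi host: let the GNS3 VM receive frames destined for the
# nested hosts (promiscuous) and send frames with their MAC addresses
# (forged transmits) - both required for the bridge to pass traffic.
esxcli network vswitch standard policy security set --vswitch-name=vSwitch-PG1A --allow-promiscuous=true --allow-forged-transmits=true
```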
Next was to introduce VLANs.
The trick to get a trunk working from your ESXi host through to GNS3 is to simply set each of the Port Groups on the core ESXi host to VLAN 4095, which allows all VLANs. You can then set up your Port Groups and switch trunking as you normally would (using the GNS3 native Ethernet Switch for testing, that was just a case of enabling 802.1Q on the trunk ports). This worked perfectly; creating a VLAN 10 port group on ESXI-A and ESXI-C for example allowed VMs on that VLAN to ping each other.
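From the core host’s shell, the trunking trick is a one-liner per Port Group (PG1A assumed as an example name):

```shell
# On the core ESXi host: VLAN ID 4095 puts the port group in VGT mode,
# passing all VLAN tags through untouched - i.e. a trunk.
esxcli network vswitch standard portgroup set --portgroup-name=PG1A --vlan-id=4095
```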
I was pretty happy with this, as one of the key things I wanted to achieve was for the management of the nested ESXi hosts to behave exactly as if they were “real” ESXi hosts. I didn’t want the upstream shenanigans on the core host and GNS3 VM to force extra steps when creating VLANs, for example.
And so I thought I was set. I had independent uplinks, VLANs working, and a transparent link through GNS3 with 1ms latency. The core ESXi host was essentially invisible to the whole process, so the work of creating VMware and Windows labs would never involve any further configuration on it; all work would be done connecting directly to the management interface of the nested ESXi hosts (or the planned vCenter!).
But inevitably I hit a snag, one I no longer have the will to overcome!
I can ping across the GNS3 Cloud connector. I can remote file browse, and copy small files. I can even connect via RDP between VMs. But any file transfer over a certain size, which through testing appeared to be approx. 10K, will instantly stop the lab’s entire network, with a restart of the GNS3 client being required to bring the connection back up. In that state, the lab is unusable, as of course most applications will at some stage make a transfer of data over 10K! Domain joining a computer for example fails and breaks the network.
The GNS3 forums pointed me towards a possible MTU issue, and so I’ve tried experimenting with MTU size across all points in the traffic’s path; the Windows VM interfaces, the nested ESXi vSwitches, the “core” ESXi vSwitches, the GNS3 VM interfaces, but still with the same result.
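For anyone chasing a similar symptom, the usual way to pin down an MTU problem is a don’t-fragment ping sweep between VMs, plus checking the MTU at each hop; a sketch of what I was running (the vSwitch name is from my convention, and the target IP is a placeholder):

```shell
# From a Windows VM: -f sets don't-fragment, -l sets payload size.
# 1472 + 28 bytes of IP/ICMP headers = a standard 1500-byte packet;
# shrink the payload until the ping passes to find the path MTU.
ping -f -l 1472 <remote-vm-ip>

# On each ESXi host (core and nested): list vSwitch MTUs, set if needed.
esxcli network vswitch standard list
esxcli network vswitch standard set --vswitch-name=vSwitch-PG1A --mtu=1500
```

In my case the sweep never pointed at a consistent threshold, which is part of why I suspect the issue isn’t purely MTU.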
Where to go from here
And that I feel is the nail in the coffin for the dream of a fully virtualised lab.
I do believe I could eventually get the file transfer to work (I haven’t dug too deeply into the issue), but the whole process has demonstrated to me why chasing this setup is simply a bad idea.
It would introduce a huge and quite unpredictable variable to everything I’d do in this lab environment going forward. It would always be at the back of my mind as something that might be interfering. Imagine troubleshooting an issue for hours at the VMware or OS layer, only to find the issue all along was another quirk or bug in GNS3 that you’d never encounter in a real-life scenario. And what other, bigger issues lie on the horizon?
And that’s not to slight GNS3, it’s an amazing product, but one that should be used for its intended purpose, or at least with very limited VM integration (the in-built QEMU emulation for VMs from within the GNS3 topology is slow but definitely useful for small environments).
There is value in improving your general troubleshooting skills by deep diving into these kinds of challenges, but at this stage I’d prefer to spend that time actually learning about a technology, rather than fighting with the lab it’s running on.
And so, for that reason, I’m out! The only solution is to find the space, money, and relationship compromises to build out a full rack hardware lab to fulfil all my home lab needs!