Homelab Rework: Phase 1

Mon Jul 10 10:22:48 EDT 2023

Tags: homelab linux

Jun 25 2023 - Planning a Homelab Rework
Jul 10 2023 - Homelab Rework: Phase 1
Sep 15 2023 - Homelab Rework: Phase 2
Mar 02 2024 - Homelab Rework: Phase 3 - TrueNAS Core to Scale

I mentioned the other week that I've been pondering ways to rearrange the servers I have in the basement here, which I'm presumptuously calling a "homelab".

After some fits and starts getting the right parts, I took the main step over the weekend: reworking terminus, my gaming PC and work VM host. In my earlier post, my choice of OS was a bit up in the air, but I did indeed end up going with the frontrunner, Proxmox. While the other candidates had their charms, I figured it'd be the least hassle long-term to go with the VM-focused utility OS when my main goal was to run VMs. Moreover, Proxmox, while still in the category of single-vendor-run OSes, is nonetheless still open-source in a way that should be reliable.

The Base Drive Setup

terminus, being a Theseus-style evolution of the desktop PC I've had since high school, is composed generally of consumer-grade parts, so the brands don't matter too much for this purpose. The pertinent part is how I reworked the storage: previously, I had had three NVMe drives: one 256GB one for the system and then two distinct 2TB drives formatted NTFS, with one mounted in a folder in the other that had grown too big for its britches. It was a very ad-hoc approach, having evolved from earlier setups with other drives, and it was due for a revamp.

For this task, I ended up getting two more 2TB NVMe drives (now that they're getting cheap) and some PCIe adapters to hold them beyond the base capacity of the motherboard. After installing Proxmox on the 256GB one previously housing Windows, I decided to join the other four in a RAID-Z with ZFS, allowing for one to crap out at a time. I hit a minor hitch here: though they're all 2TB on paper, one reported itself as being some tiny sliver larger than the other three, and so the command to create the pool failed in the Proxmox GUI. Fortunately, the fix is straightforward enough: the log entry in the UI shows the command, so I copied that, added "-f" to force creation based on the smallest common size, and ran the command in the system shell. That worked just fine. This was a useful pace-setting experience too: while other utility OSes like pfSense and TrueNAS allow you to use the command line, it seems to be more of a regular part of the experience with Proxmox. That's fine, and good to know.

Quick Note On Repositories

Proxmox, like the other commercial+open utility OSes, has its "community"-type variant for free and the "enterprise" one for money. While I may end up subscribing to the latter one day, it'd be overkill for this use for now. By default, a Proxmox installation is configured to use the enterprise update repositories, which won't work if you don't set up a license key. To get on the community train, you can configure your apt sources. Specifically, I commented out the enterprise lines in the two pre-existing files in /etc/apt/sources.list.d/ and then added my own "pve-ce.list" file with the source from the wiki:

deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription

Importing Old Windows VMs

My first task was to make sure I'd be able to do work on Monday, so I set out to import the Hyper-V Windows VMs I use for Designer for a couple clients. Before destroying Windows, I had copied the .vhdx files to my NAS, so I set up a CIFS connection to that in the "Storage" section of the Proxmox GUI, basically a Proxmox-managed variant of adding an automatic mount to fstab.

From what I can tell, there's not a quick "just import this Hyper-V drive to my VM" process in the Proxmox GUI, so I did some searching to find the right way. In general, the tack is this:

Make sure you have a local storage location set to house VM Disk Images
Create a new VM with a Windows type in Proxmox and general settings for what you'd like
On the tab where you can add disks, delete the one it auto-creates and leave it empty
On the command line, go to the directory housing your disk image and run a command in the format qm importdisk <VMID> <imagename>.vhdx <poolname> --format qcow2. For example: qm importdisk 101 Designer.vhdx images --format qcow2
Back in the GUI (or the command line if you're inclined - qm is a general tool for this), go to your VM, find the imported-but-unattached drive in "Hardware", and give it an index other than 0. I set mine to be ide1, since I had told the VM in Hyper-V that it was an IDE drive
In "Options", find the Boot Order and add your newly-attached disk to the list
Download the drivers ISO to attach to your VM. Depending on how old your Windows version is, you may have to go back a bit to find one that works (say, 0.1.141 for Windows 7). Upload that ISO to a local storage location set to house ISOs and attach it to your VM
Boot Windows (hopefully), let it do its thing to realize its new home, and install drivers from the "CD drive" as needed

If all goes well, you should be able to boot your VM and get cracking. If it doesn't go well, uh... search around online for your symptoms. This path is about the extent of my knowledge so far.

The Windows Gaming Side

My next task was to set up a Windows VM with PCIe passthrough for my video card. This one was a potential dealbreaker if it didn't work - based on my hardware and what I read, I figured it should work, but there's always a chance that consumer-grade stuff doesn't do what it hypothetically should.

The first step here was to make a normal Windows VM without passthrough, so that I'd have a baseline to work with. I decided to take the plunge and install Windows 11, so I made sure to use a UEFI BIOS for the VM and to enable TPM support. I ran into a minor hitch in the setup process in that I had picked the "virtio" network adapter type, which Windows doesn't have driver support for in the installer unless you slipstream it in (which I didn't). Windows is extremely annoyed by not having a network connection at launch, dropping me into a "Let's connect you to a network" screen with no options and no way to skip. Fortunately, there's a workaround: type Shift+F10 to get a command prompt, then run "OOBE\BYPASSNRO", which re-launches the installer and sprouts a "skip" button on this phase. Once I got through the installer, I was able to connect the driver ISO, install everything, and have Windows be as happy as Windows ever gets. I made sure to set up remote access at this point, since I won't be able to use the Proxmox console view with the real video card.

Then, I set out to connecting the real video card. The documentation covers this well, but it's still kind of a fiddly process, sending you back to the command line for most of it. The general gist is that you have to explicitly enable IOMMU in general and opt in your device specifically. As a note, I had to enable the flags that the documentation says wouldn't be necessary in recent kernel versions, so keep an eye out for that. Before more specifics, I'll say that my GRUB_CMDLINE_LINUX_DEFAULT line ended up looking like this:

GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt intel_iommu=on pcie_acs_override=downstream"

This enables IOMMU in general and for Intel CPUs specifically (the part noted as obsolete in the docs). I'll get to that last bit later. In short, it's an unfortunate concession to my current hardware.

Anyway, back to the process. I went through the instructions, which involved locating the vendor and device identifiers using lspci. For my card (a GeForce 3060), that ended up being "10de:2414" and "01:00.0", respectively. I made a file named /etc/modprobe.d/geforce-passthrough.conf with the following lines (doing a "belt and suspenders" approach to pass through the device and block the drivers, an artifact of troubleshooting):

options vfio-pci ids=10de:2414,01:00.0
blacklist nvidiafb
blacklist nvidia
blacklist nouveau

The host graphics are the integrated Intel graphics on the CPU, so I didn't need to worry about needing the drivers otherwise.

With this set, I was able to reboot, run lspci -nnk again, and see that the GPU was set to use "vfio-pci" as the driver, exactly as needed.

So I went to the VM config, mapped this device, launched the VM, and... everything started crapping out. The OS was still up, but the VM never started, and then no VMs could start, nor could I do anything with the ZFS drive. Looking at the pool listing, I saw that two of the NVMe drives had disappeared from the listing, which was... alarming. I hard-rebooted the system, tried the same thing, and got the same results. I started to worry that the trouble was the PCIe->NVMe adapter I got: the two missing drives were attached to the same new card, and so I thought that it could be that it doesn't work well under pressure. Still, this was odd: booting the VM was far less taxing on those drives specifically than all the work I had done copying files over and working with them, and the fact that it consistently happened when starting it made me think that wasn't related.

That led me to that mildly-unfortunate workaround above. The specific trouble is that PCIe devices are controlled in groups, and my GPU is in the same group as the afflicted PCIe-> NVMe adapter. The general fix for this is to move the cards around, so that they're no longer pooled. However, I only have three PCIe ports of suitable size, two filled with NVMe adapters and one with a video card, so I'm SOL on that front.

This is where the "pcie_acs_override=downstream" kernel flag works. This is something that used to be a special kernel patch, but is present in the stock one nowadays. From what I gather, it's a "don't do this, but it'll work" sort of thing, tricking the kernel into treating same-grouped PCIe devices separately. I think most of the trouble comes in when multiple grouped devices are performing similar tasks (such as two identical video cards), which could lead two OSes to route confusing commands to them. Since the two involved here are wholly distinct, it seems okay. But certainly, I don't love it, and it's something I'll look forward to doing without when it comes time to upgrade the motherboard in this thing.

As a small note, I initially noticed some odd audio trouble when switching away from an active game to a different app within Windows. This seems to be improved by added a dummy audio device to the VM at the Proxmox level, but that could also be a placebo.

But, hackiness aside, it works! I was able to RDP into the VM and install the Nvidia drivers normally. Then, I set up Parsec for low-latency connections, installed some games, and was able to play with basically the same performance I had when Windows was the main OS. Neat! This was one of the main goals, demoting Windows to just VM duty.

Next Steps: Linux Containers and New VMs

Now that I have my vital VMs set up, I have some more work to do for other tasks. A few of these tasks should be doable with Linux Containers, like the VM I had previously used to coordinate cloud backups. Linux Containers - the proper noun - differ from Docker in implementation and idioms, and are closer to FreeBSD jails in practice. Rather than being "immutable image base + mutable volumes" in how you use them, they're more like setting up a lightweight VM, where you pick an OS base and its contents are persistent for the container. My plan is to use this for the backup coordinator and (hopefully) for a direct-install Domino installation I use for some work.

Beyond that, I plan to set up a Linux VM for Docker use. While I could probably hypothetically install Docker on the top-level OS, that gives me the willies for a utility OS. Yes, Proxmox is basically normal old Debian with some additions, but I still figure it's best to keep the installation light on bigger-ticket modifications and, while I'm not too worried about the security implications with my type of use, I don't need to push my luck. I tinkered a little with installing Docker inside a Linux Container, but ran into exactly the sort of hurdles that any search for the concept warn you about. So, sadly, a VM will be the best bet. Fortunately, a slim little Debian VM shouldn't be too much worse than a Container, especially with performance tweaks and memory ballooning.

So, in short: so far, so good. I'll be keeping an eye on the PCIe-passthrough hackiness, and I always have an option to give up and switch to a "Windows 11 host with Hyper-V VMs" setup. Hopefully I won't have to, though, and things seem promising so far. Plus, it's all good experience, regardless.

Uwe Brahm - Thu Jul 13 05:57:16 EDT 2023

I'd be interested if you managed to get a working Domino 12 or 14 server on your Proxmox server setup. Using ZFS could also be a benefit there.

Jesse Gallagher - Fri Jul 14 13:56:27 EDT 2023

I have an LXC container with Domino 12 running now - I copied the files over from the VM they were in previously, but I assume that a normal installation would work. It's a dev server, so I doubt I'll really try to turn the screws to see how ZFS could benefit me, but it's certainly nice knowing that's the underlying filesystem.

At some point, I'll likely make another container to act as a local replica of my primary production servers, and then I'll have more occasion to tinker with making it a "real" server.

New Comment

Name: Body: