Hiccups in automating the new server

Jon Nagel

Oct 9, 2022 • 3 min read

There's some good and bad news with the new server....

Let's start with the good. I've automated about 80% of the build, got the server stood up, and migrated most of my data over to it. All the important, everyday bits now run from it instead of my old microserver.

The bad, well, it's more like tepid, stale water. Let's go in order through the things that went wrong or got skipped.

The Bad

Kickstart

Yeah, the core of building the server got axed. Attempts were made, but for some reason I could not get the server to recognize the boot option, or when I did, it complained about issues that I did not hit when testing on VMs. Instead of dwelling on this one, I've chosen to skip it indefinitely. It was more of a nice-to-have than a need. The disk layout and config options aren't complex enough to worry about messing them up if I have to redo the install from scratch for some reason.

Snapshots

Yeah... Now we're crossing into breaking core designs, but the good news is that this is just skipped "for now". This is partially from originally wanting to do the base of the system as openSUSE and then switching back to Fedora. openSUSE makes it incredibly easy to set up and use snapper out of the box. On Fedora, it's easy with a giant asterisk. It's the biggest pain in the world to automate on Fedora 36. Whatever you're picturing as the biggest pain, double it. There is a fairly straightforward guide to follow, but I have not had luck automating it the way it needs to be done to get it to work.

The issue comes from the need to make very specific kernel modifications because of the filesystem defaults in Fedora. First, SUSE_BTRFS_SNAPSHOT_BOOTING=true needs to be added to the grub config, and the boot kernel needs to be rebuilt to support this change. Next, you need to change the subvolume IDs for the /.snapshots volume so that you can control how the system boots. If these steps are in any way wrong (including trying to skip the reboot needed to set part of it), you lose access to the system.
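To make the first of those steps less fragile to automate, here is a minimal Python sketch of adding the option idempotently. It only transforms text, so it can be tested without touching a real system. The function name `set_grub_option` is hypothetical, and I'm assuming the option lives in a shell-style `KEY=value` file such as /etc/default/grub. It does not cover regenerating the boot configuration or changing the subvolume IDs.

```python
def set_grub_option(config_text: str, key: str, value: str) -> str:
    """Return config_text with KEY=value present exactly once.

    Hypothetical helper: an existing KEY= line is replaced in place,
    duplicate KEY= lines are dropped, and the option is appended if missing.
    """
    new_line = f"{key}={value}"
    out = []
    replaced = False
    for line in config_text.splitlines():
        if line.strip().startswith(f"{key}="):
            if not replaced:
                out.append(new_line)  # keep the option where it first appeared
                replaced = True
            # any later duplicates are silently dropped
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Example input, not a real Fedora default file.
sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX="rhgb quiet"\n'
updated = set_grub_option(sample, "SUSE_BTRFS_SNAPSHOT_BOOTING", "true")
print(updated)
```

Running the function a second time on its own output returns the same text, which is the property an automated run needs so that repeated executions don't stack up duplicate lines.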

I'm not skipping it. Eventually I will get this part done, even if it's unfortunately done manually. Mainly, this is going to be needed when I start doing system upgrades, doing more ad-hoc work with containers, and backing up the VMs in the cluster.

Plex Hardware Encoding

This one is confusing. Plex is being run inside a Podman container. Nearest I can tell, the official Plex container doesn't play nicely with Podman. It mostly has to do with how Podman does hardware passthrough compared to Docker. Docker passes the device through, while Podman uses hooks into the device to share its abilities. For some reason, even when the device is successfully passed through, running nvidia-smi returns "command not found", and Plex doesn't see the option for hardware encoding either.
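For when I get back to this, a small check like the following Python sketch could confirm whether nvidia-smi is even on the PATH inside the container. The function names and the container name "plex" are hypothetical, and it assumes the image ships an `sh` shell. It only tells you whether the binary is visible, not whether the GPU itself works.

```python
import subprocess


def build_check_command(container: str) -> list:
    """Build a `podman exec` command that looks up nvidia-smi inside a container."""
    return ["podman", "exec", container, "sh", "-c", "command -v nvidia-smi"]


def has_nvidia_smi(container: str, run=subprocess.run) -> bool:
    """Return True if nvidia-smi is found on the container's PATH.

    `run` is injectable so the logic can be exercised without Podman installed.
    """
    result = run(build_check_command(container), capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    # "plex" is a placeholder container name, not necessarily the real one.
    print(build_check_command("plex"))
```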

I need more time to investigate this, ideally at a time when Plex can be down without interrupting anything (which sucks, because that's how I've been watching most of my content lately).

Updates? Kinda

I'll be honest, I didn't plan exactly how I would be running updates on things. In order, this is what needs to be done:

Update the core system (security or all)
Update containers (currently Plex and Jellyfin)
VM upgrades (split between cluster VMs and ad-hoc ones, but this is something I haven't spent any time figuring out since I haven't built the cluster part yet)

A week or so ago, I did make a general playbook to update the system, with tags to differentiate between security updates and full updates that include a reboot. That went mostly fine, but it did show me that I had an issue with the containers and the systemd files that were created for them. I haven't spent time figuring out how to fix them yet, but it's on my list of things to do before the end of the year.
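The security-versus-full split can be sketched roughly like this in Python. This is not the playbook itself, just an illustration of the idea. The function name `build_update_commands` and the mode names are hypothetical, and it assumes a dnf-based host like Fedora.

```python
from typing import List


def build_update_commands(mode: str) -> List[List[str]]:
    """Return the commands for one update run, in the order they would execute.

    "security" applies only security errata and skips the reboot.
    "full" upgrades everything and finishes with a reboot.
    """
    if mode == "security":
        return [["dnf", "upgrade", "--security", "-y"]]
    if mode == "full":
        return [["dnf", "upgrade", "-y"], ["systemctl", "reboot"]]
    raise ValueError(f"unknown update mode: {mode!r}")


if __name__ == "__main__":
    # Dry run: print the plan instead of executing it.
    for mode in ("security", "full"):
        for command in build_update_commands(mode):
            print(mode, "->", " ".join(command))
```

Keeping the plan as data makes it easy to dry-run before anything touches the system.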

As for the VM part, I have thoughts about how I'm going to update the cluster, but for other VMs, I'm either going to use the same playbook as for the core system, or I'm going to build in running updates via the console somehow (I doubt it, I don't want to deal with expect if I can avoid it).

The Cluster itself

This one is more of a cop-out than anything 😉. I have the files needed to make the XML domain definitions, but I haven't had the time to tweak them for what I need and then build out the variables. I should also be able to leverage some info from some playbooks at work if I get stuck. Again, it's more a matter of time than an issue with getting it working.
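To give a rough idea of what "build out the variables" means, here is a minimal Python sketch that renders a libvirt-style domain definition from a few values. The function name, the node names, and the sizes are all hypothetical, and a real definition would need disks, networks, and more. This only shows the shape of the templating.

```python
import xml.etree.ElementTree as ET


def build_domain_xml(name: str, memory_mib: int, vcpus: int) -> str:
    """Render a bare-bones libvirt domain definition from a few variables."""
    domain = ET.Element("domain", {"type": "kvm"})
    ET.SubElement(domain, "name").text = name
    ET.SubElement(domain, "memory", {"unit": "MiB"}).text = str(memory_mib)
    ET.SubElement(domain, "vcpu").text = str(vcpus)
    os_el = ET.SubElement(domain, "os")
    ET.SubElement(os_el, "type", {"arch": "x86_64"}).text = "hvm"
    return ET.tostring(domain, encoding="unicode")


# Placeholder cluster nodes; names and sizes are made up for illustration.
nodes = [
    {"name": "cluster-node-1", "memory_mib": 4096, "vcpus": 2},
    {"name": "cluster-node-2", "memory_mib": 4096, "vcpus": 2},
]
definitions = [build_domain_xml(**node) for node in nodes]
print(definitions[0])
```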

It's slow going, mostly due to the time I spend working on it, but the system is building out mostly as planned.