Automating the creation of the virtual Kubernetes cluster - Part 1

Progress has finally been made on the next step of building out my new server.

It took a while to get to this point, mostly because there's no clear documentation for all the things I wanted to piece together.

Let's recap what I'm trying to do: a year ago I got some new hardware to build a new server, and a couple of months ago I hit a few hiccups in finishing the build-out. This server is to replace my slow and extremely long-in-the-tooth HP Microserver, and to act as a local server and learning playground.

Next up has been figuring out what was needed to automate the creation of the VMs for the Kubernetes cluster (as well as any other VMs I plan on deploying). Now, there's a fairly well-documented set of tools that I used previously for setting up the libvirt bits. I had already set up the storage pools and the networking to allow DHCP on my LAN. Unfortunately, that's where the documentation starts to fall apart. To install a VM, you use the community.libvirt.virt module.

Here, you do the following:

  1. Have a disk already created, and know the path to it.
  2. Have any additional resources you need already created (network, shared disk, etc.).
  3. Have an XML domain definition file you can feed to the playbook.

That's all the information that's given. Items 1 and 2 are pretty obvious; it's a similar situation to deploying a cloud VM with something like Terraform. Unfortunately, that leaves you stuck if you don't know how to make a minimal domain definition file, and it gets even more obfuscated once you try to do this with Fedora CoreOS as the guest OS.
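
For what it's worth, the module call itself is about as minimal as it gets; something roughly like the sketch below, where vm.xml is a placeholder for whatever definition file you supply. The hard part is what goes inside that XML file.

- name: Define the VM from an existing XML definition
  community.libvirt.virt:
    command: define
    xml: "{{ lookup('file', 'vm.xml') }}"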

That changes today!

No seriously, I'm going to walk you through how to make a bare-bones definition file, what you need for it to have Fedora CoreOS as the OS, and a bare-bones ignition file to build your VM.


Libvirt's Domain Definition

I unfortunately was not able to find any real “here's a basic XML file you can use” or “here are the bits you require in the file for it to truly work properly” guide for the VM definition. There is the whole fat domain specification, but that's too much for starting out. I did find a page that was “here, run this role that contains this file”, but that's not good enough for me. So, let's take a look at what I used, and break down what I can.

Let's work our way down.

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>{{item.node_name}}</name>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://fedoraproject.org/coreos/stable"/>
    </libosinfo:libosinfo>
  </metadata>
...
</domain>

<domain> is the top-level tag everything will live under. <name> is the name of the domain (VM); it's how you would reference it with the virsh command. The <metadata> tag is a little less clear; the docs point to it being used differently depending on what application is creating the domain.

<memory unit='MiB'>{{item.vmem}}</memory>
<vcpu placement='static'>{{item.vcpus}}</vcpu>

<memory> is straightforward. It's the allocation of memory being made available to the VM. unit can be changed between KiB, MiB and GiB (maybe more, but I don't have enough RAM on the host to test that 😉) depending on your needs. I still prefer to do it in MiB for the most part. <vcpu> is also straightforward in practice. For bare-bones usage, do it the way I have it: set placement to static and then define the number of cores to be made available. If you're looking to manage NUMA layouts or core passthrough, I strongly recommend checking this section of the specification.

<cpu mode='host-passthrough' check='none' migratable='on'/>
<os>
  <type arch='x86_64' machine='pc-q35-6.2'>hvm</type>
  <boot dev='hd'/>
</os>

<cpu> is more for when you want to emulate a different CPU from what you're running on. For what I'm doing, host-passthrough is the basic mode; it says “pass all the features available on the host CPU through to the guest”. This is also the section you would use if you wanted to emulate an ARM CPU (or an x86 CPU while on ARM). The <os> tag is where you define the boot and machine settings for the guest. I didn't dive deep into it; I figured out that I needed arch set to 'x86_64', machine set to 'pc-q35-6.2' and the inner tag set to hvm, mostly because that's how VMs created with the cockpit-machines package (and virt-install) set it on my system.

<features>
  <acpi/>
  <apic/>
  <vmport state='off'/>
</features>
<clock offset='utc'>
  <timer name='rtc' tickpolicy='catchup'/>
  <timer name='pit' tickpolicy='delay'/>
  <timer name='hpet' present='no'/>
</clock>

<features> is how the hypervisor knows what features to make available to the guest. <acpi> allows for better power management, while <apic> allows interrupts to be handled. Other useful tags would be <pae> for larger memory addressing (useful for 8 GB of RAM on a 32-bit OS) and <hyperv> if you're dealing with Windows guests.

<clock> mimics passing the motherboard's internal clock through to the VM, allowing you to define your own timezone offset based on the time on your system.

<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<pm>
  <suspend-to-mem enabled='no'/>
  <suspend-to-disk enabled='no'/>
</pm>

The <on_poweroff>, <on_reboot>, <on_crash> and <pm> tags define how the host should handle the guest's power states, as well as whether the guest should have the ability to sleep.

Devices

Everything under <devices> is what I decided is needed at a minimum. Some entries, like memballoon, are required for KVM/QEMU guests.

A <disk> entry is required for every disk you want to add to your guest. For each disk you need to give the source location (whether it's a block device, an ISO, a raw file, or a qcow2 image) and an address. Even though I originally wanted to use LVM devices, I chose to switch to qcow2 files for now, at least until I can figure out how to automate LVM devices better (nearly all the tools are designed around qcow2 or IMG files right now, with LVM or other block devices being considered advanced features).
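
For reference, a qcow2-backed disk entry ends up looking something like the sketch below; the file path is a placeholder for whatever your playbook creates.

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/{{item.node_name}}.qcow2'/>
  <target dev='vda' bus='virtio'/>
  <!-- libvirt will auto-assign an <address> if you leave it out -->
</disk>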

<controller> shouldn't be needed since we're doing a basic device layout and libvirt can pick sensible defaults, but I left it in since there was no reason to remove it. <channel> is used for host-guest communication. It's not explicitly required, but it makes things like graceful shutdown and sharing devices or directories with the guest easier.

A network <interface>, like a disk, is defined for each network interface you want added to your guest OS. These are relatively straightforward as well, and the documentation has some great examples of different use cases. I'm using libvirt networks, hence the source uses the network type. If you wanted to, you could predefine the MAC address the interface would use, but if it's not defined, libvirt will generate one automatically.
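
A libvirt-network-backed interface can be as small as this sketch, where 'default' stands in for one of the network names from my vars:

<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <!-- optional: pin the MAC yourself; otherwise libvirt generates one -->
  <!-- <mac address='52:54:00:xx:xx:xx'/> -->
</interface>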

<console> is required if you want to be able to access the VM with virsh console vm-name-here; I chose to keep it as a PTY instead of switching to TTY or virtio-serial. <input> entries for a mouse and keyboard let you interact with the VM over that console. The <video> tag more than likely isn't needed (I haven't played around with leaving it out), but it's definitely required if you want to connect over VNC or SPICE graphics, which is overkill for a headless VM. <rng> is also not needed, but it makes random number generation quicker by passing through the host's entropy source.
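
Roughly what those entries look like; this is a sketch based on libvirt's usual serial console, basic input, and virtio RNG devices, not necessarily byte-for-byte what cockpit or virt-install would generate.

<console type='pty'>
  <target type='serial' port='0'/>
</console>
<input type='mouse' bus='ps2'/>
<input type='keyboard' bus='ps2'/>
<video>
  <model type='virtio'/>
</video>
<rng model='virtio'>
  <backend model='random'>/dev/urandom</backend>
</rng>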

Now, for the pièce de résistance, the magic sauce that gets Fedora CoreOS (or Red Hat CoreOS, or Flatcar Linux) working: define a <qemu:commandline> block and add the -fw_cfg argument, which is passed along to QEMU to inject the ignition file, followed on the second line by name=opt/com.coreos/config,file=/path/to/ignition/file, where /path/to/ignition/file is the path to an ignition file on your system.
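
Put together, that block looks like this; the xmlns:qemu declaration on <domain> at the top of the file is what makes the qemu: elements valid.

<qemu:commandline>
  <qemu:arg value='-fw_cfg'/>
  <qemu:arg value='name=opt/com.coreos/config,file=/path/to/ignition/file'/>
</qemu:commandline>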

Ansible Jinja Templating and Disk Creation

So, if you noticed, in a few places I have things like {{item.node_name}} or {{item.vcpus}} in the example above. That's because I've turned this config into a Jinja template, so I can use the same template for all my cluster VMs. To make life easier, inside my vars file I added something that looks like this:

k8s_node:
  - node_name: master1
    vcpus: 2
    vmem: 4096
    disk:
      pool: bulk
      size: 20
    networks:
      - external
      - default
    is_master: true

In the playbook I have it looping through k8s_node.
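
The loop itself is nothing fancy; a sketch of the shape, where k8s-node.xml.j2 is a placeholder name for the domain template above:

- name: Define the cluster VMs from the Jinja template
  community.libvirt.virt:
    command: define
    xml: "{{ lookup('template', 'k8s-node.xml.j2') }}"
  loop: "{{ k8s_node }}"

- name: Start the cluster VMs
  community.libvirt.virt:
    name: "{{ item.node_name }}"
    state: running
  loop: "{{ k8s_node }}"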

is_master is currently not being used. Right now, it's acting more like a mental placeholder for when I get to actually setting up the cluster install.

For networks, I loop through the list in the template to create an interface block for both external and default, as shown below.
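
In the template that's just a for-loop around the interface block, something like this sketch:

{% for network in item.networks %}
<interface type='network'>
  <source network='{{ network }}'/>
  <model type='virtio'/>
</interface>
{% endfor %}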

I'm not currently doing more than one disk, but I'd handle it similarly to networks after a bit of massaging.

As for the creation of the boot disk, this took a little time to figure out properly. The VM installation of Fedora CoreOS requires the use of the actual VM image, or at least a copy of it. Following this section from the Fedora Docs while looking up how virt-install works, I realized that the backing-store part of the command was creating a copy of the disk and resizing it to the supplied value.

To mimic this behavior, I added two separate tasks to the playbook. The first makes a copy of the base image I fetch; the second resizes it to match the value I have inside the vars for that VM.
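
Those two tasks come out to roughly this; a sketch where the image paths are placeholders for wherever you fetch and store the base image.

- name: Copy the Fedora CoreOS base image for each VM
  ansible.builtin.copy:
    src: /var/lib/libvirt/images/fedora-coreos-base.qcow2   # placeholder path to the fetched base image
    dest: "/var/lib/libvirt/images/{{ item.node_name }}.qcow2"
    remote_src: true
    force: false          # don't clobber an existing disk on re-runs
  loop: "{{ k8s_node }}"

- name: Resize the copy to the size defined in the vars
  ansible.builtin.command: >
    qemu-img resize
    /var/lib/libvirt/images/{{ item.node_name }}.qcow2
    {{ item.disk.size }}G
  loop: "{{ k8s_node }}"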


The Ignition File

An ignition file is what you use to define the setup and configuration of the OS. It's similar to how cloud-init works. In it, you can define users, advanced network config (like static IPs), repos, and disk configurations.

Getting this working as part of the VM is a two-part process: first you write a Butane file, then you generate the ignition file from it.

Butane

Butane is a specific YAML-based config format. You can write the files without anything special, but to validate them and convert them into an ignition file you will need the Butane application. There are various ways to install it, but my preferred way is to run it via podman/docker.
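
If you go the container route too, the invocation from the Butane docs looks roughly like this, assuming your config is saved as config.bu and you want config.ign out:

podman run --interactive --rm quay.io/coreos/butane:release \
  --pretty --strict < config.bu > config.ign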

The simplest Butane config you can create is one that defines just the user and an ssh key.

variant: fcos
version: 1.4.0
passwd:
  users:
    - name: somename
      ssh_authorized_keys:
        - ssh-rsa AAAA...

I'm taking things a step further: I want everything that's required for running something like kubespray to be in place. That means I need the definitions for CRI-O, the Kubernetes repo and a few other things. I didn't figure this out on my own; that credit goes to Carmine.

variant: fcos
version: 1.4.0
passwd:
  users:
    - name: nagelxz
      ssh_authorized_keys:
        - ssh-rsa ...
      home_dir: /home/nagelxz
      password_hash: ...
      groups:
        - wheel
      shell: /bin/bash
kernel_arguments:
  should_exist:
    # Order is significant, so group both arguments into the same list entry.
    - console=tty0 console=ttyS0,115200n8
storage:
  files:
    # CRI-O DNF module
    - path: /etc/dnf/modules.d/cri-o.module
      mode: 0644
      overwrite: true
      contents:
        inline: |
          [cri-o]
          name=cri-o
          stream=1.17
          profiles=
          state=enabled
    # YUM repository for kubeadm, kubelet and kubectl
    - path: /etc/yum.repos.d/kubernetes.repo
      mode: 0644
      overwrite: true
      contents:
        inline: |
          [kubernetes]
          name=Kubernetes
          baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
          enabled=1
          gpgcheck=1
          repo_gpgcheck=1
          gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg
            https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
    # configuring automatic loading of br_netfilter on startup
    - path: /etc/modules-load.d/br_netfilter.conf
      mode: 0644
      overwrite: true
      contents:
        inline: br_netfilter
    # setting kernel parameters required by kubelet
    - path: /etc/sysctl.d/kubernetes.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          net.bridge.bridge-nf-call-iptables=1
          net.ipv4.ip_forward=1
    - path: /etc/hostname
      overwrite: true
      contents:
        inline: master-potato-fries

After you convert that Butane file into an ignition file and run the playbook, it takes about 2–5 minutes for the VMs to be generated and become available on the IPs they're given. There's one additional piece I still need: having Ansible generate the inventory file for the cluster. But I'm leaving that for the tweaks that come in as part of building the next part, where kubespray gets run as part of the playbook.