Reaching the login prompt in 2.5 seconds - a journey

4. Feb 2020 | Fabian Vogt | No License

Not only in development environments it’s very handy to have a quick turnaround time, which can include reboots. Especially for transactional systems where changes to the system only take effect after booting into the new state, this can have a significant impact.

So let’s see what can be done. Remember: “The difference between screwing around and science is writing it down”!

Starting point

Starting point for this experiment is a VM (KVM), 4GiB RAM, 2 CPU cores, no EFI. Tumbleweed was installed as Server (textmode) with just defaults.

# systemd-analyze
Startup finished in 1.913s (kernel) + 2.041s (initrd) + 22.104s (userspace) = 25.958s

Almost 26 seconds just to get to the login prompt of a pretty minimal system, that’s not great. What can we do?

Low-hanging fruit

systemd-analyze blame tells us what the worst offenders are:

# systemd-analyze blame --no-pager
18.769s btrfsmaintenance-refresh.service    
17.027s wicked.service                      
 3.170s plymouth-quit.service               
 3.170s plymouth-quit-wait.service          
 1.078s postfix.service                     
 1.023s apparmor.service                    
  839ms systemd-udev-settle.service         
  601ms systemd-logind.service              
  532ms firewalld.service

btrfsmaintenance-refresh.service is a bit special: It calls systemctl during execution to enable/disable and start/stop the btrfs-*.timer units. Those depend on time-sync.target, which itself needs network.service through chronyd.service. wicked.service is the next item on the list. Before the unit is considered active, it tries to fully configure and setup all configured interfaces, which includes DHCPv4 and v6 by default. This is directly used as state for network.service and thus network.target. There is no distinction between network.service and network-online.target by wicked. To make the bootup quicker, switching to NetworkManager is an option, which interprets network.service in a more async way and thus is much quicker to reach the active state. Note that with DHCP, switching between wicked and NM might result in a different IP address!

# zypper install NetworkManager
# systemctl disable wicked
Removed /etc/systemd/system/multi-user.target.wants/wicked.service.
Removed /etc/systemd/system/network.service.
Removed /etc/systemd/system/network-online.target.wants/wicked.service.
Removed /etc/systemd/system/dbus-org.opensuse.Network.Nanny.service.
Removed /etc/systemd/system/dbus-org.opensuse.Network.AUTO4.service.
Removed /etc/systemd/system/dbus-org.opensuse.Network.DHCP4.service.
Removed /etc/systemd/system/dbus-org.opensuse.Network.DHCP6.service.
# systemctl enable NetworkManager
Created symlink /etc/systemd/system/network.service → /usr/lib/systemd/system/NetworkManager.service.

Let’s also remove plymouth - except for eyecandy it does not provide any useful features.

# zypper rm -u plymouth
Reading installed packages...
Resolving package dependencies...

The following 23 packages are going to be REMOVED:
  gnu-unifont-bitmap-fonts libdatrie1 libdrm2 libfribidi0 libgraphite2-3 libharfbuzz0 libpango-1_0-0 libply5 libply-boot-client5 libply-splash-core5 libply-splash-graphics5 libthai0 libthai-data libXft2 plymouth plymouth-branding-openSUSE
  plymouth-dracut plymouth-plugin-label plymouth-plugin-label-ft plymouth-plugin-two-step plymouth-scripts plymouth-theme-bgrt plymouth-theme-spinner

23 packages to remove.
After the operation, 4.8 MiB will be freed.
Continue? [y/n/v/...? shows all options] (y):
...

Plymouth is still started in the initrd, but as it’s not part of the root filesystem anymore it’s not stopped by plymouth-quit.service. This combination would result in a broken boot! Normally, the initrd should be regenerated automatically if anything relevant changes, but for removals that’s not implemented (boo#966057).

# mkinitrd
...
# reboot
...

Let’s see how much time this saved:

# systemd-analyze
Startup finished in 1.675s (kernel) + 2.066s (initrd) + 2.696s (userspace) = 6.438s

Over 19s saved, that’s quite a lot already!

# systemd-analyze blame --no-pager
1.411s btrfsmaintenance-refresh.service    
893ms systemd-logind.service              
849ms apparmor.service

Now the biggest contributor to the bootup time is btrfsmaintenance-refresh.service. As it is not quite clear how much value it provides with a recent kernel (boo#1063638#c106), let’s just remove it.

# zypper rm -u btrfsmaintenance
# reboot
...
# systemd-analyze
Startup finished in 1.700s (kernel) + 2.010s (initrd) + 2.367s (userspace) = 6.079s
multi-user.target reached after 2.347s in userspace

That’s a bit better again.

# systemd-analyze blame --no-pager
873ms apparmor.service                    
550ms systemd-logind.service              
504ms postfix.service                     
405ms firewalld.service

Both apparmor and systemd-logind.service are needed, so no low hanging fruit remain.

Accelerating early boot

There’s still one part remaining in the boot time equation we can completely eliminate!

Startup finished in 1.700s (kernel) + 2.010s (initrd) + 2.367s (userspace) = 6.079s
                                      ^^^^^^^^^^^^^^^

So, what exactly is the initrd for anyway? In the vast majority of installations, it’s very clearly defined: Mount the real root filesystem and switch to it. Depending on the configuration, this can range from simple (local ext4) to very complex (encrypted block device over the network accepting the password over ssh). Additionally, the kernel binary as loaded by the bootloader is very small, so does not include drivers for every system. Those are part of modules included in the initrd.

Turns out, in simple cases (the majority of VM guest systems) we can boot without initrd just fine. The current setup is not tuned for this setup though, so a few adjustments are required.

Drivers for mounting / without loading modules

The kernel needs drivers for both the virtual devices connecting to the storage device and for the filesystem on it. The former part is dealt with by using the kernel-kvmsmall flavor, but unfortunately it does not have btrfs built-in.

Fortunately, this is easy to fix by rebuilding the kernel with a custom config. By putting CONFIG_FS_BTRFS=y (and some other required options) into config.addon.tar.bz2 next to the standard openSUSE kernel in OBS, it spits out an .rpm with a working binary.

# zypper ar obs://devel:kubic:quickboot/ devel:kubic:quickboot
# zypper in --from devel:kubic:quickboot kernel-kvmsmall

kernel-kvmsmall does not have all kernel features enabled (not even as modules), which means that in some cases it might be necessary to apply the changes on kernel-default instead, which has a complete set of modules.

Mounting root by UUID

However, if you now reboot and comment out the initrd command in the grub config, you will notice that the boot fails as the kernel is unable to find the root device. This is because by default, the GRUB configuration uses root=UUID=deadbeef-1234... as parameter. This is interpreted by the initrd in userspace. To be exact, when a block device is recognized by the kernel, Udev reacts by reading the filesystem UUID and creating a link in /dev/disk/by-uuid/... which is then used as root device. Without an initrd, that does not happen and the kernel is unable to continue.

Workaround here is to set GRUB_DISABLE_LINUX_UUID=true in /etc/default/grub. This means that device paths like root=/dev/vda2 are used, which can lead to issues when changing the disk layout or order. By additionally setting GRUB_DISABLE_LINUX_PARTUUID=false, it uses root=PARTUUID=cafebabe-4554... which is supported by the kernel as well, but is more reliable.

# echo GRUB_DISABLE_LINUX_UUID=true >> /etc/default/grub
# echo GRUB_DISABLE_LINUX_PARTUUID=false >> /etc/default/grub
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot
... (comment out the initrd call in grub, after pressing "e" in the menu and prepending "#" in the last line)
# systemd-analyze
Startup finished in 1.778s (kernel) + 2.725s (userspace) = 4.504s

Almost a third shaved off again, awesome! However, there are now error messages shown in the console, which is not that awesome. Something about systemd-gpt-auto-generator and systemd-remount-fs being unable to find the root device - just like the kernel earlier. The cause is actually the same - /etc/fstab still contains the mounts in UUID= format and those errors happen before systemd-udevd.service is started and udev has settled.

No matter how systemd is configured, it’s not possible to get rid of the first error - generators run before any units. So we have to start udev even before that - before systemd!

But first, a quick detour.

Getting transactional

Let’s take a look at how MicroOS is doing. As the name already says, it’s supposed to be lighter than plain Tumbleweed out of the box.

# systemd-analyze 
Startup finished in 1.788s (kernel) + 2.036s (initrd) + 21.243s (userspace) = 25.068s

# systemd-analyze blame --no-pager
17.669s btrfsmaintenance-refresh.service
16.177s wicked.service
 3.377s apparmor.service
 1.356s health-checker.service                
 1.179s systemd-udev-settle.service
  968ms systemd-logind.service
  811ms kdump.service

A second quicker, ok. Plymouth is gone, but we gained health-checker and kdump. The time is dominated by wicked slowing down the startup though, so let’s replace it. Additionally, disable btrfsmaintenance-refresh.service. It’s not possible to remove it as the microos_base pattern requires it.

# transactional-update shell
transactional update # zypper install NetworkManager
transactional update # systemctl disable wicked
transactional update # systemctl enable NetworkManager
transactional update # systemctl disable btrfsmaintenance-refresh.service
transactional update # exit
# reboot
...
# systemd-analyze 
Startup finished in 1.744s (kernel) + 1.989s (initrd) + 2.342s (userspace) = 6.075s

# systemd-analyze blame
1.251s apparmor.service                    
1.066s kdump.service                       
 824ms NetworkManager-wait-online.service  
 742ms systemd-logind.service              
 730ms kdump-early.service                 
 638ms systemd-udevd.service               
 563ms create-dirs-from-rpmdb.service

Much better again.

Booting a read only system without initrd

In a system with a read-only root filesystem like MicroOS (or transactional server), the initrd has another task: Make sure that /var and /etc are mounted already, so that early boot can store logs and read configuration.

So we actually have to mount /var and /etc before starting systemd. How? By having our own init script! It is started directly by the kernel by setting init=/sbin/init.noinitrd and as last step just does exec /sbin/init to replace itself as PID 1 with systemd.

Unfortunately it’s not quite as easy as just doing mount /var and calling it a day, as the mount for /var uses a UUID= as source, so it needs udev running… Luckily, udev actually works in that environment, after mounting /sys, /proc and /run manually.

Here the circle closes - we have udev running before systemd now. So by just using the script we need for read-only systems everywhere, that issue is solved too.

Making initrd-less booting simpler

As the setup for initrd-less booting is quite complex, there’s now a package which does the needed setup automatically (except for installing a suitable kernel).

This contains the needed “pre-init” wrapper script /sbin/init.noinitrd as well as a grub configuration module which automatically adds entries to boot the system without initrd. Those are only generated for kernels which have support for the root filesystem built-in. It takes care of setting root/rootflags and init parameters properly as well. The boot options with initrd are still there, as failsafe.

# zypper ar obs://devel:kubic:quickboot/openSUSE_Tumbleweed devel:kubic:quickboot
# transactional-update initrd shell pkg in --from devel:kubic:quickboot kernel-kvmsmall noinitrd
...
transactional update # grub2-set-default 0
transactional update # exit
# reboot
...

# systemd-analyze 
Startup finished in 1.889s (kernel) + 2.246s (userspace) = 4.135s

# systemd-analyze blame
1.022s apparmor.service                    
 847ms kdump.service                       
 820ms NetworkManager-wait-online.service  
 782ms systemd-logind.service              
 608ms kdump-early.service                 
 542ms dev-vda3.device

Just over 4 seconds! That the /var partition is now part of the top six on systemd-analyze means that we’re getting close to the limits.

An issue with issue-generator

Booting without an initrd seems to have introduced a bug: Instead of showing the active network interface enp1s0 and its addresses, there’s just a lonely eth0: on the login screen. Checking the journal, this is because the interface got renamed from eth0 to enp1s0 during boot. Usually this happens when udev runs in the initrd already, which means that after switch-root there’s an add event with the new name already, which issue-generator picks up. Without initrd, the rename happens in the booted system and issue-generator has to handle that somehow.

How can this be implemented? To find out which udev events are triggered by a rename, the udev monitor is very helpful. With the --property option it shows which properties are attached to the triggered events:

# udevadm monitor --udev --property &
# ip link set lo down
# ip link set lo name lonew
# ip link set lonew name lo
...
UDEV  [630.458943] move     /devices/virtual/net/lo (net)
ACTION=move
DEVPATH=/devices/virtual/net/lo
SUBSYSTEM=net
DEVPATH_OLD=/devices/virtual/net/lonew
INTERFACE=lo
IFINDEX=1
SEQNUM=3616
USEC_INITIALIZED=1054433
ID_NET_LINK_FILE=/usr/lib/systemd/network/99-default.link
SYSTEMD_ALIAS=/sys/subsystem/net/devices/lonew /sys/subsystem/net/devices/lonew
TAGS=:systemd:
# fg
^C
# ip link set lo up

So using the DEVPATH_OLD and INTERFACE properties, the rename can be implemented in issue-generator’s udev rules:

After applying those changes and a reboot, enp1s0 is shown properly now!

Optimizing apparmor.service

Those who paid attention to the output of systemd-analyze blame will have noticed that compared to plain Tumbleweed, apparmor.service takes longer to start on MicroOS. Why is that so?

# systemd-analyze plot > plot.svg

systemd-analyze plot result

In this plot it’s possible to guess which explicit and also implicit dependencies there are between services. If a service starts after a different service ends, it’s probably an explicit dependency. If a service is only up after a different service started, it’s probably waiting for it in some way. That’s what we can see in the plot: apparmor.service only finished startup after create-dirs-from-rpmdb.service started up. So getting that to start earlier or quicker would accelerate apparmor.service as well. To confirm this theory, just disable the service:

# systemctl disable create-dirs-from-rpmdb.service
# reboot
# systemd-analyze blame
927ms systemd-logind.service              
824ms NetworkManager-wait-online.service  
721ms kdump.service                       
689ms kdump-early.service                 
554ms apparmor.service
# systemctl enable create-dirs-from-rpmdb.service

Confirmed. So how can this be optimized properly?

The purpose of this service is to create directories owned by packages, which are not part of the system snapshots. It orders itself between local-fs.target and systemd-tmpfiles-setup.service. Some tmpfiles.d files might rely on packaged directories being present, so it has to run before systemd-tmpfiles-setup.service. Except for changing it to RequiresMountsFor=/var /opt /srv there isn’t much potential for optimization.

However, instead of running on every boot, the service only has to be active if the set of packages changed. Luckily, with rpm 4.15, a new method to ease such checks (rpmdbCookie) got implemented and it was easy to make use of it in the service. With this deployed, it only runs when necessary and otherwise just takes some time to get the cookie from the rpm database:

# systemd-analyze blame
872ms NetworkManager-wait-online.service  
832ms systemd-logind.service              
811ms kdump.service                       
645ms dev-vda3.device                     
597ms kdump-early.service                 
526ms apparmor.service
# systemd-analyze blame | grep create-dirs-from-rpmdb
 52ms create-dirs-from-rpmdb.service

For some reason this doesn’t always help though, sometimes apparmor.service is back at >1s, so this needs some more investigation.

rebootmgr

On top of the blame list we have NetworkManager-wait-online.service. This service can take a variable amount of time depending on the network configration and the environment and is in most cases not needed for getting services up and running. So what is currently pulling that into multi-user.target?

# systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

multi-user.target @2.472s
└─rebootmgr.service @2.284s +24ms
  └─network-online.target @2.282s
    └─NetworkManager-wait-online.service @1.311s +970ms
      └─NetworkManager.service @1.246s +61ms

It’s rebootmgr.service! The reason it orders itself After=network-online.target is that it can directly communicate with etcd. However, support for that is currently disabled in rebootmgr anyway and it appears to handle the case with no network connection on start just fine. So until that change ends up in the package, let’s just adjust that manually:

# systemctl edit --full rebootmgr.service
(Remove lines with network-online.target)
# reboot

Note that this doesn’t really improve the perceived speed of booting as only multi-user.target itself depended on it and sshd/getty are started before that already. The critical chain to multi-user.target is now:

# systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

multi-user.target @2.188s
└─kdump.service @1.287s +899ms
  └─NetworkManager.service @1.202s +82ms
    └─dbus.service @1.197s
      └─basic.target @1.195s
        └─sockets.target @1.195s
          └─dbus.socket @1.195s

Getting close to the edge

It’s time to get creative now - what’s left for optimizing? Everything after this is arguably a hack, much more than the previous changes.

Disabling non-essential services

Let’s just disable everything which isn’t actually needed to get the system up.

apparmor.service: Used for system hardening. If the system is not security relevant (e.g. an isolated VM), this can be disabled. Not recommended though.

rebootmgr.service: If a reboot is scheduled (e.g. by the automatic transactional-update.timer), it triggers an automatic reboot in the configured timeframe (default 3:30am). If the system is rebooted manually, this can be disabled.

kdump.service: Loads a kernel and initrd for kernel coredumping into RAM. Unless the system is a highly critical production machine and every crash has to be analyzed, this can be disabled.

With those changes applied:

# systemd-analyze 
Startup finished in 1.899s (kernel) + 1.505s (userspace) = 3.405s
# systemd-analyze blame
629ms systemd-journald.service            
532ms systemd-logind.service              
480ms dev-vda3.device                     
479ms dev-vda2.device                     
303ms systemd-hostnamed.service

Over a second saved again. systemd-analyze tells us that there’s not much left in userspace to optimize.

Kernel configuration

Currently the kernel takes quite a long time during boot for benchmarking some algorithms. This is done so that it knows which of the available implementations is the quickest on the system it’s running on. Ironically, this means that on CPUs with new features it actually takes a bit longer. If performance for RAID6 is not important, this can simply be disabled by setting CONFIG_RAID6_PQ_BENCHMARK=n

After building such a kernel and installing it:

# systemd-analyze 
Startup finished in 1.083s (kernel) + 1.356s (userspace) = 2.439s

This saves almost a second during kernel startup.

Direct kernel boot

The bootloader takes some time during boot as well (boot menu, kernel loading), which can be optimized as well. Apart from the obvious option to decrease the time the boot menu is shown (or hiding it by default), it’s also possible to skip it altogether! By supplying the kernel and cmdline from the VM host, booting can be even quicker. This is only a good idea in setups where there is a custom kernel build with everything built-in though, as otherwise it may happen that the modules in the VM get out of sync with the kernel image supplied by the VM host. This also breaks automatic rollbacks (health-checker needs GRUB for that) and selecting old snapshots to boot from.

I’m not aware of a way to measure the time it takes to load the kernel, so no measurement here. This mostly removes the time which systemd-analyze doesn’t show (on non-EFI systems).

Conclusion

Getting from 25.958s to 2.439s (with hacks) means that over 90% of boot time can be optimized away.

The next task is to push those optimizations into the distro and make them the default or at least easy to apply.

Have a lot of fun!

Categories: blog

Tags: