rebootmgrd, rebootmgr.service, org.opensuse.RebootMgr.conf — Reboot the machine during a maintenance window.
/usr/sbin/rebootmgrd
[ --debug | --help | --version ]
/usr/lib/systemd/system/rebootmgr.service
/etc/dbus-1/system.d/org.opensuse.RebootMgr.conf
To prevent a whole cluster, or a set of machines with the same task, from rebooting at the same time, rebootmgrd reboots the machine according to configured policies.
rebootmgr supports different strategies that determine when a reboot is performed:
instantly: When the reboot signal arrives, other services are informed that a reboot is planned, and the reboot is performed without acquiring any locks or waiting for a maintenance window.
maint-window: Reboot only during a specified maintenance window. If no window is specified, reboot immediately.
etcd-lock: Acquire a lock at etcd for the specified lock-group before reboot. If a maintenance window is specified, acquire the lock only during this window. If taking the lock takes longer than the duration of the maintenance window, the reboot is canceled and an error is logged. This option is only available if rebootmgrd was compiled with etcd support.
best-effort: This is the default. If etcd is running, use etcd-lock. If no etcd is running, but a maintenance window is specified, use maint-window. If no maintenance window is specified, reboot immediately (instantly).
off: rebootmgr continues to run, but ignores all signals to reboot. Setting the strategy to off does not clear the maintenance window. If rebootmgr is enabled again, it continues to use the previously specified maintenance window.
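The fallback order of the default strategy can be sketched as a small decision function. This is only an illustration of the rule described above, not code from rebootmgrd; the function name and its two boolean parameters are assumptions made for the example.

```python
def resolve_default_strategy(etcd_running: bool, window_configured: bool) -> str:
    """Return the concrete strategy the default strategy falls back to.

    Hypothetical helper: both flags are assumed to be determined elsewhere.
    """
    if etcd_running:
        return "etcd-lock"      # etcd is available: coordinate via locks
    if window_configured:
        return "maint-window"   # no etcd, but a maintenance window exists
    return "instantly"          # neither: reboot immediately


if __name__ == "__main__":
    for etcd in (True, False):
        for window in (True, False):
            print(etcd, window, resolve_default_strategy(etcd, window))
```

Note that a running etcd takes precedence over the maintenance window; the window is still honored in that case, because etcd-lock acquires the lock only during the window.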
The reboot strategy can be configured via rebootmgr.conf(5) and adjusted at runtime via rebootmgrctl(1). These changes are written to the configuration file and survive the next reboot.
To make sure that not all machines reboot at the same time, the
machines can be sorted into groups. The number of machines in a group
that are allowed to reboot at the same time can be configured and
controlled via etcd. For example, you can create a group "etcd_server",
which contains all machines running etcd, and specify that only one
etcd server is allowed to reboot at a time, and a second group
"worker", in which a higher number of machines are allowed to reboot at
the same time.
The etcd path to the directory containing data for a group is
"/opensuse.org/rebootmgr/locks/<group>/". This directory contains two
variables: "mutex", which is "0" by default and can be set to "1" via
atomic_compare_and_swap to make sure that only one machine has write
access, and "data", which contains the following JSON structure:
{ "max":1, "holders":[] }
"holders" contains the unique IDs of the machines holding a lock. The
machine ID from /etc/machine-id is used as the unique ID.
So a record containing two locks out of 10 possible would look like:
{ "max":10, "holders":[ "3cb8c701b4d3474d99a7e88b31dd3439", "71c8efe539b280af2fe09b3b5771345e" ] }
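As a small illustration of how such a record can be read, the following sketch parses the "data" value and computes how many locks are still free. The function name free_locks is an assumption for this example; only the "max" and "holders" fields come from the structure shown above.

```python
import json


def free_locks(record: str) -> int:
    """Return the number of locks still available in a "data" record."""
    data = json.loads(record)
    # Never report a negative count, even if "max" was lowered later.
    return max(data["max"] - len(data["holders"]), 0)


record = (
    '{ "max":10, "holders":[ "3cb8c701b4d3474d99a7e88b31dd3439", '
    '"71c8efe539b280af2fe09b3b5771345e" ] }'
)
print(free_locks(record))  # 8
```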
A typical workflow of a client that tries to reboot looks like this:
check that there are free locks; otherwise watch the data variable until it changes
get the mutex
add our machine ID to the list of machines holding a lock
release the mutex
reboot
on boot, check if we hold a lock. If yes:
    get the mutex
    remove the machine ID from the list
    release the mutex
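The workflow above can be sketched against an in-memory stand-in for the etcd directory of one lock group. This is a minimal illustration under stated assumptions, not the rebootmgrd implementation: FakeStore, compare_and_swap, acquire_lock and release_lock are hypothetical names, no real etcd client is used, and watching the "data" variable is reduced to returning False so that the caller can retry.

```python
import json


class FakeStore:
    """Hypothetical in-memory stand-in for one group's etcd directory."""

    def __init__(self, max_locks: int) -> None:
        self.values = {
            "mutex": "0",
            "data": json.dumps({"max": max_locks, "holders": []}),
        }

    def compare_and_swap(self, key: str, expected: str, new: str) -> bool:
        # Succeeds only if the current value still equals `expected`.
        if self.values[key] != expected:
            return False
        self.values[key] = new
        return True


def _update_holders(store: FakeStore, change) -> bool:
    """Get the mutex, apply `change` to the record, release the mutex."""
    if not store.compare_and_swap("mutex", "0", "1"):
        return False  # another machine holds the mutex; retry later
    try:
        record = json.loads(store.values["data"])
        changed = change(record)
        if changed:
            store.values["data"] = json.dumps(record)
        return changed
    finally:
        store.compare_and_swap("mutex", "1", "0")


def acquire_lock(store: FakeStore, machine_id: str) -> bool:
    """Add machine_id to the holders if a lock is free (done before reboot)."""
    def add(record: dict) -> bool:
        if machine_id in record["holders"]:
            return True
        if len(record["holders"]) >= record["max"]:
            return False  # no free lock; a real client would watch "data"
        record["holders"].append(machine_id)
        return True
    return _update_holders(store, add)


def release_lock(store: FakeStore, machine_id: str) -> bool:
    """Remove machine_id from the holders (done on boot if we hold a lock)."""
    def remove(record: dict) -> bool:
        if machine_id not in record["holders"]:
            return False
        record["holders"].remove(machine_id)
        return True
    return _update_holders(store, remove)


if __name__ == "__main__":
    store = FakeStore(max_locks=1)
    print(acquire_lock(store, "3cb8c701b4d3474d99a7e88b31dd3439"))  # True
    print(acquire_lock(store, "71c8efe539b280af2fe09b3b5771345e"))  # False
    print(release_lock(store, "3cb8c701b4d3474d99a7e88b31dd3439"))  # True
```

The free-lock check is repeated while the mutex is held, since another machine may have taken the last lock between the initial check and the moment the mutex was acquired.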