<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>my tech blog &#187; Virtualization</title>
	<atom:link href="http://billauer.se/blog/category/virtualization/feed/" rel="self" type="application/rss+xml" />
	<link>https://billauer.se/blog</link>
	<description>Anything I found worthy to write down.</description>
	<lastBuildDate>Thu, 12 Mar 2026 11:36:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>Measuring how much RAM a Linux service eats</title>
		<link>https://billauer.se/blog/2024/12/cgroup-peak-ram-consumption/</link>
		<comments>https://billauer.se/blog/2024/12/cgroup-peak-ram-consumption/#comments</comments>
		<pubDate>Fri, 13 Dec 2024 17:20:02 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Server admin]]></category>
		<category><![CDATA[systemd]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=7142</guid>
		<description><![CDATA[Introduction Motivation: I wanted to move a service to another server that is dedicated only to that service. But how much RAM does this new server need? RAM is $$$, so too much is a waste of money, too little means problems. The method is to run the service and expose it to a scenario [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>Motivation: I wanted to move a service to another server that is dedicated only to that service. But how much RAM does this new server need? RAM is $$$, so too much is a waste of money, too little means problems.</p>
<p>The method is to run the service and expose it to a scenario that causes it to consume RAM. And then look at the maximal consumption.</p>
<p>This can be done with &#8220;top&#8221; and similar programs, but those show the current use, whereas I needed the maximal RAM use. Besides, a service may spread out its RAM consumption across several processes, and it&#8217;s the cumulative consumption that is interesting.</p>
<p>The appealing solution is to use the fact that systemd creates a cgroup for the service. The answer hence lies in the RAM consumption of the cgroup as a whole. It&#8217;s also possible to create a dedicated cgroup and run a program within that one, as shown in <a rel="noopener" href="https://billauer.se/blog/2016/05/linux-cgroups-swap-quartus/" target="_blank">another post of mine</a>.</p>
<p>This method is somewhat crude, because this memory consumption includes disk cache as well. In other words, this method shows how much RAM is consumed when there&#8217;s plenty of memory, and hence when there&#8217;s no pressure to reclaim any RAM. Therefore, if the service runs on a server with less RAM (or the service&#8217;s RAM consumption is limited in the systemd unit file), it&#8217;s more than possible that everything will work just fine. It might run somewhat slower due to disk access that was previously substituted by the cache.</p>
<p>So using a server with as much memory as measured by the test described below (plus some extra for the OS itself) will result in quick execution, but it might be OK to go for less RAM. A tight RAM limit will cause a lot of disk activity at first, and only afterwards will processes be killed by the OOM killer.</p>
<h3>Where the information is</h3>
<p>All said in this post relates to Linux kernel v4.15. Things are different with later kernels, not necessarily for the better.</p>
<p>There are in principle two versions of the interface with cgroup&#8217;s memory management: First, the one I won&#8217;t use, which is <a rel="noopener" href="https://www.kernel.org/doc/Documentation/cgroup-v2.txt" target="_blank">cgroup-v2</a> (or maybe <a rel="noopener" href="https://docs.kernel.org/admin-guide/cgroup-v2.html" target="_blank">this doc</a> for v2 is better?). The sysfs files for this interface for a service named &#8220;theservice&#8221; reside in /sys/fs/cgroup/unified/system.slice/theservice.service.</p>
<p>I shall be working with the <a rel="noopener" href="https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt" target="_blank">memory control of cgroup-v1</a>. The sysfs files in question are in /sys/fs/cgroup/memory/system.slice/theservice.service/.</p>
<p>If /sys/fs/cgroup/memory/ doesn&#8217;t exist, it might be necessary to mount it explicitly. Also, if system.slice doesn&#8217;t exist under /sys/fs/cgroup/memory/ it&#8217;s most likely because systemd&#8217;s memory accounting is not in action. This can be enabled globally, or by setting MemoryAccounting=true on the service&#8217;s systemd unit (or maybe any unit?).</p>
<p>Speaking of which, it might be a good idea to set MemoryMax in the service&#8217;s systemd unit in order to see what happens when the RAM is really restricted. Or change the limit dynamically, as shown below.</p>
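<p>For reference, such a limit could be set with a systemd drop-in file. The path below is only an illustration, not taken from my actual setup:</p>

```
# /etc/systemd/system/theservice.service.d/memory.conf (illustrative path)
[Service]
MemoryAccounting=true
MemoryMax=40M
```

<p>After adding such a file, &#8220;systemctl daemon-reload&#8221; followed by a restart of the service puts it into effect. On older systemd versions the directive is MemoryLimit= rather than MemoryMax=.</p>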
<p>And there&#8217;s always the alternative of creating a separate cgroup and running the service in that group. I&#8217;ll refer to <a rel="noopener" href="https://billauer.se/blog/2016/05/linux-cgroups-swap-quartus/" target="_blank">my own blog post</a> again.</p>
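<p>To double-check which cgroup the service actually belongs to, systemd can be asked directly with &#8220;systemctl show -p ControlGroup theservice.service&#8221;. The cgroup-v1 memory directory used below can also simply be composed from the service name, as in this trivial sketch:</p>

```shell
# Compose the cgroup-v1 memory directory for a systemd service.
# "theservice" is the placeholder name used throughout this post.
SERVICE=theservice
CGDIR="/sys/fs/cgroup/memory/system.slice/${SERVICE}.service"
echo "$CGDIR"
```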
<h3>Getting the info</h3>
<p>All files mentioned below are in /sys/fs/cgroup/memory/system.slice/theservice.service/ (assuming that the systemd service in question is theservice).</p>
<p><strong>The maximal memory used</strong>: memory.max_usage_in_bytes. As its name implies, this is the maximal amount of RAM used, measured in bytes. This includes disk cache, so the number is higher than what appears in &#8220;top&#8221;.</p>
<p><strong>The memory currently used</strong>: memory.usage_in_bytes.</p>
<p>For more detailed info about memory use: memory.stat. For example:</p>
<pre>$ <strong>cat memory.stat </strong>
<span class="punch">cache 1138688</span>
rss 4268224512
rss_huge 0
shmem 0
mapped_file 516096
dirty 0
writeback 0
pgpgin 36038063
pgpgout 34995738
pgfault 21217095
pgmajfault 176307
inactive_anon 0
active_anon 4268224512
inactive_file 581632
active_file 401408
unevictable 0
hierarchical_memory_limit 4294967296
total_cache 1138688
total_rss 4268224512
total_rss_huge 0
total_shmem 0
total_mapped_file 516096
total_dirty 0
total_writeback 0
total_pgpgin 36038063
total_pgpgout 34995738
total_pgfault 21217095
total_pgmajfault 176307
total_inactive_anon 0
total_active_anon 4268224512
total_inactive_file 581632
total_active_file 401408
total_unevictable 0</pre>
<p>Note the &#8220;cache&#8221; part at the beginning. It&#8217;s no coincidence that it&#8217;s first. That&#8217;s the most important part: How much can be reclaimed just by flushing the cache.</p>
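<p>The interesting figures can be pulled out with a one-liner. This sketch (not from my actual setup) parses memory.stat-style data; the inlined sample stands in for the real sysfs file, which would be read with &#8220;cat&#8221; on a real system:</p>

```shell
# Pull rss and cache out of memory.stat-formatted data. The sample data
# below stands in for the real file at
# /sys/fs/cgroup/memory/system.slice/theservice.service/memory.stat
stat_data='cache 1138688
rss 4268224512'

rss=$(printf '%s\n' "$stat_data" | awk '$1 == "rss" { print $2 }')
cache=$(printf '%s\n' "$stat_data" | awk '$1 == "cache" { print $2 }')

echo "rss:   $rss bytes"
echo "cache: $cache bytes (reclaimable by flushing the page cache)"
```

<p>With the figures above, the service proper (rss) uses about 4 GiB, while only about 1 MiB is cache.</p>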
<p>On a 6.1.0 kernel, I&#8217;ve seen memory.peak and memory.current instead of memory.max_usage_in_bytes and memory.usage_in_bytes. memory.peak wasn&#8217;t writable, however: its permissions didn&#8217;t allow it, and attempting to write to it failed anyway. So it wasn&#8217;t possible to reset the max level.</p>
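<p>Where the peak file can&#8217;t be reset, a crude userspace workaround is to poll the current-usage file and track the maximum yourself. This is only a sketch: the path is the placeholder service&#8217;s, the loop is bounded (and the sleep commented out) so that it terminates, and polling can of course miss a short spike between samples:</p>

```shell
# Track the peak of memory.current (or memory.usage_in_bytes) by polling.
# USAGE_FILE is an assumed placeholder path; a missing file reads as 0.
USAGE_FILE="${USAGE_FILE:-/sys/fs/cgroup/system.slice/theservice.service/memory.current}"
max=0
i=0
while [ "$i" -lt 5 ]; do            # real use: loop for the whole test scenario
    cur=$(cat "$USAGE_FILE" 2>/dev/null || echo 0)
    if [ "$cur" -gt "$max" ]; then
        max=$cur
    fi
    i=$((i + 1))
    # sleep 1                       # sample once per second
done
echo "observed peak: $max bytes"
```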
<h3>Setting memory limits</h3>
<p>It&#8217;s possible to set memory limits in systemd&#8217;s unit file, but it can be more convenient to do this on the fly. In order to set the hard limit of memory use to 40 MiB, go (as root)</p>
<pre># echo 40M &gt; memory.limit_in_bytes</pre>
<p>To disable the limit, pick an unreasonably high number, e.g.</p>
<pre># echo 100G &gt; memory.limit_in_bytes</pre>
<p>Note that restarting the systemd service has no effect on these parameters (unless a memory limit is specified in the unit file). The cgroup directory remains intact.</p>
<h3>Resetting between tests</h3>
<p>To reset the maximal value that has been recorded for RAM use (as root)</p>
<pre># echo 0 &gt; memory.max_usage_in_bytes</pre>
<p>But to really start from fresh, all disk cache needs to be cleared as well. The sledgehammer way is going</p>
<pre># echo 1 &gt; /proc/sys/vm/drop_caches</pre>
<p>This frees the page caches system-wide, so everything running on the computer will need to re-read things again from the disk. There&#8217;s a slight and temporary global impact on the performance. On a GUI desktop, it gets a bit slow for a while.</p>
<p>A message like this will appear in the kernel log in response:</p>
<pre>bash (43262): drop_caches: 1</pre>
<p>This is perfectly fine, and indicates no error.</p>
<p>Alternatively, set a low limit for the RAM usage with memory.limit_in_bytes, as shown above. This impacts the cgroup only, forcing a reclaim of disk cache.</p>
<p>Two things that have <strong>no effect</strong>:</p>
<ul>
<li>Reducing the soft limit (memory.soft_limit_in_bytes). This limit is relevant only when the system is in a shortage of RAM overall. Otherwise, it does nothing.</li>
<li>Restarting the service with systemd. It wouldn&#8217;t make any sense to flush a disk cache when restarting a service.</li>
</ul>
<p>It&#8217;s of course a good idea to get rid of the disk cache before clearing memory.max_usage_in_bytes, so the max value starts without taking the disk cache into account.</p>
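<p>Putting the section together, a reset between two test runs could look like the sketch below. It operates on a throwaway mock directory so the sequence can be shown without root; on a real system, point CGDIR at the service&#8217;s actual cgroup directory and run as root:</p>

```shell
# Reset sequence between measurements. A mock directory stands in for
# /sys/fs/cgroup/memory/system.slice/theservice.service on a real system.
CGDIR=$(mktemp -d)
echo 123456789 > "$CGDIR/memory.max_usage_in_bytes"   # pretend-peak from a previous run

# 1. Drop the page cache system-wide first (real use, as root):
#    echo 1 > /proc/sys/vm/drop_caches
# 2. Then clear the recorded peak, so it starts without the cache's contribution:
echo 0 > "$CGDIR/memory.max_usage_in_bytes"

cat "$CGDIR/memory.max_usage_in_bytes"
```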
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2024/12/cgroup-peak-ram-consumption/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Installing GRUB 2 manually with rescue-like techniques</title>
		<link>https://billauer.se/blog/2024/07/installing-grub-rescue/</link>
		<comments>https://billauer.se/blog/2024/07/installing-grub-rescue/#comments</comments>
		<pubDate>Fri, 12 Jul 2024 15:16:20 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Linux kernel]]></category>
		<category><![CDATA[Server admin]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=7099</guid>
		<description><![CDATA[Introduction It&#8217;s rarely necessary to make an issue of installing and maintaining the GRUB bootloader. However, for reasons explained in a separate post, I wanted to install GRUB 2.12 on an old distribution (Debian 8). So it required some acrobatics. That said, it doesn&#8217;t limit the possibility to install new kernels in the future etc. [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>It&#8217;s rarely necessary to make an issue of installing and maintaining the GRUB bootloader. However, for reasons explained in <a rel="noopener" href="https://billauer.se/blog/2024/07/container-to-kvm-virtualization/" target="_blank">a separate post</a>, I wanted to install GRUB 2.12 on an old distribution (Debian 8). So it required some acrobatics. That said, it doesn&#8217;t limit the possibility to install new kernels in the future etc. If you&#8217;re ready to edit a simple text file, rather than running automatic tools, that is. Which may actually be a good idea anyhow.</p>
<h3>The basics</h3>
<p>GRUB has two parts: First, there&#8217;s the initial code that is loaded by the BIOS, either from the MBR or from the EFI partition. That&#8217;s the plain GRUB executable. This executable goes directly to the ext2/3/4 root partition, and reads from /boot/grub/. That directory contains, among other things, the precious grub.cfg file, which GRUB reads in order to decide which modules to load, which menu entries to display and how to act if each is selected.</p>
<p>grub.cfg is created by update-grub, which effectively runs &#8220;grub-mkconfig -o /boot/grub/grub.cfg&#8221;.</p>
<p>This file is created from /etc/grub.d/ and settings from /etc/default/grub, and based upon the kernel image and initrd files that are found in /boot.</p>
<p>Hence an installation of GRUB consists of two tasks, which are fairly independent:</p>
<ul>
<li>Running grub-install so that the MBR or EFI partition is set to run GRUB, and /boot/grub/ is populated with modules and other stuff. The only important thing is that this utility knows the correct disk to target and where the partition containing /boot/grub is.</li>
<li>Running update-grub in order to create (or update) the /boot/grub/grub.cfg file. This is normally done every time the content of /boot is updated (e.g. a new kernel image).</li>
</ul>
<p>Note that grub-install populates /boot/grub with a lot of files that are used by the bootloader, so it&#8217;s necessary to run this command if /boot is wiped and started from fresh.</p>
<p>What made this extra tricky for me, was that Debian 8 comes with an old GRUB 1 version. Therefore, the option of chroot&#8217;ing into the filesystem for the purpose of installing GRUB was eliminated.</p>
<p>So there were two tasks to accomplish: Obtaining a suitable grub.cfg and running grub-install in a way that will do the job.</p>
<p>This is a good time to understand what this grub.cfg file is.</p>
<h3>The grub.cfg file</h3>
<p>grub.cfg is a script, written with a <a rel="noopener" href="https://www.gnu.org/software/grub/manual/grub/html_node/Shell_002dlike-scripting.html" target="_blank">bash-like syntax</a>, and is based upon an <a rel="noopener" href="https://www.gnu.org/software/grub/manual/grub/grub.html" target="_blank">internal command set</a>. This is a plain file in /boot/grub/, owned by root:root and writable by root only, for obvious reasons. But for the purpose of booting, permissions don&#8217;t make any difference.</p>
<p>Despite the &#8220;DO NOT EDIT THIS FILE&#8221; comment at the top of this file, and the suggestion to use grub-mkconfig, it&#8217;s perfectly OK to edit it for the purpose of updating the behavior of the boot menu. In most cases, though, manual editing is unnecessarily complicated, even when rescuing a system from a Live ISO: There&#8217;s always the possibility to chroot into the target&#8217;s root filesystem and call grub-mkconfig from there. That&#8217;s usually all that is necessary to update which kernel image / initrd should be kicked off.</p>
<p>That said, it might also be easier to edit this file manually in order to add menu entries for new kernels, for example. In addition, automatic utilities tend to add a lot of specific details that are unnecessary, and that can fail the boot process, for example if the file system&#8217;s UUID changes. So maintaining a clean grub.cfg manually can pay off in the long run.</p>
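<p>To illustrate how little is strictly needed, this is the general shape of a hand-maintained grub.cfg. This is an illustration only, not the file I actually used; the kernel path and root device would have to match your system:</p>

```
set timeout=3
set default=0

menuentry 'My Linux' {
	insmod part_gpt
	insmod ext2
	linux	/boot/vmlinuz root=/dev/vda3 ro
	initrd	/boot/initrd.img
}
```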
<p>The most interesting part in this file is the menuentry section. Let&#8217;s look at a sample command:</p>
<pre>menuentry <span class="hljs-string">'Ubuntu'</span> --class ubuntu --class gnu-linux --class gnu --class os <span class="hljs-variable">$menuentry_id_option</span> <span class="hljs-string">'gnulinux-simple-a0c2e12e-5d16-4aac-b11d-15cbec5ae98e'</span> {
	recordfail
	load_video
	gfxmode <span class="hljs-variable">$linux_gfx_mode</span>
	insmod gzio
	<span class="hljs-keyword">if</span> [ x<span class="hljs-variable">$grub_platform</span> = xxen ]; <span class="hljs-keyword">then</span> insmod xzio; insmod lzopio; <span class="hljs-keyword">fi</span>
	insmod part_gpt
	insmod ext2
	search --no-floppy --fs-uuid --set=root a0c2e12e-5d16-4aac-b11d-15cbec5ae98e
	linux	/boot/vmlinuz-6.8.0-36-generic root=UUID=a0c2e12e-5d16-4aac-b11d-15cbec5ae98e ro
	initrd	/boot/initrd.img-6.8.0-36-generic
}</pre>
<p>So these are a bunch of commands that run if the related menu entry is chosen. I&#8217;ll discuss &#8220;menuentry&#8221; and &#8220;search&#8221; below. Note the &#8220;insmod&#8221; commands, which load ELF executable modules from /boot/grub/i386-pc/. GRUB also supports lsmod, if you want to try it with GRUB&#8217;s interactive command interface.</p>
<h3>The menuentry command</h3>
<p>The menuentry command is documented <a rel="noopener" href="https://www.gnu.org/software/grub/manual/grub/grub.html#menuentry" target="_blank">here</a>. Let&#8217;s break down the command in this example:</p>
<ul>
<li>menuentry: Obviously, the command itself.</li>
<li>&#8216;Ubuntu&#8217;: The title, which is the part presented to the user.</li>
<li>&#8211;class ubuntu &#8211;class gnu-linux &#8211;class gnu &#8211;class os: The purpose of these class flags is to help GRUB group the menu options more nicely. Usually redundant.</li>
<li>$menuentry_id_option &#8216;gnulinux-simple-a0c2e12e-5d16-4aac-b11d-15cbec5ae98e&#8217;: &#8220;$menuentry_id_option&#8221; expands into &#8220;&#8211;id&#8221;, so this gives the menu option a unique identifier. It&#8217;s useful for submenus, otherwise not required.</li>
</ul>
<p>Bottom line: If there are no submenus (in the original file there actually are), this header would have done the job as well:</p>
<pre>menuentry <span class="hljs-string">'Ubuntu for the lazy'</span> {</pre>
<h3>The search command</h3>
<p>The other interesting part is this row within the menuentry clause:</p>
<pre>search --no-floppy --fs-uuid --set=root a0c2e12e-5d16-4aac-b11d-15cbec5ae98e</pre>
<p>The search command is documented <a rel="noopener" href="https://www.gnu.org/software/grub/manual/grub/grub.html#search" target="_blank">here</a>. The purpose of this command is to set the <a rel="noopener" href="https://www.gnu.org/software/grub/manual/grub/html_node/root.html" target="_blank">$root environment variable</a>, which is what the &#8220;&#8211;set=root&#8221; part means (this is an unnecessary flag, as $root is the target variable anyhow). This tells GRUB in which filesystem to look for the files mentioned in the &#8220;linux&#8221; and &#8220;initrd&#8221; commands.</p>
<p>On a system with only one Linux installed, the &#8220;search&#8221; command is unnecessary: Both $root and <a rel="noopener" href="https://www.gnu.org/software/grub/manual/grub/html_node/prefix.html" target="_blank">$prefix</a> are initialized according to the position of the /boot/grub, so there&#8217;s no reason to search for it again.</p>
<p>In this example, the filesystem is defined according to its UUID, which can be found with this Linux command:</p>
<pre># dumpe2fs /dev/vda2 | grep UUID</pre>
<p>It&#8217;s better to remove this &#8220;search&#8221; command if there&#8217;s only one /boot directory in the whole system (and it contains the Linux kernel files, of course). The advantage is that the Linux system can then be installed just by pouring all files into an ext4 filesystem (including /boot) and running grub-install, something that won&#8217;t work if grub.cfg contains explicit UUIDs. Well, actually, it will work, but with an error message and a prompt to press ENTER: The &#8220;search&#8221; command fails if the UUID is incorrect, but since it wasn&#8217;t necessary to begin with, $root retains its correct value and the system boots properly anyhow, given that ENTER is pressed. That hurdle can be annoying on a remote virtual machine.</p>
<h3>A sample menuentry command</h3>
<p>I added these lines to my grub.cfg file in order to allow my future self to try out a new kernel without being too scared about it:</p>
<pre>menuentry <span class="hljs-string">'Unused boot menu entry for future hacks'</span> {
        recordfail
        load_video
        gfxmode <span class="hljs-variable">$linux_gfx_mode</span>
        insmod gzio
        <span class="hljs-keyword">if</span> [ x<span class="hljs-variable">$grub_platform</span> = xxen ]; <span class="hljs-keyword">then</span> insmod xzio; insmod lzopio; <span class="hljs-keyword">fi</span>
        insmod part_gpt
        insmod ext2
        linux   /boot/vmlinuz-6.8.12 root=/dev/vda3 ro
}</pre>
<p>This is just an implementation of what I said above about the &#8220;menuentry&#8221; and &#8220;search&#8221; commands. In particular, that the &#8220;search&#8221; command is unnecessary. This worked well on my machine.</p>
<p>As for the other rows, I suggest mixing and matching with whatever appears in your own grub.cfg file in the same places.</p>
<h3>Obtaining a grub.cfg file</h3>
<p>So the question is: How do I get the initial grub.cfg file? Just take one from a random system? Will that be good enough?</p>
<p>Well, no, that may not work: The grub.cfg is formed differently, depending in particular on how the filesystems on the hard disk are laid out. For example, comparing two grub.cfg files, one had this row:</p>
<pre>insmod lvm</pre>
<p>and the other didn&#8217;t. Obviously, one computer utilized LVM and the other didn&#8217;t. Also, in relation to setting the $root variable, there were different variations, going from the &#8220;search&#8221; method shown above to simply this:</p>
<pre>set root='hd0,msdos1'</pre>
<p>My solution was to install an Ubuntu 24.04 system on the same KVM virtual machine that I intended to install Debian 8 on later. After the installation, I just copied the grub.cfg and wiped the filesystem. I then installed the required distribution and deleted everything under /boot. Instead, I added this grub.cfg into /boot/grub/ and edited it manually to load the correct kernel.</p>
<p>As I kept the structure of the hard disk, and the hardware environment remained unchanged, this worked perfectly fine.</p>
<h3>Running grub-install</h3>
<p>Truth be told, I probably didn&#8217;t need to use grub-install, since the MBR was already set up with GRUB thanks to the installation I had already carried out for Ubuntu 24.04. Also, I could have copied all other files in /boot/grub from this installation before wiping it. But I didn&#8217;t, and it&#8217;s a good thing I didn&#8217;t, because this way I found out how to do it from a Live ISO. And this might be important for rescue purposes, in the unlikely and very unfortunate event that it&#8217;s necessary.</p>
<p>Luckily, grub-install has an undocumented option, &#8211;root-directory, which gets the job done.</p>
<pre># grub-install --root-directory=/mnt/new/ /dev/vda
Installing for i386-pc platform.
Installation finished. No error reported.</pre>
<p>Note that using &#8211;boot-directory isn&#8217;t good enough, even if it&#8217;s mounted. Only &#8211;root-directory makes GRUB detect the correct root directory as the place to fetch the information from. With &#8211;boot-directory, the system boots with no menus.</p>
<h3>Running update-grub</h3>
<p>If you insist on running update-grub, be sure to edit /etc/default/grub and set it this way:</p>
<pre>GRUB_TIMEOUT=3
GRUB_RECORDFAIL_TIMEOUT=3</pre>
<p>The previous value for GRUB_TIMEOUT is 0, which is supposed to mean to skip the menu. If GRUB deems the boot media not to be writable, it considers every previous boot as a failure (because it can&#8217;t know if it was successful or not), and sets the timeout to 30 seconds. 3 seconds are enough, thanks.</p>
<p>And then run update-grub.</p>
<pre># <strong>update-grub</strong>
Sourcing file `/etc/default/grub'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.0-36-generic
Warning: os-prober will not be executed to detect other bootable partitions.
Systems on them will not be added to the GRUB boot configuration.
Check GRUB_DISABLE_OS_PROBER documentation entry.
Adding boot menu entry for UEFI Firmware Settings ...
done</pre>
<p>Alternatively, edit grub.cfg and fix it directly.</p>
<h3>A note about old GRUB 1</h3>
<p>This is really not related to anything else above, but since I made an attempt to install Debian 8&#8217;s GRUB on the hard disk at some point, this is what happened:</p>
<pre># <strong>apt install grub</strong>
# <strong>grub --version</strong>
grub (GNU GRUB 0.97)

# <strong>update-grub </strong>
Searching for GRUB installation directory ... found: /boot/grub
Probing devices to guess BIOS drives. This may take a long time.
Searching for default file ... Generating /boot/grub/default file and setting the default boot entry to 0
Searching for GRUB installation directory ... found: /boot/grub
Testing for an existing GRUB menu.lst file ... 

Generating /boot/grub/menu.lst
Searching for splash image ... none found, skipping ...
Found kernel: /boot/vmlinuz
Found kernel: /boot/vmlinuz-6.8.0-31-generic
Updating /boot/grub/menu.lst ... done

# <strong>grub-install /dev/vda</strong>
Searching for GRUB installation directory ... found: /boot/grub
<span class="punch">The file /boot/grub/stage1 not read correctly.</span></pre>
<p>The error message about /boot/grub/stage1 appears to be horribly misleading. According to <a rel="noopener" href="https://velenux.wordpress.com/2014/10/01/grub-install-the-file-bootgrubstage1-not-read-correctly/" target="_blank">this</a> and <a rel="noopener" href="https://blog.widmo.biz/resolved-file-bootgrubstage1-read-correctly/" target="_blank">this</a>, among others, the problem was that the ext4 file system was created with 256 as the inode size, and GRUB 1 doesn&#8217;t support that. Which makes sense, as the filesystem was created by the Ubuntu 24.04 installation and not by a museum distribution.</p>
<p>The solution is apparently to recreate the filesystem with an inode size of 128:</p>
<pre># mkfs.ext4 -I 128 /dev/vda3</pre>
<p>Actually, I don&#8217;t know if this was really the problem, because I gave up this old GRUB version quite soon.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2024/07/installing-grub-rescue/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Migrating an OpenVZ container to KVM</title>
		<link>https://billauer.se/blog/2024/07/container-to-kvm-virtualization/</link>
		<comments>https://billauer.se/blog/2024/07/container-to-kvm-virtualization/#comments</comments>
		<pubDate>Fri, 12 Jul 2024 15:11:13 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Linux kernel]]></category>
		<category><![CDATA[Server admin]]></category>
		<category><![CDATA[systemd]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=7097</guid>
		<description><![CDATA[Introduction My Debian 8-based web server had been running for several years as an OpenVZ container, when the web host told me that containers are phased out, and it&#8217;s time to move on to a KVM. This is an opportunity to upgrade to a newer distribution, most of you would say, but if a machine [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>My Debian 8-based web server had been running for several years as an OpenVZ container, when the web host told me that containers are phased out, and it&#8217;s time to move on to a KVM.</p>
<p>This is an opportunity to upgrade to a newer distribution, most of you would say, but if a machine works flawlessly for a long period of time, I&#8217;m very reluctant to change anything. Don&#8217;t touch a stable system. It just happened to have an uptime of 426 days, and the last time this server caused me trouble was way before that.</p>
<p>So the question is if it&#8217;s possible to convert a container into a KVM machine, just by copying the filesystem. After all, what&#8217;s the difference if /sbin/init (systemd) is kicked off as a plain process inside a container or if the kernel does the same thing?</p>
<p>The answer is yes-ish, this manipulation is possible, but it requires some adjustments.</p>
<p>These are my notes and action items while I found my way to get it done. Everything below is very specific to my own slightly bizarre case, and at times I ended up carrying out tasks in a different order than as listed here. But this can be useful for understanding what&#8217;s ahead.</p>
<p>By the way, the wisest thing I did throughout this process was to rehearse the whole thing on a KVM machine that I built on my own local computer. This virtual machine functioned as a mockup of the server to be installed. Not only did it make the trial and error much easier, but it also allowed me to test all kinds of things after the real server was up and running, without messing with the real machine.</p>
<h3>Faking Ubuntu 24.04 LTS</h3>
<p>To make things even more interesting, I also wanted to push the next time I&#8217;ll be required to mess with the virtual machine as far as possible into the future. Put differently, I wanted to hide the fact that the machine runs on ancient software. There should not be a request to upgrade in the foreseeable future because the old system isn&#8217;t compatible with some future version of KVM.</p>
<p>So to the KVM hypervisor, my machine should feel like an Ubuntu 24.04, which was the latest server distribution offered at the time I did this trick. Which brings the question: What does the hypervisor see?</p>
<p>The KVM guest interfaces with its hypervisor in three ways:</p>
<ul>
<li>With GRUB, which accesses the virtual disk.</li>
<li>Through the kernel, which interacts with the virtual hardware.</li>
<li>Through the guest&#8217;s DHCP client, which fetches the IP address, default gateway and DNS from the hypervisor&#8217;s dnsmasq.</li>
</ul>
<p>Or so I hope. Maybe there&#8217;s some aspect I&#8217;m not aware of. It&#8217;s not like I&#8217;m such an expert in virtualization.</p>
<p>So the idea was that both GRUB and the kernel should be the same as in Ubuntu 24.04. This way, any KVM setting that works with this distribution will work with my machine. The naphthalene smell from the user-space software underneath will not reach the hypervisor.</p>
<p>This presumption can turn out to be wrong, and the third item in the list above demonstrates that: The guest machine gets its IP address from the hypervisor through a DHCP request issued by systemd-networkd, which is part of systemd version 215. So the bluff is exposed. Will there be some kind of incompatibility between the old systemd&#8217;s DHCP client and some future hypervisor&#8217;s response?</p>
<p>Regarding this specific issue, I doubt there will be a problem, as DHCP is such a simple and well-established protocol. And even if that functionality broke, the IP address is fixed anyhow, so the virtual NIC can be configured statically.</p>
<p>But who knows, maybe there is some kind of interaction with systemd that I&#8217;m not aware of? Future will tell.</p>
<p>So it boils down to faking GRUB and using a recent kernel.</p>
<h3>Solving the GRUB problem</h3>
<p>Debian 8 comes with GRUB version 0.97. Could we call that GRUB 1? I can already imagine the answer to my support ticket saying &#8220;please upgrade your system, as our KVM hypervisor doesn&#8217;t support old versions of GRUB&#8221;.</p>
<p>So I need a new one.</p>
<p>Unfortunately, the common way to install GRUB is with a couple of hocus-pocus tools that do the work well in the usual scenario.</p>
<p>As it turns out, there are two parts that need to be installed: The first part consists of the GRUB binary on the boot partition (GRUB partition or EFI, pick your choice), plus several files (modules and other) in /boot/grub/. The second part is a script file, grub.cfg, which is a textual file that can be edited manually.</p>
<p>To make a long story short, I installed the distribution on a virtual machine with the same layout, and made a copy of the grub.cfg file that was created. I then edited this file directly to fit into the new machine. As for installing GRUB binary, I did this from a Live ISO Ubuntu 24.04, so it&#8217;s genuine and legit.</p>
<p>For the full and explained story, I&#8217;ve written <a rel="noopener" href="https://billauer.se/blog/2024/07/installing-grub-rescue/" target="_blank">a separate post</a>.</p>
<h3>Fitting a decent kernel</h3>
<p>One way or another, a kernel and its modules must be added to the filesystem in order to convert it from a container to a KVM machine. This is the essential difference: With a container, one kernel runs all containers and gives each the illusion that it&#8217;s the only one. With KVM, the boot starts from the very beginning.</p>
<p>If there was something I <strong>didn&#8217;t</strong> worry about, it was the concept of running an ancient distribution with a very recent kernel. I have a lot of experience with compiling the hot-hot-latest-out kernel and running it on steam-engine distributions, and very rarely have I seen any issue with that. The Linux kernel is backward compatible in a remarkable way.</p>
<p>My original idea was to grab the kernel image and the modules from a running installation of Ubuntu 24.04. However, the module format of this distro is incompatible with old Debian 8 (ZSTD compression seems to have been the crux), and as a result, no modules were loaded.</p>
<p>So I took config-6.8.0-36-generic from Ubuntu 24.04 and used it as the starting point for the .config file used for compiling the vanilla stable kernel with version v6.8.12.</p>
<p>And then there were a few modifications to .config:</p>
<ul>
<li>&#8220;make oldconfig&#8221; asked a few questions and made some minor modifications, nothing apparently related.</li>
<li>Dropped kernel module compression (CONFIG_MODULE_COMPRESS_ZSTD off) and set kernel&#8217;s own compression to gzip. This was probably the reason the distribution&#8217;s modules didn&#8217;t load.</li>
<li>Some crypto stuff was disabled: CONFIG_INTEGRITY_PLATFORM_KEYRING, CONFIG_SYSTEM_BLACKLIST_KEYRING and CONFIG_INTEGRITY_MACHINE_KEYRING were dropped, same with CONFIG_LOAD_UEFI_KEYS, and most importantly, CONFIG_SYSTEM_REVOCATION_KEYS was set to &#8220;&#8221;. Its previous value, &#8220;debian/canonical-revoked-certs.pem&#8221;, made the compilation fail.</li>
<li>Dropped CONFIG_DRM_I915, which caused some weird compilation error.</li>
<li>After making a test run with the kernel, I also dropped CONFIG_UBSAN with everything that comes with it. UBSAN spat a lot of warning messages on mainstream drivers, and it&#8217;s really annoying. It&#8217;s still unclear to me why these warnings don&#8217;t appear with the distribution kernel. Maybe because of a difference between compiler versions (the warnings stem from checks inserted by gcc).</li>
</ul>
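<p>The option edits above are plain text operations on .config (the kernel tree has its own helper for this, scripts/config). The sketch below demonstrates the mechanics on a scratch three-line stand-in for a real .config; option names are taken from the list above:</p>

```shell
# Demonstration only: a scratch file stands in for a real .config
# ("make olddefconfig" would then settle the dependencies).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
CONFIG_MODULE_COMPRESS_ZSTD=y
CONFIG_SYSTEM_REVOCATION_KEYS="debian/canonical-revoked-certs.pem"
CONFIG_DRM_I915=m
EOF

disable() { sed -i "s|^$1=.*|# $1 is not set|" "$cfg"; }  # turn an option off
set_str() { sed -i "s|^$1=.*|$1=\"$2\"|" "$cfg"; }        # set a string value

disable CONFIG_MODULE_COMPRESS_ZSTD
disable CONFIG_DRM_I915
set_str CONFIG_SYSTEM_REVOCATION_KEYS ""

cat "$cfg"
```

<p>In a real kernel tree, scripts/config (e.g. &#8220;scripts/config --disable MODULE_COMPRESS_ZSTD --set-str SYSTEM_REVOCATION_KEYS ''&#8221;) does the same, and more robustly.</p>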
<p>The compilation took 32 minutes on a machine with 12 cores (6 hyperthreaded). By far the longest and most difficult kernel compilation I can remember in a long time.</p>
<p>Based upon <a rel="noopener" href="https://billauer.se/blog/2015/10/linux-kernel-compilation-jots/" target="_blank">my own post</a>, I created the Debian packages for the whole thing, using the bindeb-pkg make target.</p>
<p>That took an additional 20 minutes, running on all cores. I used two of these packages in the installation of the KVM machine, as shown in the cookbook below.</p>
<h3>Methodology</h3>
<p>So the deal with my web host was like this: They started a KVM machine (with a different IP address, of course). I prepared this KVM machine, and when that was ready, I sent a support ticket asking for swapping the IP addresses. This way, the KVM machine became the new server, and the old container machine went to the junkyard.</p>
<p>As this machine involved a mail server and web sites with user content (comments to my blog, for example), I decided to stop the active server, copy &#8220;all data&#8221;, and restart the server only after the IP swap. In other words, the net result should be as if the same server had been shut down for an hour, and then restarted. No discontinuities.</p>
<p>As it turned out, everything that is related to the web server and email, including the logs of everything, is in /var/ and /home/. I could therefore copy all files from the old server to the new one for the sake of setting it up, and verify that everything is smooth as a first stage.</p>
<p>Then I shut down the services and copied /var/ and /home/. And then came the IP swap.</p>
<p>These simple commands are handy for checking which files have changed during the past week. The first finds the directories, and the second the plain files.</p>
<pre># find / -xdev -ctime -7 -type d | sort
# find / -xdev -ctime -7 -type f | sort</pre>
<p>The purpose of the -xdev flag is to remain on one filesystem. Otherwise, a lot of files from /proc and such are printed out. If your system has several relevant filesystems, be sure to add them to &#8220;/&#8221; in this example.</p>
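<p>A quick self-contained demo of the several-filesystems variant, using two scratch directories as stand-ins for two mount points:</p>

```shell
# Two scratch trees play the role of / and /home; files just touched have
# a ctime well within the 7-day window, so -ctime -7 picks them up.
base=$(mktemp -d)
mkdir -p "$base/fs1" "$base/fs2"
touch "$base/fs1/new.txt" "$base/fs2/also-new.txt"
find "$base/fs1" "$base/fs2" -xdev -ctime -7 -type f | sort
```

<p>On the real system, this becomes &#8220;find / /home -xdev -ctime -7 -type f | sort&#8221; if /home is a separate filesystem.</p>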
<p>The next few sections below are the cookbook I wrote for myself in order to get it done without messing around (and hence mess up).</p>
<p>In hindsight, I can say that except for dealing with GRUB and the kernel, most of the hassle had to do with the NIC: Its name changed from venet0 to eth0, and it got its address through DHCP relatively late in the boot process. And that required some adaptations.</p>
<h3>Preparing the virtual machine</h3>
<ul>
<li>Start the installation of Ubuntu 24.04 LTS server edition (or whatever is available, it doesn&#8217;t matter much). You may stop the installation as soon as files are being copied: The only purpose of this step is to partition the disk neatly, so that /dev/vda1 is a small partition for GRUB, and /dev/vda3 is the root filesystem (/dev/vda2 is a swap partition).</li>
<li>Start the KVM machine with a rescue image (preferably graphical or with sshd running). I went for the Ubuntu 24.04 LTS server Live ISO (the best choice provided by my web host). See notes below on using Ubuntu&#8217;s server ISO as a rescue image.</li>
<li>Wipe the existing root filesystem, if such has been installed. I considered this necessary at the time, because the default inode size may be 256, and GRUB version 1 won&#8217;t play ball with that. But later on I decided on GRUB 2. Anyhow, I forced it to be 128 bytes, despite the warning that 128-byte inodes cannot handle dates beyond 2038 and are deprecated:
<pre># mkfs.ext4 -I 128 /dev/vda3</pre>
</li>
<li>And since I was at it, no automatic fsck check. Ever. It&#8217;s really annoying when you want to kick off the server quickly.
<pre># tune2fs -c 0 -i 0 /dev/vda3</pre>
</li>
<li>Mount new system as /mnt/new:
<pre># mkdir /mnt/new
# mount /dev/vda3 /mnt/new</pre>
</li>
<li>Copy the filesystem. On the OpenVZ machine:
<pre># tar --one-file-system -cz / | nc -q 0 185.250.251.160 1234 &gt; /dev/null</pre>
<p>and the other side goes (run this before the command above):</p>
<pre># nc -l 1234 &lt; /dev/null | time tar -C /mnt/new/ -xzv</pre>
<p>This took about 30 minutes. The purpose of the &#8220;-q 0&#8221; flag and those /dev/null redirections is merely to make nc quit when the tar finishes.<br />
Or, doing the same from a backup tarball:</p>
<pre>$ cat myserver-all-24.07.08-08.22.tar.gz | nc -q 0 -l 1234 &gt; /dev/null</pre>
<p>and the other side goes</p>
<pre># nc 10.1.1.3 1234 &lt; /dev/null | time tar -C /mnt/new/ -xzv</pre>
</li>
<li>Remove old /lib/modules and boot directory:
<pre># rm -rf /mnt/new/lib/modules/ /mnt/new/boot/</pre>
</li>
<li>Create /boot/grub and copy the grub.cfg file that I&#8217;ve prepared in advance to there. <a rel="noopener" href="https://billauer.se/blog/2024/07/installing-grub-rescue/" target="_blank">This separate post</a> explains the logic behind doing it this way.</li>
<li>Install GRUB on the boot partition (this also adds a lot of files to /boot/grub/):
<pre># grub-install --root-directory=/mnt/new /dev/vda</pre>
</li>
<li>In order to work inside the chroot, some bind and tmpfs mounts are necessary:
<pre># mount -o bind /dev /mnt/new/dev
# mount -o bind /sys /mnt/new/sys
# mount -t proc /proc /mnt/new/proc
# mount -t tmpfs tmpfs /mnt/new/tmp
# mount -t tmpfs tmpfs /mnt/new/run</pre>
</li>
<li>Copy the two .deb files that contain the Linux kernel files to somewhere in /mnt/new/</li>
<li>Chroot into the new fs:
<pre># chroot /mnt/new/</pre>
</li>
<li>Check that /dev, /sys, /proc, /run and /tmp are as expected (mounted correctly).</li>
<li>Disable and stop these services: bind9, sendmail, cron.</li>
<li>This wins the prize for the oddest fix: Probably in relation to the OpenVZ container, the LSB modules_dep service is active, and it deletes all module files in /lib/modules on reboot. So make sure to never see it again. Just disabling it wasn&#8217;t good enough.
<pre># systemctl mask modules_dep.service</pre>
</li>
<li>Install the Linux kernel and its modules into /boot and /lib/modules:
<pre># dpkg -i linux-image-6.8.12-myserver_6.8.12-myserver-2_amd64.deb</pre>
</li>
<li>Also install the headers for compilation (why not?)
<pre># dpkg -i linux-headers-6.8.12-myserver_6.8.12-myserver-2_amd64.deb</pre>
</li>
<li>Add /etc/systemd/network/20-eth0.network
<pre>[Match]
Name=eth0

[Network]
DHCP=yes</pre>
<p>The NIC was a given in a container, but now it has to be brought up explicitly, with the IP address possibly obtained from the hypervisor via DHCP, as I&#8217;ve done here.</p></li>
<li>Add the two following lines to /etc/sysctl.conf, in order to turn off IPv6:
<pre>net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1</pre>
</li>
<li>Adjust the firewall rules, so that they don&#8217;t depend on the server having a specific IP address (because a temporary IP address will be used).</li>
<li>Add support for lspci (better do it now if something goes wrong after booting):
<pre># apt install pciutils</pre>
</li>
<li>Ban the evbug module, which is intended to generate debug messages on input devices. Unfortunately, it sometimes floods the kernel log when the mouse goes over the virtual machine&#8217;s console window. So ditch it by adding /etc/modprobe.d/evbug-blacklist.conf with this single line:
<pre>blacklist evbug</pre>
</li>
<li>Edit /etc/fstab. Remove everything, and leave only this row:
<pre>/dev/vda3 / ext4 defaults 0 1</pre>
</li>
<li>Remove persistent udev rules, if such exist, at /etc/udev/rules.d. Oddly enough, there was nothing in this directory, neither in the existing OpenVZ server nor in a regular Ubuntu 24.04 server installation.</li>
<li>Boot up the system from disk, and perform post-boot fixes as mentioned below.</li>
</ul>
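<p>The &#8220;check that /dev, /sys, /proc, /run and /tmp are as expected&#8221; step in the list above can be done with a one-liner. mountpoint(1) is part of util-linux, so it should be available inside the chroot as well:</p>

```shell
# Report which of the expected pseudo-filesystems are actually mount points.
for m in /dev /sys /proc /tmp /run; do
  if mountpoint -q "$m"; then echo "$m ok"; else echo "$m MISSING"; fi
done
```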
<h3>Post-boot fixes</h3>
<ul>
<li>Verify that /tmp is indeed mounted as a tmpfs.</li>
<li>Disable (actually, mask) the automount service, which is useless and fails. This makes systemd&#8217;s status degraded, which is practically harmless, but confusing.
<pre># systemctl mask proc-sys-fs-binfmt_misc.automount</pre>
</li>
<li>Install the dbus service:
<pre># apt install dbus</pre>
<p>Not only is it the right thing to do on a Linux system, but it also silences this warning:</p>
<pre>Cannot add dependency job for unit dbus.socket, ignoring: Unit dbus.socket failed to load: No such file or directory.</pre>
</li>
<li>Enable login prompt on the default visible console (tty1) so that a prompt appears after all the boot messages:
<pre># systemctl enable getty@tty1.service</pre>
<p>The other ttys get a login prompt with Ctrl-Alt-Fn, but the visible console didn&#8217;t. So this fixed it. Otherwise, one can be misled into thinking that the boot process is stuck.</p></li>
<li>Optionally: Disable <a rel="noopener" href="https://wiki.openvz.org/Debian_template_creation" target="_blank">vzfifo service</a> and remove /.vzfifo.</li>
</ul>
<h3>Just before the IP address swap</h3>
<ul>
<li>Reboot the openVZ server to make sure that it wakes up OK.</li>
<li>Change the openVZ server&#8217;s firewall, so that it works with a different IP address. Otherwise, it becomes unreachable after the IP swap.</li>
<li>Boot the target KVM machine <span class="punch">in rescue mode</span>. No need to set up the ssh server as all will be done through VNC.</li>
<li>On the KVM machine, mount new system as /mnt/new:
<pre># mkdir /mnt/new
# mount /dev/vda3 /mnt/new</pre>
</li>
<li>On the OpenVZ server, check for recently changed directories and files:
<pre># find / -xdev -ctime -7 -type d | sort &gt; recently-changed-dirs.txt
# find / -xdev -ctime -7 -type f | sort &gt; recently-changed-files.txt</pre>
</li>
<li>Verify that the changes are only in the places that are going to be updated. If not, consider if and how to update these other files.</li>
<li>Verify that the mail queue is empty, or let sendmail empty it if possible. Not a good idea to have something firing off as soon as sendmail resumes:
<pre># mailq</pre>
</li>
<li>Disable all services except sshd on the OpenVZ server:
<pre># systemctl disable cron dovecot apache2 bind9 sendmail mysql xinetd</pre>
</li>
<li>Run &#8220;mailq&#8221; again to verify that the mail queue is empty (unless there was a reason to leave a message there in the previous check).</li>
<li>Reboot OpenVZ server and verify that none of these is running. This is the point at which this machine is dismissed as a server, and the downtime clock begins ticking.</li>
<li>Verify that this server doesn&#8217;t listen to any ports except ssh, as an indication that all services are down:
<pre># netstat -n -a | less</pre>
</li>
<li>Repeat the check of recently changed files.</li>
<li>On the <strong>KVM machine</strong>, remove /var and /home:
<pre># rm -rf /mnt/new/var /mnt/new/home</pre>
</li>
<li>Copy these parts:<br />
On the KVM machine, <span class="punch">using the VNC console</span>, go:
<pre># nc -l 1234 &lt; /dev/null | time tar -C /mnt/new/ -xzv</pre>
<p>and on myserver:</p>
<pre># tar --one-file-system -cz /var /home | nc -q 0 185.250.251.160 1234 &gt; /dev/null</pre>
<p>This took 28 minutes.</p></li>
<li>Check that /mnt/new/tmp and /mnt/new/run are empty, and remove whatever is found there. There&#8217;s no reason for anything to be there, and it would be weird if there was, given the way the filesystem was copied from the original machine. But if there are any files, it&#8217;s just confusing, as /tmp and /run are tmpfs on the running machine, so any files there will be invisible anyhow.</li>
<li>Reboot the KVM machine <strong>with a reboot command</strong>. It will stop anyhow for removing the CDROM.</li>
<li>Remove the KVM&#8217;s CDROM and continue the reboot normally.</li>
<li>Login to the KVM machine with <span class="punch">ssh</span>.</li>
<li>Check that all is OK: systemctl status as well as journalctl. Note that apache, mysql and dovecot should be running now.</li>
<li>Power down both virtual machines.</li>
<li>Request an IP address swap. Let them do whatever they want with the <span class="punch">IPv6</span> addresses, as they are ignored anyhow.</li>
</ul>
<h3>After IP address swap</h3>
<ul>
<li>Start the KVM server normally, and login normally <span class="punch">through ssh</span>.</li>
<li>Try to browse into the web sites: The web server should already be working properly (even though this server&#8217;s DNS is down, there&#8217;s a backup DNS).</li>
<li>Check journalctl and systemctl status.</li>
<li>Resume the original firewall rules and <span class="punch">verify that the firewall works properly</span>:
<pre># systemctl restart netfilter-persistent
# iptables -vn -L</pre>
</li>
<li>Start all services, and check status and journalctl again:
<pre># systemctl start cron dovecot apache2 bind9 sendmail mysql xinetd</pre>
</li>
<li>If all is fine, enable these services:
<pre># systemctl enable cron dovecot apache2 bind9 sendmail mysql xinetd</pre>
</li>
<li>Reboot (with reboot command), and check that all is fine.</li>
<li>In particular, send DNS queries directly to the server with dig, and also send an email to a foreign address (e.g. gmail). My web host blocked outgoing connections to port 25 on the new server, for example.</li>
<li>Delete ifcfg-venet0 and ifcfg-venet0:0 in /etc/sysconfig/network-scripts/, as they relate to the venet0 interface that exists only in the container machine. Having them there is just misleading.</li>
<li>Compare /etc/rc* and /etc/systemd with the situation before the transition in the git repo, to verify that everything is like it should be.</li>
</ul>
<ul>
<li>Check the server with nmap (run this from another machine):
<pre>$ nmap -v -A <span style="color: #888888;"><em>server</em></span>
$ sudo nmap -v -sU <span style="color: #888888;"><em>server</em></span></pre>
</li>
</ul>
<h3>And then the DNS didn&#8217;t work</h3>
<p>I knew very well why I left plenty of time free for after the IP swap. Something will always go wrong after a maneuver like this, and this time was no different. And for some odd reason, it was the bind9 DNS that played two different kinds of pranks.</p>
<p>I noted immediately that the server didn&#8217;t answer to DNS queries. As it turned out, there were two apparently independent reasons for it.</p>
<p>The first was that when I re-enabled the bind9 service (after disabling it for the sake of moving), systemctl went for the SYSV scripts instead of its own. So I got:</p>
<pre># <strong>systemctl enable bind9</strong>
Synchronizing state for bind9.service with sysvinit using update-rc.d...
Executing /usr/sbin/update-rc.d bind9 defaults
insserv: warning: current start runlevel(s) (empty) of script `bind9' overrides LSB defaults (2 3 4 5).
insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `bind9' overrides LSB defaults (0 1 6).
Executing /usr/sbin/update-rc.d bind9 enable</pre>
<p>This could have been harmless and gone unnoticed, had I not added a &#8220;-4&#8221; flag to bind9&#8217;s command line, without which it wouldn&#8217;t work. Because the SYSV scripts ran instead, my change in /etc/systemd/system/bind9.service wasn&#8217;t in effect.</p>
<p>Solution: Delete all files related to bind9 in /etc/init.d/ and /etc/rc*.d/. Quite aggressive, but did the job.</p>
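<p>The deletion, in glob form. This is a scratch-directory demonstration; on the real system the paths are /etc/init.d/ and /etc/rc?.d/, and &#8220;update-rc.d -f bind9 remove&#8221; is the politer equivalent:</p>

```shell
# SYSV boot links follow the S??name/K??name convention under rc?.d/.
# Build a throwaway replica and remove only the bind9 entries.
root=$(mktemp -d)
mkdir -p "$root/init.d" "$root/rc0.d" "$root/rc2.d"
touch "$root/init.d/bind9" "$root/rc0.d/K01bind9" \
      "$root/rc2.d/S01bind9" "$root/rc2.d/S02cron"
rm -f "$root"/init.d/bind9 "$root"/rc?.d/[SK]??bind9
find "$root" -name '*bind9*'   # prints nothing: all bind9 hooks are gone
find "$root" -name '*cron*'    # other services' links survive
```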
<p>Having that fixed, it still didn&#8217;t work. The problem now was that eth0 was configured through DHCP after bind9 had begun running. As a result, the DNS didn&#8217;t listen on eth0.</p>
<p>I slapped myself for thinking about adding a &#8220;sleep&#8221; command before launching bind9, and went for the right way to do this. Namely:</p>
<pre>$ <strong>cat /etc/systemd/system/bind9.service</strong>
[Unit]
Description=BIND Domain Name Server
Documentation=man:named(8)
After=network-online.target <span class="punch">systemd-networkd-wait-online.service</span>
Wants=network-online.target <span class="punch">systemd-networkd-wait-online.service</span>

[Service]
ExecStart=/usr/sbin/named -4 -f -u bind
ExecReload=/usr/sbin/rndc reload
ExecStop=/usr/sbin/rndc stop

[Install]
WantedBy=multi-user.target</pre>
<p>The systemd-networkd-wait-online.service is not there by coincidence. Without it, bind9 was launched before eth0 had received an address. With this, systemd consistently waited for the DHCP to finish, and then launched bind9. As it turned out, this also delayed the start of apache2 and sendmail.</p>
<p>If anything, network-online.target is most likely redundant.</p>
<p>And with this fix, the crucial row appeared in the log:</p>
<pre>named[379]: listening on IPv4 interface eth0, 193.29.56.92#53</pre>
<p>Another solution could have been to assign an address to eth0 statically. For some odd reason, I prefer to let DHCP do this, even though the firewall will block all traffic anyhow if the IP address changes.</p>
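<p>For the record, the static alternative would have been a 20-eth0.network file roughly like this instead (the gateway address and netmask here are hypothetical; only the server&#8217;s own address is taken from the log above), which eliminates the DHCP wait altogether:</p>

```ini
[Match]
Name=eth0

[Network]
Address=193.29.56.92/24
Gateway=193.29.56.1
```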
<h3>Using Live Ubuntu as rescue mode</h3>
<p>Set Ubuntu 24.04 server amd64 as the CDROM image.</p>
<p>After the machine has booted, send a Ctrl-Alt-F2 to switch to the second console. Don&#8217;t go on with the installation wizard, as it will of course wipe the server.</p>
<p>In order to establish an ssh connection:</p>
<ul>
<li>Choose a password for the default user (ubuntu-server).
<pre>$ passwd</pre>
<p>If you insist on a weak password, remember that you can do that only as root.</p></li>
<li>Use ssh to log in:
<pre>$ ssh ubuntu-server@185.250.251.160</pre>
</li>
</ul>
<p>Root login is forbidden (by default), so don&#8217;t even try.</p>
<p>Note that even though sshd apparently listens only on IPv6 ports, it actually accepts IPv4 connections by virtue of IPv4-mapped IPv6 addresses:</p>
<pre># <strong>lsof -n -P -i tcp 2&gt;/dev/null</strong>
COMMAND    PID            USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
systemd      1            root  143u  <span class="punch">IPv6</span>   5323      0t0  <span class="punch">TCP *:22</span> (LISTEN)
systemd-r  911 systemd-resolve   15u  IPv4   1766      0t0  TCP 127.0.0.53:53 (LISTEN)
systemd-r  911 systemd-resolve   17u  IPv4   1768      0t0  TCP 127.0.0.54:53 (LISTEN)
<span class="punch">sshd</span>      1687            root    3u  <span class="punch">IPv6</span>   5323      0t0  <span class="punch">TCP *:22</span> (LISTEN)
sshd      1847            root    4u  <span class="punch">IPv6</span>  11147      0t0  <span class="punch">TCP 185.250.251.160:22-&gt;85.64.140.6:57208</span> (ESTABLISHED)
sshd      1902   ubuntu-server    4u  IPv6  11147      0t0  TCP 185.250.251.160:22-&gt;85.64.140.6:57208 (ESTABLISHED)</pre>
<p>So don&#8217;t get confused by netstat and other similar utilities.</p>
<h3>To NTP or not?</h3>
<p>I wasn&#8217;t sure if I should run an NTP client inside a KVM virtual machine. So these are the notes I took.</p>
<ul>
<li><a rel="noopener" href="https://opensource.com/article/17/6/timekeeping-linux-vms" target="_blank">This</a> is a nice tutorial to start with.</li>
<li>It&#8217;s probably a good idea to run an NTP client inside the guest. It <a rel="noopener" href="https://sanjuroe.dev/sync-kvm-guest-using-ptp" target="_blank">would have been better to utilize the PTP protocol</a> and get the host&#8217;s clock directly, but that is really overkill. The drawback with these daemons is that if the guest goes down and back up again, it will start with the old time, and then jump.</li>
<li>It&#8217;s also <a rel="noopener" href="https://doc.opensuse.org/documentation/leap/archive/42.1/virtualization/html/book.virt/sec.kvm.managing.clock.html" target="_blank">a good idea</a> to use kvm_clock in addition to NTP. This kernel feature uses the pvclock protocol to <a rel="noopener" href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/virtualization_administration_guide/sect-virtualization-tips_and_tricks-libvirt_managed_timers#sect-timer-element" target="_blank">let guest virtual machines read the host physical machine’s wall clock time</a> as well as its TSC. See <a rel="noopener" href="https://rwmj.wordpress.com/2010/10/15/kvm-pvclock/" target="_blank">this post for a nice tutorial</a> about kvm_clock.</li>
<li>In order to know which clock source the kernel uses, <a rel="noopener" href="https://access.redhat.com/solutions/18627" target="_blank">look in /sys/devices/system/clocksource/clocksource0/current_clocksource</a>. Quite expectedly, it was kvm-clock (available sources were kvm-clock, tsc and acpi_pm).</li>
<li>It so turned out that systemd-timesyncd started running without my intervention when moving from a container to KVM.</li>
</ul>
<p>On a working KVM machine, timesyncd tells about its presence in the log:</p>
<pre>Jul 11 20:52:52 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/+0.001s/0.007s/0.003s/+0ppm
Jul 11 21:27:00 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/-0.000s/0.007s/0.001s/+0ppm
Jul 11 22:01:08 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/-0.002s/0.007s/0.001s/+0ppm
Jul 11 22:35:17 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/-0.001s/0.007s/0.001s/+0ppm
Jul 11 23:09:25 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/+0.007s/0.007s/0.003s/+0ppm
Jul 11 23:43:33 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/-0.003s/0.007s/0.005s/+0ppm (ignored)
Jul 12 00:17:41 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/-0.006s/0.007s/0.005s/-1ppm
Jul 12 00:51:50 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/+0.001s/0.007s/0.005s/+0ppm
Jul 12 01:25:58 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/+0.002s/0.007s/0.005s/+0ppm
Jul 12 02:00:06 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/+0.002s/0.007s/0.005s/+0ppm
Jul 12 02:34:14 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/-0.001s/0.007s/0.005s/+0ppm
Jul 12 03:08:23 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/-0.000s/0.007s/0.005s/+0ppm
Jul 12 03:42:31 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/-0.001s/0.007s/0.004s/+0ppm
Jul 12 04:17:11 myserver systemd-timesyncd[197]: interval/delta/delay/jitter/drift 2048s/-0.000s/0.007s/0.003s/+0ppm</pre>
<p>So a resync takes place every 2048 seconds (34 minutes and 8 seconds), like clockwork. As apparent from the values, there&#8217;s no dispute about the time between Debian&#8217;s NTP server and the web host&#8217;s hypervisor.</p>
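<p>Just to double-check that figure:</p>

```shell
# 2048 seconds, expressed as minutes and seconds.
echo "$((2048 / 60)) min $((2048 % 60)) s"   # prints "34 min 8 s"
```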
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2024/07/container-to-kvm-virtualization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Running KVM on Linux Mint 19 random jots</title>
		<link>https://billauer.se/blog/2024/07/virtualization-notes-to-self-2/</link>
		<comments>https://billauer.se/blog/2024/07/virtualization-notes-to-self-2/#comments</comments>
		<pubDate>Fri, 12 Jul 2024 14:54:45 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=7094</guid>
		<description><![CDATA[General Exactly like my previous post from 14 years ago, these are random jots that I took as I set up a QEMU/KVM-based virtual machine on my Linux Mint 19 computer. This time, the purpose was to prepare myself for moving a server from an OpenVZ container to KVM. Other version details, for the record: [...]]]></description>
			<content:encoded><![CDATA[<h3>General</h3>
<p>Exactly like <a rel="noopener" href="https://billauer.se/blog/2010/01/virtualization-notes-to-self/" target="_blank">my previous post</a> from 14 years ago, these are random jots that I took as I set up a QEMU/KVM-based virtual machine on my Linux Mint 19 computer. This time, the purpose was to prepare myself for moving a server from an OpenVZ container to KVM.</p>
<p>Other version details, for the record: libvirt version 4.0.0, QEMU version 2.11.1, Virtual Machine manager 1.5.1.</p>
<h3>Installation</h3>
<p>Install some relevant packages:</p>
<pre># apt install qemu-kvm qemu-utils libvirt-daemon-system libvirt-clients virt-manager virt-viewer ebtables ovmf</pre>
<p>This clearly installed a few services: libvirt-bin, libvirtd, libvirt-guest, virtlogd, qemu-kvm, ebtables, and a couple of sockets: virtlockd.socket and virtlogd.socket with their attached services.</p>
<p>My regular username on the computer was added automatically to the &#8220;libvirt&#8221; group, however that doesn&#8217;t take effect until one logs out and in again. Without belonging to this group, one gets the error message &#8220;Unable to connect to libvirt qemu:///system&#8221; when attempting to run the Virtual Machine Manager. Or in more detail: &#8220;libvirtError: Failed to connect socket to &#8216;/var/run/libvirt/libvirt-sock&#8217;: Permission denied&#8221;.</p>
<p>The lazy and temporary solution is to run the Virtual Machine Manager with &#8220;sg&#8221;. So instead of the usual command for starting the GUI tool (NOT as root):</p>
<pre>$ virt-manager &amp;</pre>
<p>Use &#8220;sg&#8221; (or start a session with the &#8220;newgrp&#8221; command):</p>
<pre>$ sg libvirt virt-manager &amp;</pre>
<p>This is necessary only until next time you log in to the console. I think. I didn&#8217;t get that far. Who logs out?</p>
<p>There&#8217;s also a command-line utility, virsh. For example, to list all running machines:</p>
<pre>$ sudo virsh list</pre>
<p>Or just &#8220;sudo virsh&#8221; for an interactive shell.</p>
<p>Note that without root permissions, the list is simply empty. This is really misleading.</p>
<h3>General notes</h3>
<ul>
<li>Virtual machines are called &#8220;domains&#8221; in several contexts (within virsh in particular).</li>
<li>To get the mouse out of the graphical window, use Ctrl-Alt.</li>
<li>For networking to work, some rules related to virbr0 are automatically added to the iptables firewall. If these are absent, go &#8220;systemctl restart libvirtd&#8221; (don&#8217;t do this with virtual machines running, of course).</li>
<li>These iptables rules are important in particular for WAN connections. Apparently, these allow virbr0 to make DNS queries to the local machine (adding rules to the INPUT and OUTPUT chains). In addition, the FORWARD rule allows forwarding anything to and from virbr0 (as long as the correct address mask is matched). Plus a whole lot of stuff around POSTROUTING. Quite disgusting, actually.</li>
<li>There are two Ethernet interfaces related to KVM virtualization: vnet0 and virbr0 (typically). For sniffing, virbr0 is a better choice, as it&#8217;s the virtual machine&#8217;s own bridge to the system, so there is less noise. This is also the interface that has an IP address of its own.</li>
<li>A vnetN pops up for each virtual machine that is running, virbr0 is there regardless.</li>
<li>The configuration files are kept as fairly readable XML files in /etc/libvirt/qemu</li>
<li>The images are typically held at /var/lib/libvirt/images, owned by root with 0600 permissions.</li>
<li>The libvirtd service runs /usr/sbin/libvirtd as well as two processes of /usr/sbin/dnsmasq. When a virtual machine runs, it also runs an instance of qemu-system-x86_64 on its behalf.</li>
</ul>
<h3>Creating a new virtual machine</h3>
<p>Start the Virtual Machine Manager. The GUI is good enough for my purposes.</p>
<pre>$ sg libvirt virt-manager &amp;</pre>
<ul>
<li>Click on the &#8220;Create new virtual machine&#8221; and choose &#8220;Local install media&#8221;. Set the other parameters as necessary.</li>
<li>As for storage, choose &#8220;Select or create custom storage&#8221; and create a qcow2 volume in a convenient position on the disk (/var/lib/libvirt/images is hardly a good place for that, as it&#8217;s on the root partition).</li>
<li>In the last step, choose &#8220;customize configuration before install&#8221;.</li>
<li>Network selection: Virtual network &#8216;default&#8217;: NAT.</li>
<li>Change the NIC, Disk and Video to VirtIO as mentioned below.</li>
<li>Click &#8220;Begin Installation&#8221;.</li>
</ul>
<h3>Do it with VirtIO</h3>
<p>That is, use Linux&#8217; paravirtualization drivers, rather than emulation of hardware.</p>
<p>To set up a machine&#8217;s settings, go View &gt; Details.</p>
<p>This is lspci&#8217;s response with a default virtual machine:</p>
<pre>00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Red Hat, Inc. QXL paravirtual graphic card (rev 04)
00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8100/8101L/8139 PCI Fast Ethernet Adapter (rev 20)
00:04.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 01)
00:05.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:05.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:05.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:05.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:06.0 Communication controller: Red Hat, Inc Virtio console
00:07.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon</pre>
<p>Cute, but all interfaces are emulations of real hardware. In other words, this will run really slowly.</p>
<p>Testing link speed: On the host machine:</p>
<pre>$ nc -l 1234 &lt; /dev/null &gt; /dev/null</pre>
<p>And on the guest:</p>
<pre>$ dd if=/dev/zero bs=128k count=4k | nc -q 0 10.1.1.3 1234
4096+0 records in
4096+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 3.74558 s, 143 MB/s</pre>
<p>Quite impressive for hardware emulation, I must admit. But it can get better.</p>
<p>Things to change from the default settings:</p>
<ul>
<li>NIC: Choose &#8220;virtio&#8221; as device model, keep &#8220;Virtual network &#8216;default&#8217;&#8221; as NAT.</li>
<li>Disk: On &#8220;Disk bus&#8221;, don&#8217;t use IDE, but rather &#8220;VirtIO&#8221; (it will appear as /dev/vda etc.).</li>
<li>Video: Don&#8217;t use QXL, but Virtio (without 3D acceleration, as that wasn&#8217;t supported on my machine). Actually, I&#8217;m not so sure about this one. For example, Ubuntu&#8217;s installation live boot occasionally gave me a black screen with Virtio.</li>
</ul>
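<p>For reference, these settings end up as device model attributes in the machine&#8217;s libvirt domain XML, which can be edited directly with &#8220;virsh edit&#8221; while the machine is shut down. The following is a rough sketch of the relevant fragments only, not a complete definition, and the image path is a made-up placeholder:</p>
<pre>&lt;disk type='file' device='disk'&gt;
  &lt;source file='/var/lib/libvirt/images/guest.qcow2'/&gt;
  &lt;target dev='vda' bus='virtio'/&gt;  &lt;!-- VirtIO disk bus --&gt;
&lt;/disk&gt;

&lt;interface type='network'&gt;
  &lt;source network='default'/&gt;       &lt;!-- NAT, as before --&gt;
  &lt;model type='virtio'/&gt;            &lt;!-- virtio device model --&gt;
&lt;/interface&gt;

&lt;video&gt;
  &lt;model type='virtio'/&gt;            &lt;!-- Virtio instead of QXL --&gt;
&lt;/video&gt;</pre>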
<p>Note that it&#8217;s possible to use a VNC server instead of &#8220;Display spice&#8221;.</p>
<p>After making these changes:</p>
<pre>00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Red Hat, Inc <span class="punch">Virtio</span> GPU (rev 01)
00:03.0 Ethernet controller: Red Hat, Inc <span class="punch">Virtio</span> network device
00:04.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 01)
00:05.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:05.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:05.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:05.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:06.0 Communication controller: Red Hat, Inc Virtio console
00:07.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
00:08.0 SCSI storage controller: Red Hat, Inc <span class="punch">Virtio</span> block device</pre>
<p>Try the speed test again?</p>
<pre>$ dd if=/dev/zero bs=128k count=4k | nc -q 0 10.1.1.3 1234
4096+0 records in
4096+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 0.426422 s, 1.3 GB/s</pre>
<p>Almost ten times faster.</p>
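<p>To put a number on &#8220;almost&#8221;, divide the two dd timings:</p>
<pre>$ awk 'BEGIN { printf "%.1f\n", 3.74558 / 0.426422 }'
8.8</pre>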
<h3>Preparing a live Ubuntu ISO for ssh</h3>
<pre>$ sudo su
# apt install openssh-server
# passwd ubuntu</pre>
<p>During the installation of openssh-server, there&#8217;s a question about which configuration files to use. Choose the package maintainer&#8217;s version.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2024/07/virtualization-notes-to-self-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Run Firefox over X11 over SSH / VNC on a cheap virtual machine</title>
		<link>https://billauer.se/blog/2021/11/x-app-over-ssh/</link>
		<comments>https://billauer.se/blog/2021/11/x-app-over-ssh/#comments</comments>
		<pubDate>Tue, 16 Nov 2021 09:49:16 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Server admin]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6444</guid>
		<description><![CDATA[To run over SSH: Not This is how to run a Firefox browser on a cheap VPS machine (e.g. a Google Cloud VM Instance) with an X-server connection. It&#8217;s actually not a good idea, because it&#8217;s extremely slow. The correct way is to set up a VNC server, because the X server connection exchanges information [...]]]></description>
			<content:encoded><![CDATA[<h3>To run over SSH: Not</h3>
<p>This is how to run a Firefox browser on a cheap VPS machine (e.g. a Google Cloud VM Instance) with an X-server connection. It&#8217;s actually not a good idea, because it&#8217;s extremely slow. The correct way is to set up a VNC server, because the X server connection exchanges information on every little mouse movement or screen update. It&#8217;s a disaster on a slow connection.</p>
<p>My motivation was to download a 10 GB file from Microsoft&#8217;s cloud storage. With my own Internet connection it failed consistently after a Gigabyte or so (I guess the connection timed out). So the idea is to have Firefox running on a remote server with a much better connection. And then transfer the file.</p>
<p>Since it&#8217;s a one-off task, and I kind-of like these bizarre experiments, here we go.</p>
<p>These steps:</p>
<p>Edit /etc/ssh/sshd_config, making sure it reads</p>
<pre>X11Forwarding yes</pre>
<p>Install xauth, also necessary to open a remote X:</p>
<pre># apt install xauth</pre>
<p>Then restart the ssh server:</p>
<pre># systemctl restart ssh</pre>
<p>and then install Firefox</p>
<pre># apt install firefox-esr</pre>
<p>There will be a lot of dependencies to install.</p>
<p>At this point, it&#8217;s possible to connect to the server with ssh -X and run firefox on the remote machine.</p>
<p>Expect a horribly slow browser, though. Every small animation or mouse movement is transferred over the link, so it easily gets stuck. So think before every single move, and keep in mind that every little thing in the graphics that gets updated costs traffic.</p>
<p>Firefox &#8220;cleverly&#8221; announces that &#8220;a web page is slowing down your browser&#8221; all the time, but the animation of these announcements becomes part of the problem.</p>
<p>It&#8217;s also a good idea to keep the window small, so there isn&#8217;t much area to keep updated. And most important: <strong>Keep the mouse pointer off the remote window</strong> unless it&#8217;s needed there for a click. Otherwise things get stuck. Just get into the window, click, and leave. Or stay if the click was for the sake of typing (or better, pasting something).</p>
<h3>Run over VNC instead</h3>
<p>This requires installing an X-Windows server. Not a big deal.</p>
<pre># apt update
# apt-get install xfce4
# apt install x-window-system</pre>
<p>Once installed, open a VNC window. It&#8217;s easiest done by clicking a button in the user&#8217;s VPS Client Area (also available on the control panel, but why go that far) and then go</p>
<pre># startx</pre>
<p>at command prompt to start the server. And then start the browser as usual.</p>
<p>It doesn&#8217;t make sense to have a graphical login manager, as it slows down the boot process and eats memory. Unless a VNC connection is the intended way to always use the virtual machine.</p>
<p>Firefox is still quite slow, but not as bad as with ssh.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2021/11/x-app-over-ssh/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Using firejail to throttle network bandwidth for wget and such</title>
		<link>https://billauer.se/blog/2021/08/firejail-network-bandwidth-limit/</link>
		<comments>https://billauer.se/blog/2021/08/firejail-network-bandwidth-limit/#comments</comments>
		<pubDate>Sun, 15 Aug 2021 13:39:17 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Server admin]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6382</guid>
		<description><![CDATA[Introduction Occasionally, I download / upload huge files, and it kills my internet connection for plain browsing. I don&#8217;t want to halt the download or suspend it, but merely calm it down a bit, temporarily, for doing other stuff. And then let it hog as much as it want again. There are many ways to [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>Occasionally, I download / upload huge files, and it kills my internet connection for plain browsing. I don&#8217;t want to halt the download or suspend it, but merely calm it down a bit, temporarily, for doing other stuff. And then let it hog as much as it wants again.</p>
<p>There are many ways to do this, and I went for firejail. I suggest reading <a title="Firejail: Putting a program in its own little container" href="https://billauer.se/blog/2020/06/firejail-cgroups/" target="_blank">this post of mine</a> as well on this tool.</p>
<p>Firejail gives you a shell prompt, which runs inside a mini-container, like those cheap virtual hosting services. Then run wget or youtube-dl as you wish from that shell.</p>
<p>It has access to practically everything on the computer, but the network interface is controlled. Since firejail is based on cgroups, all processes and subprocesses are collectively subject to the network bandwidth limit.</p>
<p>Using firejail requires setting up a bridge network interface. This is a bit of container hocus-pocus, and is necessary to get control over the network data flow. But it&#8217;s simple, and it can be done once (until the next reboot, unless the bridge is configured permanently, something I don&#8217;t bother with).</p>
<h3>Setting up a bridge interface</h3>
<p>Remember: Do this once, and just don&#8217;t remove the interface when done with it.</p>
<p>You might need to</p>
<pre># <strong>apt install bridge-utils</strong></pre>
<p>So first, set up a new bridge device (as root):</p>
<pre># <strong>brctl addbr hog0</strong></pre>
<p>and give it an IP address that doesn&#8217;t collide with anything else on the system. Otherwise, it really doesn&#8217;t matter which:</p>
<pre># <strong>ifconfig hog0 10.22.1.1/24</strong></pre>
<p>What&#8217;s going to happen is that there will be a network interface named eth0 inside the container, which will behave as if it was connected to a real Ethernet card named hog0 on the computer. Hence the container has access to everything that is covered by the routing table (by means of IP forwarding), and is also subject to the firewall rules. With my specific firewall setting, it prevents some access, but ppp0 isn&#8217;t blocked, so who cares.</p>
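<p>A side note, in case the jailed shell turns out to have no network access at all: the host must actually forward and NAT the bridge&#8217;s subnet. This is a sketch only, assuming the 10.22.1.0/24 subnet from the example above, with masquerading applied on whatever the uplink is; adapt to the actual setup:</p>
<pre># <strong>sysctl -w net.ipv4.ip_forward=1</strong>
# <strong>iptables -t nat -A POSTROUTING -s 10.22.1.0/24 ! -d 10.22.1.0/24 -j MASQUERADE</strong></pre>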
<p>To remove the bridge (no real reason to do it):</p>
<pre># <strong>brctl delbr hog0</strong></pre>
<h3>Running the container</h3>
<p>Launch a shell with firejail (I called it &#8220;nethog&#8221; in this example):</p>
<pre>$ <strong>firejail --net=hog0 --noprofile --name=nethog</strong></pre>
<p>This starts a new shell, for which the bandwidth limit is applied. Run wget or whatever from here.</p>
<p>Note that despite the --noprofile flag, there are still some directories that are read-only and some are temporary as well. It&#8217;s done in a sensible way, though, so odds are that it won&#8217;t cause any issues. Running &#8220;df&#8221; inside the container gives an idea of what is mounted how, and it&#8217;s scarier than the actual situation.</p>
<p>But <strong>be sure to check that the files that are downloaded are visible outside the container</strong>.</p>
<p>From another shell prompt, <strong>outside the container</strong> go something like (<strong>doesn&#8217;t </strong>require root):</p>
<pre>$ <strong>firejail --bandwidth=nethog set hog0 800 75</strong>
Removing bandwith limit
Configuring interface eth0
Download speed  6400kbps
Upload speed  600kbps
cleaning limits
configuring tc ingress
configuring tc egress</pre>
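<p>Note the units: the two numbers given to &#8220;set&#8221; are in kilobytes/s, and firejail reports them back in kilobits/s, hence the factor of 8:</p>
<pre>$ echo $(( 800 * 8 )) $(( 75 * 8 ))
6400 600</pre>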
<p>To drop the bandwidth limit:</p>
<pre>$ <strong>firejail --bandwidth=nethog clear hog0</strong></pre>
<p>And get the status (saying, among others, how many packets have been dropped):</p>
<pre>$ <strong>firejail --bandwidth=nethog status</strong></pre>
<p>Notes:</p>
<ul>
<li>The &#8220;eth0&#8243; mentioned in firejail&#8217;s output blob relates to the interface name <strong>inside</strong> the container. So the &#8220;real&#8221; eth0 remains untouched.</li>
<li>Actual download speed is slightly slower than the configured limit.</li>
<li>New processes can join the existing container with firejail --join, as well as from firetools.</li>
<li>Several containers may use the same bridge (hog0 in the example above), in which case each has its own independent bandwidth setting. Note that the commands configuring the bandwidth limits mention both the container&#8217;s name and the bridge.</li>
</ul>
<h3>Working with browsers</h3>
<p>When starting a browser from within a container, pay attention to  whether it really started a new process. Using firetools can help.</p>
<p>If  Google Chrome says &#8220;Created new window in existing browser session&#8221;, it <strong>didn&#8217;t</strong> start a new process inside the container, in which case the window isn&#8217;t subject to bandwidth limitation.</p>
<p>So close all windows of Chrome before kicking off a new one. Alternatively, this can be worked around by starting the container with:</p>
<pre>$ firejail --net=hog0 --noprofile <strong>--private</strong> --name=nethog</pre>
<p>The --private flag creates, among others, a new <strong>volatile</strong> home directory, so Chrome doesn&#8217;t detect that it&#8217;s already running. Because I use some other disk mounts for the large partitions on my computer, it&#8217;s still possible to download stuff to them from within the container.</p>
<p>But extra care is required with this, and regardless, the new browser doesn&#8217;t remember passwords and such from the private container.</p>
<h3>Using a different version of Google Chrome</h3>
<p>This isn&#8217;t really related, and yet: What if I want to use a different version of Chrome momentarily, without upgrading? This can be done by downloading the .deb package, and extracting its files as shown on <a href="https://billauer.se/blog/2014/11/apt-dpkg-ubuntu-pin/" target="_blank">this post</a>. Then copy the directory opt/google/chrome in the package&#8217;s &#8220;data&#8221; files to somewhere reachable by the jail (e.g. /bulk/transient/google-chrome-105.0/).</p>
<p>All that is left is to start a jail with the --private option as shown above (possibly without the --net flag, if throttling isn&#8217;t required) and go e.g.</p>
<pre>$ /bulk/transient/google-chrome-105.0/chrome &amp;</pre>
<p>So the new browser can run while there are still windows of the old one open. The downside of jailing is that there&#8217;s no access to the persistent data, so the new browser doesn&#8217;t remember passwords. But that&#8217;s also an advantage, because there&#8217;s a chance that the new version would mess up things for the old one.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2021/08/firejail-network-bandwidth-limit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Firejail: Putting a program in its own little container</title>
		<link>https://billauer.se/blog/2020/06/firejail-cgroups/</link>
		<comments>https://billauer.se/blog/2020/06/firejail-cgroups/#comments</comments>
		<pubDate>Thu, 11 Jun 2020 03:39:11 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Server admin]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6049</guid>
		<description><![CDATA[Introduction Firejail is a lightweight security utility which ties the hands of running processes, somewhat like Apparmor and SELinux. However it takes the mission towards Linux kernel&#8217;s cgroups and namespaces. It&#8217;s in fact a bit of a container-style virtualization utility, which creates sandboxes for running specific programs: Instead of a container for an entire operating [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>Firejail is a lightweight security utility which ties the hands of running processes, somewhat like Apparmor and SELinux. However, it tackles the mission with the Linux kernel&#8217;s cgroups and namespaces. It&#8217;s in fact a bit of a container-style virtualization utility, which creates sandboxes for running specific programs: Instead of a container for an entire operating system, it makes one for each application (i.e. the main process and its children). Rather than disallowing access to files and directories by virtue of permissions, it simply makes sure they aren&#8217;t visible to the processes. Same goes for networking.</p>
<p>By virtue of cgroups, several security restrictions are also put in place, whether desired or not. Certain syscalls can be prevented, etc. But at the end of the day, think container virtualization. A sandbox is created, and everything happens inside it. It&#8217;s also easy to add processes to an existing sandbox (in particular, start a new shell). Not to mention the joy of shutting down a sandbox, that is, killing all processes inside it.</p>
<p>While the main use of Firejail is to protect the file system from access and tampering by malicious or infected software, it also allows more or less everything that a container-style virtual machine does: Control of network traffic (volume, dedicated firewall, which physical interfaces are exposed) as well as activity (how many subprocesses, CPU and memory utilization etc.). And like a virtual machine, it also allows statistics on resource usage.</p>
<p>Plus spoofing the host name, restricting access to sound devices, X11 capabilities and a whole range of stuff.</p>
<p>And here&#8217;s the nice thing: It <strong>doesn&#8217;t require root</strong> privileges to run. Sort of. The firejail executable is run with setuid.</p>
<p>It&#8217;s however important to note that firejail <strong>doesn&#8217;t create a stand-alone container</strong>. Rather, it mixes and matches files from the real file system and overrides selected parts of the directory tree with temporary mounts. Or overlays. Or whiteouts.</p>
<p>In fact, compared with the accurate rules of a firewall, its behavior is quite loose and inaccurate. For a newbie, it&#8217;s a bit difficult to predict exactly what kind of sandbox it will set up given this or that setting. It throws in all kinds of files of its own into the temporary directories it creates, which is very helpful for getting things up and running quickly, but it doesn&#8217;t give a feeling of control.</p>
<p>Generally speaking, everything that isn&#8217;t explicitly handled by blacklisting or whitelisting (see below) is accessible in the sandbox just like outside it. In particular, it&#8217;s the user&#8217;s responsibility to hide away all those system-specific mounted filesystems (do you call them /mnt/storage?). If desired, of course.</p>
<p><strong>Major disclaimer:</strong> <strong>This post is not authoritative</strong> in any way, and contains my jots as I get to know the beast. In particular, I may mislead you to think something is protected even though it&#8217;s not. You&#8217;re responsible for your own decisions.</p>
<p>The examples below are with firejail version 0.9.52 on a Linux Mint 19.</p>
<h3>Install</h3>
<pre># apt install firejail
# apt install firetools</pre>
<p>By all means, go</p>
<pre>$ man firejail</pre>
<p>after installation. It&#8217;s also worth looking at /etc/firejail/ to get an idea of what protection measures are typically used.</p>
<h3>Key commands</h3>
<p>Launch FireTools, a GUI front end:</p>
<pre>$ firetools &amp;</pre>
<p>And the &#8220;Tools&#8221; part has a nice listing of running sandboxes (right-click the ugly thing that comes up).</p>
<p>Now some command line examples. I name the sandboxes in these examples, but I&#8217;m not sure it&#8217;s worth bothering.</p>
<p>List existing sandboxes (or use FireTools, right-click the panel and choose Tools):</p>
<pre>$ firejail --list</pre>
<p>Assign a name to a sandbox when creating it</p>
<pre>$ firejail --name=mysandbox firefox</pre>
<p>Shut down a sandbox (kill all its processes, and clean up):</p>
<pre>$ firejail --shutdown=mysandbox</pre>
<p>If a name wasn&#8217;t assigned, the PID given in the list can be used instead.</p>
<p>Disallow the root user in the sandbox</p>
<pre>$ firejail --noroot</pre>
<p>Create overlay filesystem (mounts read/write, but changes are kept elsewhere)</p>
<pre>$ firejail --overlay firefox</pre>
<p>There&#8217;s also --overlay-tmpfs for overlaying tmpfs only, as well as --overlay-clean to clean the overlays, which are stored in $HOME/.firejail.</p>
<p>To create a completely new home directory (and /root) as temporary filesystems (private browsing style), so they are volatile:</p>
<pre>$ firejail --private firefox</pre>
<p>Better still,</p>
<pre>$ firejail --private=/path/to/extra-homedir firefox</pre>
<p>This uses the directory in the given path as a <strong>persistent</strong> home directory (some basic files are added automatically). This path can be anywhere in the filesystem, even in parts that are otherwise hidden (i.e. blacklisted) to the sandbox. So this is probably the most appealing choice in most scenarios.</p>
<p>Don&#8217;t get too excited, though: Other mounted filesystems remain unprotected (at different levels). This just protects the home directory.</p>
<p>By default, a whole bunch of security rules are loaded when firejail is invoked. To start the container without this:</p>
<pre>$ firejail --noprofile</pre>
<p>A profile can be selected with the --profile=filename flag.</p>
<h3>Writing a profile</h3>
<p>If you really want to have a sandbox that protects your computer with relation to a specific piece of software, you&#8217;ll probably have to write your own profile. It&#8217;s no big deal, except that it&#8217;s a bit of trial and error.</p>
<p>First read the manpage:</p>
<pre>$ man firejail-profile</pre>
<p>It&#8217;s easiest to start from a template: Launch FireTools from a shell, right-click the ugly thing that comes up, pick &#8220;Configuration Wizard&#8221;, and create a custom security profile for one of the listed applications &#8212; the one that most resembles the application for which the profile is set up.</p>
<p>Then launch the application from FireTools. The takeaway is that it writes out the configuration file to the console. Start with that.</p>
<h3>Whitelisting and blacklisting</h3>
<p>First and foremost: Always run a</p>
<pre>$ df -h</pre>
<p>inside the sandbox to get an idea of what is mounted. Blacklist anything that isn&#8217;t necessary. Doing so to entire mounts removes the related mount from the df -h list, which makes it easier to spot things that shouldn&#8217;t be there.</p>
<p>It&#8217;s also a good idea to start a sample bash session with the sandbox, and get into the File Manager in the Firetool&#8217;s &#8220;Tools&#8221; section for each sandbox.</p>
<p>But then, what is whitelisting and blacklisting, exactly? These two terms are used all over the docs, somehow assuming we know what they mean. So I&#8217;ll try to nail it down.</p>
<p>Whitelisting isn&#8217;t anywhere near what one would think it is: By whitelisting certain files and/or directories, the original files/directories appear in the sandbox <strong>but all other files in their vicinity are invisible</strong>. Also, changes in the same vicinity are temporary to the sandbox session. The idea seems to be that if files and/or directories are whitelisted, everything else close to it should be out of sight.</p>
<p>Or as put in the man page:</p>
<blockquote><p>A temporary file system is mounted on the top directory, and the whitelisted files are mount-binded inside.  Modifications to whitelisted  files are persistent, everything else is discarded when the sandbox is closed. The top directory could be user home, /dev, /media, /mnt, /opt, /srv, /var, and /tmp.</p></blockquote>
<p>So for example, if any file or directory in the home directory is whitelisted, the entire home directory becomes overridden by an almost empty home directory plus the specifically whitelisted items. For example, from my own home directory (which is populated with a lot of files):</p>
<pre>$ firejail --noprofile --whitelist=/home/eli/this-directory
Parent pid 31560, child pid 31561
Child process initialized in 37.31 ms

$ find .
.
./.config
./.config/pulse
./.config/pulse/client.conf
./this-directory
./this-directory/this-file.txt
./.Xauthority
./.bashrc</pre>
<p>So there&#8217;s just a few temporary files that firejail was kind enough to add for convenience. Changes made in this-directory/ are persistent since it&#8217;s bind-mounted into the temporary directory, but <strong>everything else is temporary.</strong></p>
<p>Quite unfortunately, it&#8217;s not possible to whitelist a directory outside the specific list of hierarchies (unless bind mounting is used, but that requires root). So if the important stuff is on some /hugedisk, only a bind mount will help (or is this the punishment for not putting it as /mnt/hugedisk?).</p>
<p>But note that the --private= flag allows setting the home directory to anywhere on the filesystem (even inside a blacklisted region). This ad-hoc home directory is persistent, so it&#8217;s not like whitelisting, but even better in some scenarios.</p>
<p>Alternatively, it&#8217;s possible to blacklist everything but a certain part of a mount. That&#8217;s a bit tricky, because if a new directory appears after the rules are set, it remains unprotected. I&#8217;ll explain why below.</p>
<p>Or if that makes sense, make the entire directory tree read-only, with only a selected part read-write. That&#8217;s fine if there&#8217;s no issue with data leaking, just the possibility of malware sabotage.</p>
<p>So now to blacklisting: Firejail implements blacklisting by mounting an empty, read-only-by-root file or directory on top of the original file. And indeed,</p>
<pre>$ firejail --blacklist=delme.txt
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-passwdmgr.inc
Reading profile /etc/firejail/disable-programs.inc

** Note: you can use --noprofile to disable default.profile **

Parent pid 30288, child pid 30289
Child process initialized in 57.75 ms
$ ls -l
<span style="color: #888888;"><em>[ ... ]</em></span>
-r--------  1 nobody nogroup     0 Jun  9 22:12 delme.txt
<span style="color: #888888;"><em>[ ... ]</em></span>
$ less delme.txt
delme.txt: Permission denied</pre>
<p>There are --noblacklist and --nowhitelist flags as well. However, these merely cancel future or automatic black- or whitelistings. In particular, one can&#8217;t blacklist a directory and whitelist a subdirectory. It would have been very convenient, but since the parent directory is overridden with a whiteout directory, there is no access to the subdirectory. So each and every subdirectory must be blacklisted separately with a script or something, and even then, if a new subdirectory pops up, it&#8217;s not protected at all.</p>
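<p>Such a script could look like the following sketch. The function name and the example paths are made up, and as just said, a directory created after the flags are generated isn&#8217;t covered:</p>
<pre># Emit a --blacklist= flag for every subdirectory of $1,
# except the one named $2, which remains accessible.
build_blacklist_args() {
  local top="$1" keep="$2" d args=""
  for d in "$top"/*/; do
    if [ "$(basename "$d")" != "$keep" ]; then
      args="$args --blacklist=${d%/}"
    fi
  done
  echo "$args"
}

# For example, hide everything under /home/eli except visible-dir:
# firejail $(build_blacklist_args /home/eli visible-dir) bash</pre>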
<p>There&#8217;s also a --read-only flag that allows setting certain paths and files as read-only. There&#8217;s --read-write too, of course. When a directory or file is whitelisted, it must be flagged read-only separately if so desired (see man firejail).</p>
<h3>Mini-strace</h3>
<p>Trace all processes in the sandbox (in particular accesses to files and network). Much easier than using strace, when all we want is &#8220;which files are accessed?&#8221;</p>
<pre>$ firejail --trace</pre>
<p>And then just run any program to see what files and network sockets it accesses. And things of that sort.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2020/06/firejail-cgroups/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MySQL, OOM killer, overcommitting and other memory related issues</title>
		<link>https://billauer.se/blog/2019/10/mysqld-killed-oom/</link>
		<comments>https://billauer.se/blog/2019/10/mysqld-killed-oom/#comments</comments>
		<pubDate>Sun, 13 Oct 2019 17:21:46 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Server admin]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=5900</guid>
		<description><![CDATA[It started with an error message This post is a bit of a coredump of myself attempting to resolve a sudden web server failure. And even more important, understand why it happened (check on that) and try avoiding it from happening in the future (not as lucky there). I&#8217;ve noticed that there are many threads [...]]]></description>
			<content:encoded><![CDATA[<h3>It started with an error message</h3>
<p>This post is a bit of a coredump of myself attempting to resolve a sudden web server failure. And even more important, understand why it happened (check on that) and try to prevent it from happening in the future (not as lucky there).</p>
<p>I&#8217;ve noticed that there are many threads on the Internet on why mysqld died suddenly, so to make a long story short: mysqld has the exact profile that the OOM killer is looking for: Lots of resident RAM, and it&#8217;s not a system process. Apache gets killed every now and then for the same reason.</p>
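<p>By the way, the OOM killer&#8217;s preferences can be biased. I didn&#8217;t go down this path, but assuming the MySQL server runs as a systemd service, a small override (e.g. via &#8220;systemctl edit mysql&#8221;) lowers its odds of being picked:</p>
<pre>[Service]
# Range is -1000 (never kill) to 1000 (kill first); a moderate
# negative value merely makes mysqld a less attractive victim.
OOMScoreAdjust=-600</pre>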
<p>This post relates to a  VPS hosted Debian 8, kernel 3.10.0, x86_64. The MySQL server is a 5.5.62-0+deb8u1 (Debian).</p>
<p>As always, it started with a mail notification from some cronjob complaining about something. Soon enough it was evident that the MySQL server was down. And as usual, the deeper I investigated this issue, the more I realized that this was just the tip of the iceberg (the kind that doesn&#8217;t melt due to global warming).</p>
<h3>The crash</h3>
<p>So first, it was clear that the MySQL server had restarted itself a couple of days before the disaster:</p>
<pre>191007  9:25:17 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
191007  9:25:17 [Note] Plugin 'FEDERATED' is disabled.
191007  9:25:17 InnoDB: The InnoDB memory heap is disabled
191007  9:25:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
191007  9:25:17 InnoDB: Compressed tables use zlib 1.2.8
191007  9:25:17 InnoDB: Using Linux native AIO
191007  9:25:17 InnoDB: Initializing buffer pool, size = 128.0M
191007  9:25:17 InnoDB: Completed initialization of buffer pool
191007  9:25:17 InnoDB: highest supported file format is Barracuda.
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
191007  9:25:17  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
191007  9:25:19  InnoDB: Waiting for the background threads to start
191007  9:25:20 InnoDB: 5.5.62 started; log sequence number 1427184442
191007  9:25:20 [Note] Server hostname (bind-address): '127.0.0.1'; port: 3306
191007  9:25:20 [Note]   - '127.0.0.1' resolves to '127.0.0.1';
191007  9:25:20 [Note] Server socket created on IP: '127.0.0.1'.
191007  9:25:21 [Note] Event Scheduler: Loaded 0 events
191007  9:25:21 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.62-0+deb8u1'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  (Debian)
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_options' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_options'
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_posts' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_posts'
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_term_taxonomy' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_term_taxonomy'
191007  9:25:28 [ERROR] /usr/sbin/mysqld: Table './mydb/wp_term_relationships' is marked as crashed and should be repaired
191007  9:25:28 [Warning] Checking table:   './mydb/wp_term_relationships'</pre>
<p>And then, two days later, it crashed for real. Or actually, got killed. From the syslog:</p>
<pre>Oct 09 05:30:16 kernel: OOM killed process 22763 (mysqld) total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB</pre>
<p>and</p>
<pre>191009  5:30:17 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
191009  5:30:17 [Note] Plugin 'FEDERATED' is disabled.
191009  5:30:17 InnoDB: The InnoDB memory heap is disabled
191009  5:30:17 InnoDB: Mutexes and rw_locks use GCC atomic builtins
191009  5:30:17 InnoDB: Compressed tables use zlib 1.2.8
191009  5:30:17 InnoDB: Using Linux native AIO
191009  5:30:17 InnoDB: Initializing buffer pool, size = 128.0M
<span style="color: #ff0000;"><strong>InnoDB: mmap(137363456 bytes) failed; errno 12</strong></span>
191009  5:30:17 InnoDB: Completed initialization of buffer pool
191009  5:30:17 InnoDB: Fatal error: cannot allocate memory for the buffer pool
191009  5:30:17 [ERROR] Plugin 'InnoDB' init function returned error.
191009  5:30:17 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
191009  5:30:17 [ERROR] Unknown/unsupported storage engine: InnoDB
191009  5:30:17 [ERROR] Aborting

191009  5:30:17 [Note] /usr/sbin/mysqld: Shutdown complete</pre>
<p>The mmap() is most likely anonymous (i.e. not related to a file), as I couldn&#8217;t find any memory mapped file that is related to the mysql processes (except for the obvious mappings of shared libraries).</p>
<h3>The smoking gun</h3>
<p>But here comes the good part: it turns out that the OOM killer had been active several times before. It just so happened that the killed processes were respawned every time, so nothing appeared wrong. It was the relaunch that failed this time &#8212; otherwise I wouldn&#8217;t have noticed this was going on.</p>
<p>This is the output of plain &#8220;dmesg&#8221;. All OOM entries but the last one were not available with journalctl, as old entries had been deleted.</p>
<pre>[3634197.152028] OOM killed process 776 (mysqld) total-vm:2332652kB, anon-rss:153508kB, file-rss:0kB
[3634197.273914] OOM killed process 71 (systemd-journal) total-vm:99756kB, anon-rss:68592kB, file-rss:4kB
[4487991.904510] OOM killed process 3817 (mysqld) total-vm:2324456kB, anon-rss:135752kB, file-rss:0kB
[4835006.413510] OOM killed process 23267 (mysqld) total-vm:2653112kB, anon-rss:131272kB, file-rss:4404kB
[4835006.767112] OOM killed process 32758 (apache2) total-vm:282528kB, anon-rss:11732kB, file-rss:52kB
[4884915.371805] OOM killed process 825 (mysqld) total-vm:2850312kB, anon-rss:121164kB, file-rss:5028kB
[4884915.509686] OOM killed process 17611 (apache2) total-vm:282668kB, anon-rss:11736kB, file-rss:444kB
[5096265.088151] OOM killed process 23782 (mysqld) total-vm:4822232kB, anon-rss:105972kB, file-rss:3784kB
[5845437.591031] OOM killed process 24642 (mysqld) total-vm:2455744kB, anon-rss:137784kB, file-rss:0kB
[5845437.608682] OOM killed process 3802 (systemd-journal) total-vm:82548kB, anon-rss:51412kB, file-rss:28kB
[6896254.741732] OOM killed process 11551 (mysqld) total-vm:2718652kB, anon-rss:144116kB, file-rss:220kB
[7054957.856153] OOM killed process 22763 (mysqld) total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB</pre>
<p>Or, after calculating the time stamps (using the last OOM message as a reference):</p>
<pre>Fri Aug 30 15:17:36 2019 OOM killed process 776 (mysqld) total-vm:2332652kB, anon-rss:153508kB, file-rss:0kB
Fri Aug 30 15:17:36 2019 OOM killed process 71 (systemd-journal) total-vm:99756kB, anon-rss:68592kB, file-rss:4kB
Mon Sep  9 12:27:30 2019 OOM killed process 3817 (mysqld) total-vm:2324456kB, anon-rss:135752kB, file-rss:0kB
Fri Sep 13 12:51:05 2019 OOM killed process 23267 (mysqld) total-vm:2653112kB, anon-rss:131272kB, file-rss:4404kB
Fri Sep 13 12:51:05 2019 OOM killed process 32758 (apache2) total-vm:282528kB, anon-rss:11732kB, file-rss:52kB
Sat Sep 14 02:42:54 2019 OOM killed process 825 (mysqld) total-vm:2850312kB, anon-rss:121164kB, file-rss:5028kB
Sat Sep 14 02:42:54 2019 OOM killed process 17611 (apache2) total-vm:282668kB, anon-rss:11736kB, file-rss:444kB
Mon Sep 16 13:25:24 2019 OOM killed process 23782 (mysqld) total-vm:4822232kB, anon-rss:105972kB, file-rss:3784kB
Wed Sep 25 05:31:36 2019 OOM killed process 24642 (mysqld) total-vm:2455744kB, anon-rss:137784kB, file-rss:0kB
Wed Sep 25 05:31:36 2019 OOM killed process 3802 (systemd-journal) total-vm:82548kB, anon-rss:51412kB, file-rss:28kB
Mon Oct  7 09:25:13 2019 OOM killed process 11551 (mysqld) total-vm:2718652kB, anon-rss:144116kB, file-rss:220kB
Wed Oct  9 05:30:16 2019 OOM killed process 22763 (mysqld) total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB</pre>
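<p>The conversion above is a simple linear offset: a dmesg timestamp counts seconds since boot, so anchoring one entry whose wall-clock time is known (here, the last OOM message, which also appeared in syslog) translates all the others. A minimal Python sketch, with the timestamps hard-coded for illustration:</p>

```python
from datetime import datetime, timedelta

def dmesg_to_wall(uptime_secs, ref_uptime, ref_wall):
    # A dmesg stamp is seconds since boot, so the wall-clock time is
    # the reference entry's wall-clock time plus the difference
    return ref_wall + timedelta(seconds=uptime_secs - ref_uptime)

# Reference: the last OOM entry, matched with its syslog line
ref_uptime = 7054957.856153
ref_wall = datetime(2019, 10, 9, 5, 30, 16)

# First OOM entry in the dmesg output
first = dmesg_to_wall(3634197.152028, ref_uptime, ref_wall)
print(first.strftime("%a %b %d %H:%M:%S %Y"))  # → Fri Aug 30 15:17:35 2019
```

<p>Truncation of the fractional seconds in the syslog reference can make the result differ by a second from the table above.</p>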
<p>So first, what do these numbers mean? There doesn&#8217;t seem to be an authoritative information source about this, but judging from different sources on the web, it goes like this:</p>
<ul>
<li>total-vm is the total size of the Virtual Memory in use. This isn&#8217;t very relevant (I think), as it involves shared libraries, memory mapped files and other segments that don&#8217;t consume any actual RAM or other valuable resources.</li>
<li>anon-rss is the amount of memory resident in physical RAM that the process consumes by itself (anonymous = not memory mapped to a file or anything like that).</li>
<li>file-rss is the amount of memory that is resident in physical RAM, and is memory mapped to a file (for example, the executable binary).</li>
</ul>
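<p>These fields are easy to pull out of a log line mechanically. A small parsing sketch in Python; the regex is written against the exact log format shown above, and the field meanings follow the interpretation listed here, not any authoritative source:</p>

```python
import re

# Matches the OOM lines as they appear in this kernel's log
OOM_RE = re.compile(
    r"OOM killed process (?P<pid>\d+) \((?P<name>\S+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB")

def parse_oom(line):
    m = OOM_RE.search(line)
    if not m:
        return None
    d = m.groupdict()
    return {"pid": int(d["pid"]), "name": d["name"],
            "total_vm_kb": int(d["total_vm"]),
            "anon_rss_kb": int(d["anon_rss"]),
            "file_rss_kb": int(d["file_rss"])}

line = ("[7054957.856153] OOM killed process 22763 (mysqld) "
        "total-vm:2192796kB, anon-rss:128664kB, file-rss:0kB")
info = parse_oom(line)
print(info["name"], info["anon_rss_kb"] // 1024, "MB resident")  # → mysqld 125 MB resident
```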
<p>Judging from &#8220;top&#8221;, it&#8217;s quite typical for the mysql daemon to have a virtual memory allocation of about 4 GB, and resident memory of about 100-150 MB. The file-rss is most likely the database itself, which happened to be memory mapped (if at all) when the OOM killer went looking for a victim.</p>
<p>So now it&#8217;s clear what happened, and it&#8217;s also quite clear that the mysql daemon did nothing irregular to get killed.</p>
<h3>The MySQL keepaliver</h3>
<p>The MySQL daemon is executed by virtue of a SysV init script, which launches /usr/bin/mysqld_safe, a patch-on-patch script that keeps the daemon alive, no matter what. It restarts the mysqld daemon if the latter dies for any or no reason, and should also produce log messages. On my system, the daemon ends up executed as</p>
<pre>/usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/log/mysql/error.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306</pre>
<p>The script issues log messages when something unexpected happens, but they don&#8217;t appear in /var/log/mysql/error.log or anywhere else, even though the file exists, is owned by the mysql user, and has quite a few messages from the mysql daemon itself.</p>
<p>Changing</p>
<pre>/usr/bin/mysqld_safe &gt; /dev/null 2&gt;&amp;1 &amp;</pre>
<p>to</p>
<pre>/usr/bin/mysqld_safe --syslog &gt; /dev/null 2&gt;&amp;1 &amp;</pre>
<p>was supposed to route mysqld_safe&#8217;s own messages to syslog. Frankly speaking, I don&#8217;t think this made much difference; I&#8217;ve seen nothing new in the logs.</p>
<p>It would have been nicer to have the messages in mysql/error.log, but at least they are visible with journalctl this way.</p>
<h3>Shrinking the InnoDB buffer pool</h3>
<p>As the actual failure was on attempting to map memory for the buffer pool, maybe make it smaller&#8230;?</p>
<p>Launch MySQL as the root user:</p>
<pre>$ mysql -u root --password</pre>
<p>and check the InnoDB status, as suggested on <a href="https://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.html" target="_blank">this page</a>:</p>
<pre>mysql&gt; SHOW ENGINE INNODB STATUS;

<span style="color: #888888;"><em>[ ... ]</em></span>

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 137363456; in additional pool allocated 0
Dictionary memory allocated 1100748
Buffer pool size   8192
Free buffers       6263
Database pages     1912
Old database pages 725
Modified db pages  0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1894, created 18, written 1013
0.00 reads/s, 0.00 creates/s, 0.26 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 1912, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]</pre>
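<p>For cross-checking those numbers: InnoDB reports pool sizes in pages of 16 KiB (the default innodb_page_size). The arithmetic below merely redoes that conversion on the figures from the status output:</p>

```python
PAGE_KIB = 16  # default innodb_page_size is 16 KiB

pool_pages = 8192   # "Buffer pool size" from the status output
free_pages = 6263   # "Free buffers"

pool_mib = pool_pages * PAGE_KIB // 1024
free_pct = 100.0 * free_pages / pool_pages

print(f"pool = {pool_mib} MiB, {free_pct:.0f}% of it free")  # → pool = 128 MiB, 76% of it free
```

<p>The 128 MiB result matches the &#8220;Total memory allocated 137363456&#8221; line above, up to InnoDB&#8217;s own bookkeeping overhead.</p>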
<p>I&#8217;m really not an expert, but if &#8220;Free buffers&#8221; is 75% of the total allocated space, I&#8217;ve probably allocated too much. So I reduced it to 32 MB  &#8212; it&#8217;s not like I&#8217;m running a high-end server. I added /etc/mysql/conf.d/innodb_pool_size.cnf (owned by root, 0644) reading:</p>
<pre># Reduce InnoDB buffer size from default 128 MB to 32 MB
[mysqld]
innodb_buffer_pool_size=32M</pre>
<p>Restarting the daemon, it says:</p>
<pre>----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 34340864; in additional pool allocated 0
Dictionary memory allocated 1100748
Buffer pool size   2047
Free buffers       856
Database pages     1189
Old database pages 458</pre>
<h3>And finally, repair the tables</h3>
<p>Remember those warnings that the tables were marked as crashed? That&#8217;s the easy part:</p>
<pre>$ mysqlcheck -A --auto-repair</pre>
<p>That went smoothly, with no complaints. After all, it wasn&#8217;t really a crash.</p>
<h3>Some general words on OOM</h3>
<p>This whole idea that the kernel should do Roman Empire style decimation of processes is widely criticized by many, but it&#8217;s probably not such a bad idea. The root cause lies in the fact that the kernel agrees to allocate more RAM than it actually has. This is possible because the kernel doesn&#8217;t really allocate RAM when a process asks for memory with a brk() call; it only allocates the memory address segment. The actual RAM is allocated only when the process attempts to access a page that hasn&#8217;t been backed with RAM yet. The access attempt causes a page fault, the kernel quickly assigns a physical page, and returns from the page fault handler as if nothing happened.</p>
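<p>This lazy allocation is easy to observe from userspace. The sketch below (Python, Linux-only, since it reads /proc/self/status) maps a large anonymous region and watches VmSize jump while VmRSS barely moves until pages are actually touched:</p>

```python
import mmap

def status_kb(field):
    # Read a "VmSize:  1234 kB"-style field from /proc/self/status
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

SIZE = 512 * 1024 * 1024  # ask for 512 MiB of anonymous memory

vsz0, rss0 = status_kb("VmSize"), status_kb("VmRSS")
region = mmap.mmap(-1, SIZE)  # promised, but no physical page assigned yet
vsz1, rss1 = status_kb("VmSize"), status_kb("VmRSS")

# Touch the first 16 MiB, one byte per 4 KiB page: each access page-faults,
# and only now does the kernel back the page with real RAM
for off in range(0, 16 * 1024 * 1024, 4096):
    region[off] = 1
rss2 = status_kb("VmRSS")

print(f"VmSize grew by {(vsz1 - vsz0) // 1024} MiB on mmap()")
print(f"VmRSS grew by {(rss1 - rss0) // 1024} MiB on mmap(), "
      f"{(rss2 - rss1) // 1024} MiB after touching pages")
```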
<p>So when the kernel responds with an -ENOMEM, it&#8217;s not because it doesn&#8217;t have any RAM, but because it doesn&#8217;t want to.</p>
<p>More precisely, the kernel keeps account of how much memory it has given away (system-wise and/or cgroup-wise) and makes a decision. The common policy is to overcommit to some extent &#8212; that is, to allow the total allocated RAM to exceed the total physical RAM. Even, and in particular, if there&#8217;s no swap.</p>
<p>The common figure is to overcommit by 50%: for a 64 GiB RAM computer, there might be 96 GiB of promised RAM. This may seem an awfully stupid thing to do, but hey, it works. If that concept worries you, modern banking (with real money, that is) might worry you even more.</p>
<p>The problem arises when the processes run to the bank. That is, when the processes access the RAM they&#8217;ve been promised, and at some point the kernel has nowhere to take memory from. Let&#8217;s assume there&#8217;s no swap, all disk buffers have been flushed, all rabbits have been pulled. There&#8217;s a process waiting for memory, and it can&#8217;t go back running until the problem has been resolved.</p>
<p>Linux&#8217; solution to this situation is to select a process with a lot of RAM and little importance. How the kernel does that judgement is documented everywhere. The important point is that it&#8217;s not necessarily the process that triggered the event, and that it will usually be the same victim over and over again. In my case, mysqld is the favorite. Big, fat, and not a system process.</p>
<p>Thinking about it, the OOM is a good solution to get out of a tricky situation. The alternative would have been to deny memory to processes just launched, including the administrator&#8217;s attempt to rescue the system. Or an attempt to shut it down with some dignity. So sacrificing a large and hopefully not-so-important process isn&#8217;t such a bad idea.</p>
<h3>Why did the OOM kick in?</h3>
<p>This all took place on a VPS virtual machine with 1 GB leased RAM. With the stuff running on that machine, there&#8217;s no reason in the world that the total actual RAM consumption would reach that limit. This is a system that typically has 70% of its memory marked as &#8220;cached&#8221; (i.e. used by disk cache). This should be taken with a grain of salt, as &#8220;top&#8221; displays data from a somewhat bogus /proc/meminfo, and still. See below on how to check the actual memory consumption.</p>
<p>As can be seen in the dmesg logs above, the amount of resident RAM of the killed mysqld process was 120-150 MB or so. Together with the other memory hog, apache2, they reach 300 MB. That&#8217;s it. No reason for anything drastic.</p>
<p>Having said that, it&#8217;s remarkable that the total-vm stood at 2.1-4.6 GB when killed. This is much higher than the typical 900 MB visible usually. So maybe there&#8217;s some kind of memory leak, even if it&#8217;s harmless? Looking at mysql over time, its virtual memory allocation tends to grow.</p>
<p>VPS machines do have a physical memory limit imposed, by virtue of the relevant cgroup&#8217;s <a href="https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html" target="_blank">memory.high and memory.max</a> limits. In particular the latter &#8212; if the cgroup&#8217;s total consumption exceeds memory.max, OOM kicks in. This is how the illusion of an independent RAM segment is made on a VPS machine. Plus faking some /proc files.</p>
<p>But there&#8217;s another explanation: Say that a VPS service provider takes a computer with 16 GB RAM, and places 16 VPS machines with 1 GB leased RAM each. What will the overall actual RAM consumption be? I would expect it to be much lower than 16 GB. So why not add a few more VPS machines, and make some good use of the hardware? It&#8217;s where the profit comes from.</p>
<p>Most of the time, there will be no problem. But occasionally, this will cause RAM shortages, in which case the kernel&#8217;s <strong>global</strong> OOM looks for a victim. I suppose there&#8217;s no significance to cgroups in this matter. In other words, the kernel sees all processes in the system the same, regardless of which cgroup (and hence VPS machine) they belong to. Which means that the process killed doesn&#8217;t necessarily belong to the VPS that triggered the problem. The processes of one VPS may suddenly demand their memory, but some other VPS will have its processes killed.</p>
<h3>Conclusion</h3>
<ul>
<li>Shrinking the buffer pool of mysqld was probably a good idea, in particular if a computer-wide OOM killed the process &#8212; odds are that it will kill some other mysqld instead this way.</li>
<li>Possibly restart mysql with a cronjob every day to keep its memory consumption under control. But this might create problems of its own.</li>
<li>It&#8217;s high time to replace the VPS guest with KVM or similar.</li>
</ul>
<h3>Does my VPS need more RAM?</h3>
<p>There&#8217;s &#8220;free&#8221; and &#8220;top&#8221; and several other utilities for telling you the status of the memory, but they don&#8217;t answer a simple question: do the applications eat up too much memory?</p>
<p>So the way to tell is to ask /proc/meminfo directly (note that this was taken on my OpenVZ machine, not bare metal):</p>
<pre>$ cat /proc/meminfo
<strong><span style="color: #ff0000;">MemTotal:        1048576 kB</span></strong>
MemFree:           34256 kB
<strong><span style="color: #ff0000;">MemAvailable:     741536 kB</span></strong>
Cached:           719728 kB
Buffers:               0 kB
Active:           485536 kB
Inactive:         454964 kB
Active(anon):     142584 kB
Inactive(anon):   110608 kB
Active(file):     342952 kB
Inactive(file):   344356 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       1048576 kB
SwapFree:         984960 kB
Dirty:                48 kB
Writeback:             0 kB
AnonPages:        253192 kB
Shmem:             32420 kB
Slab:              35164 kB
SReclaimable:      19972 kB
SUnreclaim:        15192 kB</pre>
<p>The tricky part about memory estimation is that the kernel attempts to use all RAM it has for something. So a lot goes to disk cache (&#8220;Cached&#8221;, 719728 kB in the example). This memory is reclaimed immediately if needed by an application, so it should be counted as free RAM.</p>
<p>Therefore, MemAvailable is the place to look. It&#8217;s a rough estimation of how much memory applications can ask for, and comparing it with MemTotal, clearly 70% of the memory is free. So the VPS server has plenty of RAM.</p>
<p>And yes, it&#8217;s a bit surprising that each VPS has its own disk cache, but it actually makes sense. Why should one guest wipe out another one&#8217;s disk cache?</p>
<p>There&#8217;s also swap memory allocated on the machine, almost all of which is unused. This is a common situation when there&#8217;s no shortage of memory. The amount of swap memory in use (SwapTotal &#8211; SwapFree) should be added to the calculation of how much memory applications use (63616 kB in the case above). So in reality, applications eat up 307040 kB of real RAM and 63616 kB of swap, 370656 kB in total. That&#8217;s 35% of the real RAM.</p>
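<p>This arithmetic generalizes into a few lines of code. The sketch below parses /proc/meminfo-style text; the sample is hard-coded to the output shown above, but the same function works on the live file:</p>

```python
SAMPLE = """\
MemTotal:        1048576 kB
MemAvailable:     741536 kB
SwapTotal:       1048576 kB
SwapFree:         984960 kB
"""

def parse_meminfo(text):
    info = {}
    for line in text.splitlines():
        name, rest = line.split(":", 1)
        info[name] = int(rest.split()[0])  # values are in kB
    return info

mi = parse_meminfo(SAMPLE)
ram_used_kb = mi["MemTotal"] - mi["MemAvailable"]  # what apps really hold
swap_used_kb = mi["SwapTotal"] - mi["SwapFree"]
total_kb = ram_used_kb + swap_used_kb
pct = 100.0 * total_kb / mi["MemTotal"]
print(f"apps: {ram_used_kb} kB RAM + {swap_used_kb} kB swap = "
      f"{total_kb} kB ({pct:.0f}% of RAM)")
```

<p>For the live system, replace SAMPLE with the contents of /proc/meminfo.</p>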
<p>The reason I get into the topic of swap is that one of the answers I got from my VPS service provider&#8217;s support was that the problem with swap is that all VPS machines have a common pool of swap on the disk, so getting swap can fail. But that doesn&#8217;t explain an OOM kill, as there&#8217;s plenty of real RAM to begin with.</p>
<p>If you insist on looking at the output of &#8220;top&#8221;, the place is &#8220;cached Mem&#8221; + free:</p>
<pre>KiB Mem:   1048576 total,   990028 used,    <span style="color: #ff0000;"><strong>58548 free</strong></span>,        0 buffers
KiB Swap:  1048576 total,    62756 used,   985820 free.   <span style="color: #ff0000;"><strong>715008 cached Mem</strong></span></pre>
<p>but each &#8220;top&#8221; utility displays this data differently. In this case, putting it on the same line as info on swap memory is misleading.</p>
<p>Or, using &#8220;free&#8221;:</p>
<pre>$ free
           total       used       free     shared    buffers     cached
Mem:       1048576    1000164      <span style="color: #ff0000;"><strong>48412</strong></span>      35976          0     <span style="color: #ff0000;"><strong>715888</strong></span>
-/+ buffers/cache:     284276     764300
Swap:      1048576      62716     985860</pre>
<p>Once again, it&#8217;s cached memory + free. With this utility, they are both on the same line, as they should be.</p>
<h3>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-</h3>
<h3>Rambling epilogue: Some thoughts about overcommitting</h3>
<p>The details of how overcommitting is accounted for are given in the kernel tree&#8217;s <a href="https://www.kernel.org/doc/Documentation/vm/overcommit-accounting" target="_blank">Documentation/vm/overcommit-accounting</a>. But to make a long story short, it&#8217;s done in a sensible way. In particular, if a piece of memory is shared by threads and processes, it&#8217;s only accounted for once.</p>
<p>Relevant files: /proc/meminfo and /proc/vmstat</p>
<p>It seems like CommitLimit and Committed_AS are not available on a VPS guest system. But the OOM killer probably knows these values (or was it because /proc/sys/vm/overcommit_memory was set to 1 on my system, meaning &#8220;Always overcommit&#8221;?).</p>
<p>To get a list of the current memory hogs, run &#8220;top&#8221; and press shift-M as it&#8217;s running.</p>
<p>To get an idea on how a process behaves, use pmap -x. For example, looking at a mysqld process (run as root, or no memory map will be shown):</p>
<pre># pmap -x 14817
14817:   /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/log/mysql/error.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306
Address           Kbytes     RSS   Dirty Mode  Mapping
000055c5617ac000   10476    6204       0 r-x-- mysqld
000055c5623e6000     452     452     452 r---- mysqld
000055c562457000     668     412     284 rw--- mysqld
000055c5624fe000     172     172     172 rw---   [ anon ]
000055c563e9b000    6592    6448    6448 rw---   [ anon ]
00007f819c000000    2296     320     320 rw---   [ anon ]
<span style="color: #ff0000;"><strong>00007f819c23e000   63240       0       0 -----   [ anon ]
</strong></span>00007f81a0000000    3160     608     608 rw---   [ anon ]
<span style="color: #ff0000;"><strong>00007f81a0316000   62376       0       0 -----   [ anon ]
</strong></span>00007f81a4000000    9688    7220    7220 rw---   [ anon ]
<span style="color: #ff0000;"><strong>00007f81a4976000   55848       0       0 -----   [ anon ]
</strong></span>00007f81a8000000     132       8       8 rw---   [ anon ]
<span style="color: #ff0000;"><strong>00007f81a8021000   65404       0       0 -----   [ anon ]
</strong></span>00007f81ac000000     132       4       4 rw---   [ anon ]
<span style="color: #ff0000;"><strong>00007f81ac021000   65404       0       0 -----   [ anon ]
</strong></span>00007f81b1220000       4       0       0 -----   [ anon ]
00007f81b1221000    8192       8       8 rw---   [ anon ]
00007f81b1a21000       4       0       0 -----   [ anon ]
00007f81b1a22000    8192       8       8 rw---   [ anon ]
00007f81b2222000       4       0       0 -----   [ anon ]
00007f81b2223000    8192       8       8 rw---   [ anon ]
00007f81b2a23000       4       0       0 -----   [ anon ]
00007f81b2a24000    8192      20      20 rw---   [ anon ]
00007f81b3224000       4       0       0 -----   [ anon ]
00007f81b3225000    8192       8       8 rw---   [ anon ]
00007f81b3a25000       4       0       0 -----   [ anon ]
00007f81b3a26000    8192       8       8 rw---   [ anon ]
00007f81b4226000       4       0       0 -----   [ anon ]
00007f81b4227000    8192       8       8 rw---   [ anon ]
00007f81b4a27000       4       0       0 -----   [ anon ]
00007f81b4a28000    8192       8       8 rw---   [ anon ]
00007f81b5228000       4       0       0 -----   [ anon ]
00007f81b5229000    8192       8       8 rw---   [ anon ]
00007f81b5a29000       4       0       0 -----   [ anon ]
00007f81b5a2a000    8192       8       8 rw---   [ anon ]
00007f81b622a000       4       0       0 -----   [ anon ]
00007f81b622b000    8192      12      12 rw---   [ anon ]
00007f81b6a2b000       4       0       0 -----   [ anon ]
00007f81b6a2c000    8192       8       8 rw---   [ anon ]
00007f81b722c000       4       0       0 -----   [ anon ]
00007f81b722d000   79692   57740   57740 rw---   [ anon ]
00007f81bc000000     132      76      76 rw---   [ anon ]
<span style="color: #ff0000;"><strong>00007f81bc021000   65404       0       0 -----   [ anon ]
</strong></span>00007f81c002f000    2068    2052    2052 rw---   [ anon ]
00007f81c03f9000       4       0       0 -----   [ anon ]
00007f81c03fa000     192      52      52 rw---   [ anon ]
00007f81c042a000       4       0       0 -----   [ anon ]
00007f81c042b000     192      52      52 rw---   [ anon ]
00007f81c045b000       4       0       0 -----   [ anon ]
00007f81c045c000     192      64      64 rw---   [ anon ]
00007f81c048c000       4       0       0 -----   [ anon ]
00007f81c048d000     736     552     552 rw---   [ anon ]
00007f81c0545000      20       4       0 rw-s- [aio] (deleted)
00007f81c054a000      20       4       0 rw-s- [aio] (deleted)
00007f81c054f000    3364    3364    3364 rw---   [ anon ]
00007f81c0898000      44      12       0 r-x-- libnss_files-2.19.so
00007f81c08a3000    2044       0       0 ----- libnss_files-2.19.so
00007f81c0aa2000       4       4       4 r---- libnss_files-2.19.so
00007f81c0aa3000       4       4       4 rw--- libnss_files-2.19.so
00007f81c0aa4000      40      20       0 r-x-- libnss_nis-2.19.so
00007f81c0aae000    2044       0       0 ----- libnss_nis-2.19.so
00007f81c0cad000       4       4       4 r---- libnss_nis-2.19.so
00007f81c0cae000       4       4       4 rw--- libnss_nis-2.19.so
00007f81c0caf000      28      20       0 r-x-- libnss_compat-2.19.so
00007f81c0cb6000    2044       0       0 ----- libnss_compat-2.19.so
00007f81c0eb5000       4       4       4 r---- libnss_compat-2.19.so
00007f81c0eb6000       4       4       4 rw--- libnss_compat-2.19.so
00007f81c0eb7000       4       0       0 -----   [ anon ]
00007f81c0eb8000    8192       8       8 rw---   [ anon ]
00007f81c16b8000      84      20       0 r-x-- libnsl-2.19.so
00007f81c16cd000    2044       0       0 ----- libnsl-2.19.so
00007f81c18cc000       4       4       4 r---- libnsl-2.19.so
00007f81c18cd000       4       4       4 rw--- libnsl-2.19.so
00007f81c18ce000       8       0       0 rw---   [ anon ]
00007f81c18d0000    1668     656       0 r-x-- libc-2.19.so
00007f81c1a71000    2048       0       0 ----- libc-2.19.so
00007f81c1c71000      16      16      16 r---- libc-2.19.so
00007f81c1c75000       8       8       8 rw--- libc-2.19.so
00007f81c1c77000      16      16      16 rw---   [ anon ]
00007f81c1c7b000      88      44       0 r-x-- libgcc_s.so.1
00007f81c1c91000    2044       0       0 ----- libgcc_s.so.1
00007f81c1e90000       4       4       4 rw--- libgcc_s.so.1
00007f81c1e91000    1024     128       0 r-x-- libm-2.19.so
00007f81c1f91000    2044       0       0 ----- libm-2.19.so
00007f81c2190000       4       4       4 r---- libm-2.19.so
00007f81c2191000       4       4       4 rw--- libm-2.19.so
00007f81c2192000     944     368       0 r-x-- libstdc++.so.6.0.20
00007f81c227e000    2048       0       0 ----- libstdc++.so.6.0.20
00007f81c247e000      32      32      32 r---- libstdc++.so.6.0.20
00007f81c2486000       8       8       8 rw--- libstdc++.so.6.0.20
00007f81c2488000      84       8       8 rw---   [ anon ]
00007f81c249d000      12       8       0 r-x-- libdl-2.19.so
00007f81c24a0000    2044       0       0 ----- libdl-2.19.so
00007f81c269f000       4       4       4 r---- libdl-2.19.so
00007f81c26a0000       4       4       4 rw--- libdl-2.19.so
00007f81c26a1000      32       4       0 r-x-- libcrypt-2.19.so
00007f81c26a9000    2044       0       0 ----- libcrypt-2.19.so
00007f81c28a8000       4       4       4 r---- libcrypt-2.19.so
00007f81c28a9000       4       4       4 rw--- libcrypt-2.19.so
00007f81c28aa000     184       0       0 rw---   [ anon ]
00007f81c28d8000      36      28       0 r-x-- libwrap.so.0.7.6
00007f81c28e1000    2044       0       0 ----- libwrap.so.0.7.6
00007f81c2ae0000       4       4       4 r---- libwrap.so.0.7.6
00007f81c2ae1000       4       4       4 rw--- libwrap.so.0.7.6
00007f81c2ae2000       4       4       4 rw---   [ anon ]
00007f81c2ae3000     104      12       0 r-x-- libz.so.1.2.8
00007f81c2afd000    2044       0       0 ----- libz.so.1.2.8
00007f81c2cfc000       4       4       4 r---- libz.so.1.2.8
00007f81c2cfd000       4       4       4 rw--- libz.so.1.2.8
00007f81c2cfe000       4       4       0 r-x-- libaio.so.1.0.1
00007f81c2cff000    2044       0       0 ----- libaio.so.1.0.1
00007f81c2efe000       4       4       4 r---- libaio.so.1.0.1
00007f81c2eff000       4       4       4 rw--- libaio.so.1.0.1
00007f81c2f00000      96      84       0 r-x-- libpthread-2.19.so
00007f81c2f18000    2044       0       0 ----- libpthread-2.19.so
00007f81c3117000       4       4       4 r---- libpthread-2.19.so
00007f81c3118000       4       4       4 rw--- libpthread-2.19.so
00007f81c3119000      16       4       4 rw---   [ anon ]
00007f81c311d000     132     112       0 r-x-- ld-2.19.so
00007f81c313e000       8       0       0 rw---   [ anon ]
00007f81c3140000      20       4       0 rw-s- [aio] (deleted)
00007f81c3145000      20       4       0 rw-s- [aio] (deleted)
00007f81c314a000      20       4       0 rw-s- [aio] (deleted)
00007f81c314f000      20       4       0 rw-s- [aio] (deleted)
00007f81c3154000      20       4       0 rw-s- [aio] (deleted)
00007f81c3159000      20       4       0 rw-s- [aio] (deleted)
00007f81c315e000      20       4       0 rw-s- [aio] (deleted)
00007f81c3163000      20       4       0 rw-s- [aio] (deleted)
00007f81c3168000    1840    1840    1840 rw---   [ anon ]
00007f81c3334000       8       0       0 rw-s- [aio] (deleted)
00007f81c3336000       4       0       0 rw-s- [aio] (deleted)
00007f81c3337000      24      12      12 rw---   [ anon ]
00007f81c333d000       4       4       4 r---- ld-2.19.so
00007f81c333e000       4       4       4 rw--- ld-2.19.so
00007f81c333f000       4       4       4 rw---   [ anon ]
00007ffd2d68b000     132      68      68 rw---   [ stack ]
00007ffd2d7ad000       8       4       0 r-x--   [ anon ]
ffffffffff600000       4       0       0 r-x--   [ anon ]
---------------- ------- ------- -------
total kB          640460   89604   81708</pre>
<p>The Kbytes and RSS columns&#8217; totals at the bottom match the VIRT and RES figures shown by &#8220;top&#8221;.</p>
<p>I should emphasize that this is a freshly started mysqld process. Give it a few days to run, and some extra 100 MB of virtual space is added (not clear why), plus some real RAM, depending on the setting.</p>
<p>I&#8217;ve marked six anonymous segments that are completely virtual (no resident memory at all), summing up to ~370 MB. This means that ~370 MB are accounted for as committed memory at least once &#8212; and that&#8217;s for a process that only uses 90 MB for real.</p>
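<p>Summing those reserved-but-untouched segments can be automated: pmap -x output is columnar, so filtering for anonymous, inaccessible (-----) mappings with zero RSS does the job. A sketch over the six highlighted lines (on a full dump this filter also catches the small 4 kB anonymous guard pages, which barely affect the total):</p>

```python
PMAP_LINES = """\
00007f819c23e000   63240       0       0 -----   [ anon ]
00007f81a0316000   62376       0       0 -----   [ anon ]
00007f81a4976000   55848       0       0 -----   [ anon ]
00007f81a8021000   65404       0       0 -----   [ anon ]
00007f81ac021000   65404       0       0 -----   [ anon ]
00007f81bc021000   65404       0       0 -----   [ anon ]
"""

def reserved_anon_kb(pmap_text):
    """Sum pmap -x segments that are anonymous, inaccessible (-----)
    and have no resident pages: address space promised but unused."""
    total = 0
    for line in pmap_text.splitlines():
        fields = line.split()
        # Columns: address, Kbytes, RSS, Dirty, Mode, Mapping...
        if (len(fields) >= 6 and fields[4] == "-----"
                and fields[2] == "0" and fields[5] == "["):
            total += int(fields[1])
    return total

kb = reserved_anon_kb(PMAP_LINES)
print(f"{kb} kB of purely virtual allocation")  # → 377676 kB of purely virtual allocation
```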
<p>My own anecdotal test on another machine with a 4.4.0 kernel showed that putting /proc/sys/vm/overcommit_ratio below what was actually committed (making /proc/meminfo&#8217;s CommitLimit smaller than Committed_AS) didn&#8217;t have any effect unless /proc/sys/vm/overcommit_memory was set to 2. And when I did that, the OOM wasn&#8217;t called, but instead I had a hard time running new commands:</p>
<pre># echo 2 &gt; /proc/sys/vm/overcommit_memory
# cat /proc/meminfo
bash: fork: Cannot allocate memory</pre>
<p>So this is what it looks like when memory runs out and the system refuses to play ball.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2019/10/mysqld-killed-oom/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Upgrading to Linux Mint 19, running the old system in a chroot</title>
		<link>https://billauer.se/blog/2018/11/linux-chroot-system-in-parallel/</link>
		<comments>https://billauer.se/blog/2018/11/linux-chroot-system-in-parallel/#comments</comments>
		<pubDate>Thu, 29 Nov 2018 20:30:10 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[systemd]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=5605</guid>
		<description><![CDATA[Background Archaeological findings have revealed that prehistoric humans buried their forefathers under the floor of their huts. Fast forward to 2018, yours truly decided to continue running the (ancient) Fedora 12 as a chroot when migrating to Linux Mint 19. That&#8217;s an eight years difference. While a lot of Linux users are happy to just [...]]]></description>
			<content:encoded><![CDATA[<h3>Background</h3>
<p>Archaeological findings have revealed that prehistoric humans buried their forefathers under the floor of their huts. Fast forward to 2018, yours truly decided to continue running the (ancient) Fedora 12 as a chroot when migrating to Linux Mint 19. That&#8217;s an eight years difference.</p>
<p>While a lot of Linux users are happy to just install the new system and migrate everything &#8220;automatically&#8221;, this isn&#8217;t a good idea if you&#8217;re into more than plain tasks. Upgrading is supposed to be smooth, but small changes in the default behavior, API or whatever always make things that worked before fail, sometimes with significant damage. Of the sort of not receiving emails, backup jobs not really working as before, and so on. Or just <a href="https://billauer.se/blog/2018/11/fsck-inode-checksum-errors-resize2fs/" target="_blank">a new bug</a>.</p>
<p>I&#8217;ve talked with quite a few sysadmins who were responsible for computers that actually needed to work continuously and reliably, and it wasn&#8217;t long before the apology for their ancient Linux distribution arrived. There&#8217;s no need to apologize: upgrading is not good for keeping the system running smoothly. If it ain&#8217;t broke, don&#8217;t fix it.</p>
<p>But after some time, the hardware gets old and it becomes difficult to install new software. So I had this idea to keep running the old computer, with all of its properly running services and cronjobs, as a virtual machine. And then I thought, maybe go VPS-style. And then I realized I don&#8217;t need the VPS isolation at all. So the idea is to keep the old system as a chroot inside the new one.</p>
<p>Some services (httpd, mail handling, dhcpd) will keep running in the chroot, and others (the desktop in particular, with new shiny GUI programs) running natively. Old and new on the same machine.</p>
<p>The trick is making sure that one doesn&#8217;t step on the other&#8217;s toes. These are my insights from getting this up and running.</p>
<h3>The basics</h3>
<p>The idea is to place the old root filesystem (only) into somewhere in the new system, and chroot into it for the sake of running services and oldschool programs:</p>
<ul>
<li>The old root is placed as e.g. /oldy-root/ in the new filesystem (note that oldy is a legit alternative spelling for oldie&#8230;).</li>
<li>bind-mounts are used for a unified view of home directories and those containing data.</li>
<li>Some services are executed from within the chroot environment. How to run them from Mint 19 (hence using systemd) is described below.</li>
<li>Running old programs is also possible by chrooting from shell. This is also discussed below.</li>
</ul>
<p>Don&#8217;t put the old root on a filesystem that contains useful data, because odds are that such a filesystem will be bind-mounted into the chrooted filesystem, which creates a directory tree loop. Then try calculating disk space, or <a href="https://billauer.se/blog/2018/11/tar-bind-mount/" target="_blank">backing up with tar</a>. So pick a separate filesystem (i.e. a separate partition or LVM volume), or possibly a subdirectory of the same filesystem as the &#8220;real&#8221; root.</p>
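<p>To spot such loops before they bite, it can help to list the mounts under the old root, and to keep du from crossing filesystem boundaries. A minimal sketch, assuming the /oldy-root layout used here (findmnt comes with util-linux):</p>

```shell
# List every mount whose target is under /oldy-root
findmnt -l | grep /oldy-root

# du with -x stays on one filesystem, so it won't descend into
# bind mounts whose source lives on a different filesystem
du -shx /oldy-root
```

Note that -x only protects against loops through bind mounts from other filesystems; a bind mount whose source is on the same filesystem as /oldy-root is still traversed.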
<h3>Bind mounting</h3>
<p>This is where the tricky choices are made. The point is to make the old and new systems see more or less the same application data, and also allow software to communicate over /tmp. So this is the relevant part in my /etc/fstab:</p>
<pre># Bind mounts for oldy root: system essentials
/dev                        /oldy-root/dev none bind                0       2
/dev/pts                    /oldy-root/dev/pts none bind            0       2
/dev/shm                    /oldy-root/dev/shm none bind            0       2
/sys                        /oldy-root/sys none bind                0       2
/proc                       /oldy-root/proc none bind               0       2

# Bind mounts for oldy root: Storage
/home                       /oldy-root/home none bind               0       2
/storage                    /oldy-root/storage none bind            0       2
/tmp                        /oldy-root/tmp  none bind               0       2
/mnt                        /oldy-root/mnt  none bind               0       2
/media                      /oldy-root/media none bind              0       2</pre>
<p>Most notable are /mnt and /media. Bind-mounting these allows temporary mounts to be visible at both sides. /tmp is required for the UNIX domain socket used for playing sound from the old system. And other sockets, I suppose.</p>
<p>Note that /run<strong> isn&#8217;t </strong>bind-mounted. The reason is that its tree structure has changed, so it&#8217;s quite pointless (the mount point used to be /var/run, and the place of the runtime files tends to change with time). The motivation for bind mounting would have been to let software from the old and new systems interact, and indeed, there are a few UNIX sockets there, most notably the DBus domain UNIX socket.</p>
<p>But DBus is a good example of how hopeless it is to bind-mount /run: Old software attempting to talk with the Console Kit on the new DBus server fails completely at the protocol level (or namespace? I didn&#8217;t really dig into that).</p>
<p>So just copy the old /var/run into the root filesystem and that&#8217;s it. CUPS ran smoothly, GUI programs run fairly OK, and sound is done through a UNIX domain socket as suggested in the comments of <a href="https://billauer.se/blog/2014/01/pa-multiple-users/" target="_blank">this post</a>.</p>
<p>I opted out on bind mounting /lib/modules and /usr/src. This makes manipulations of kernel modules (as needed by VMware, for example) impossible from the old system. But gcc is outdated for compiling under the new Linux kernel build system, so there was little point.</p>
<p>/root isn&#8217;t bind-mounted either. I wasn&#8217;t so sure about that, but in the end, it&#8217;s not a very useful directory. Keeping them separate makes the shell history for the root user distinct, and that&#8217;s actually a good thing.</p>
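<p>For the record, the fstab entries above can be applied without a reboot. A sketch, run as root, assuming the bind mounts listed earlier:</p>

```shell
# Create the mount points inside the old root (order matters: dev before dev/pts)
for d in dev dev/pts dev/shm sys proc home storage tmp mnt media; do
  mkdir -p "/oldy-root/$d"
done

# Mount everything listed in /etc/fstab that isn't already mounted
mount -a

# Sanity check: the bind mounts should now show up
findmnt -l | grep /oldy-root
```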
<h3>Make /dev/log for real</h3>
<p>Almost all service programs (and others) send messages to the system log by writing to the UNIX domain socket /dev/log. It&#8217;s actually a misnomer, because /dev/log is not a device file. But you don&#8217;t break tradition.</p>
<p><strong>WARNING:</strong> If the logging server doesn&#8217;t work properly, Linux will fail to boot, dropping you into a tiny busybox rescue shell. So before playing with this, reboot to verify all is fine, and then make the changes. Be sure to prepare yourself for reverting your changes with plain command-line utilities (cp, mv, cat) and reboot to make sure all is fine.</p>
<p>In Mint 19 (and onwards), logging is handled by systemd-journald, which is a godsend. However, for some reason (does anyone know why? Kindly comment below), the UNIX domain socket it creates is placed at /run/systemd/journal/dev-log, and /dev/log is a symlink to it. There are a few bug reports out there on software refusing to log into a symlink.</p>
<p>But that&#8217;s small potatoes: Since I decided not to bind-mount /run, there&#8217;s no access to this socket from the old system.</p>
<p>The solution is to swap the two: Make /dev/log the UNIX socket (as it was before), and /run/systemd/journal/dev-log the symlink (I wonder if the latter is necessary). To achieve this, copy /lib/systemd/system/systemd-journald-dev-log.socket into /etc/systemd/system/systemd-journald-dev-log.socket. This will make the latter override the former (keep the file name accurate), and make the change survive possible upgrades &#8212; the file in /lib can be overwritten by apt, the one in /etc won&#8217;t be by convention.</p>
<p>Edit the file in /etc, in the part saying:</p>
<pre>[Socket]
Service=systemd-journald.service
<span style="color: #888888;"><strong><span style="color: #ff0000;">ListenDatagram=/run/systemd/journal/dev-log
Symlinks=/dev/log</span>
</strong></span>SocketMode=0666
PassCredentials=yes
PassSecurity=yes</pre>
<p>and swap the files, making it</p>
<pre>ListenDatagram=/dev/log
Symlinks=/run/systemd/journal/dev-log</pre>
<p>instead.</p>
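<p>After editing the file, the change can be applied and verified without a full reboot. A sketch, under the assumption that restarting the socket unit on a live system goes smoothly (a reboot is the safer test, as warned above):</p>

```shell
# Pick up the edited .socket file
systemctl daemon-reload

# Restart the socket unit (journald itself keeps running)
systemctl restart systemd-journald-dev-log.socket

# /dev/log should now be the socket itself, not a symlink
ls -l /dev/log

# And a test message should land in the journal
logger "dev-log swap test"
journalctl -n 3 --no-pager
```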
<p>All in all, this works perfectly. Old programs work well (try the &#8220;logger&#8221; command line utility on both sides). It could cause problems if a program expects &#8220;the real thing&#8221; on /run/systemd/journal/dev-log, but that&#8217;s quite unlikely.</p>
<p><em>As a side note, I had this idea to make journald listen on two UNIX domain sockets: Dropping the Symlinks assignment in the original .socket file, and copying it into a new .socket file, setting ListenDatagram to /dev/log. Two .socket files, two UNIX sockets. Sounded like a good idea, only it failed with an error message saying &#8220;Too many /dev/log sockets passed&#8221;.</em></p>
<h3>Running old services</h3>
<p>systemd&#8217;s take on sysV-style services (i.e. those init.d, rcN.d scripts) is that when systemctl is called with reference to a service, it first tries with its native services, and if none is found, it looks for a service of that name in /etc/init.d.</p>
<p>In order to run old services, I wrote a catch-all init.d script, /etc/init.d/oldy-chrooter. It&#8217;s intended to be symlinked to, so it tells which service it should run from the command used to call it, then chroots, and executes the script inside the old system. And guess what, systemd plays along with this.</p>
<p>The script follows. Note that it&#8217;s written in Perl (string manipulations are easier this way), but it carries the standard INFO header, which is required in init scripts.</p>
<pre>#!/usr/bin/<span style="color: #ff0000;"><strong>perl</strong></span>
### BEGIN INIT INFO
# Required-Start:    $local_fs $remote_fs $syslog
# Required-Stop:     $local_fs $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# X-Interactive:     false
# Short-Description: Oldy root wrapper service
# Description:       Start a service within the oldy root
### END INIT INFO

use warnings;
use strict;

my $targetroot = '/oldy-root';

my ($realcmd) = ($0 =~ /\/oldy-([^\/]+)$/);

die("oldy chroot delegation script called with non-oldy command \"$0\"\n")
  unless (defined $realcmd);

chroot $targetroot or die("Failed to chroot to $targetroot\n");

exec("/etc/init.d/$realcmd", @ARGV) or
  die("Failed to execute \"/etc/init.d/$realcmd\" in oldy chroot\n");</pre>
<p>To expose the chroot&#8217;s httpd service, make a symlink in init.d:</p>
<pre># cd /etc/init.d/
# ln -s oldy-chrooter oldy-httpd</pre>
<p>And then enable with</p>
<pre># systemctl enable oldy-httpd
oldy-httpd.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable oldy-httpd</pre>
<p>which indeed runs /lib/systemd/systemd-sysv-install, a shell script, which in turn runs /usr/sbin/update-rc.d with the same arguments. The latter is a Perl script, which analyzes the init.d file and, among other things, parses the INFO header.</p>
<p>The result is the SysV-style generation of S01/K01 symbolic links in /etc/rcN.d. Consequently, it&#8217;s possible to start and stop the service as usual. If the service isn&#8217;t first enabled (or disabled) with systemctl, attempting to start or stop it results in an error message saying the service isn&#8217;t found.</p>
<p>It&#8217;s a good idea to install the same services on the &#8220;main&#8221; system and disable them afterwards. There&#8217;s no risk of overwriting the old root&#8217;s installation, and it allows installing and running programs that depend on these services (which would otherwise complain, based upon the software package database).</p>
<h3>Running programs</h3>
<p>Running stuff inside the chroot should be quick and easy. For this reason, I wrote a small C program which opens a shell within the chroot when called without arguments. With one argument, it executes that command within the chroot. It can be called by a non-root user, and the same user applies inside the chroot.</p>
<p>This is compiled with</p>
<pre>$ gcc oldy.c -o oldy -Wall -O3</pre>
<p>and placed in /usr/local/bin <strong>with setuid root</strong>:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;errno.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;pwd.h&gt;

int main(int argc, char *argv[]) {
  const char jail[] = "/oldy-root/";
  const char newhome[] = "/oldy-root/home/eli/";
  struct passwd *pwd;

  if ((argc!=2) &amp;&amp; (argc!=1)){
    printf("Usage: %s [ command ]\n", argv[0]);
    exit(1);
  }

  pwd = getpwuid(getuid());
  if (!pwd) {
    perror("Failed to obtain user name for current user(?!)");
    exit(1);
  }

  // It's necessary to set the ID to 0, or su asks for password despite the
  // root setuid flag of the executable

  if (setuid(0)) {
    perror("Failed to change user");
    exit(1);
  }

  if (chdir(newhome)) {
    perror("Failed to change directory");
    exit(1);
  }

  if (chroot(jail)) {
    perror("Failed to chroot");
    exit(1);
  }

  // oldycmd and oldyshell won't appear, as they're overridden by su

  if (argc == 1)
    execl("/bin/su", "oldyshell", "-", pwd-&gt;pw_name, (char *) NULL);
  else
    execl("/bin/su", "oldycmd", "-", pwd-&gt;pw_name, "-c", argv[1], (char *) NULL);
  perror("Execution failed");
  exit(1);
}</pre>
<p>Notes:</p>
<ul>
<li>Using setuid root is the number one source of security holes. I&#8217;m not sure I would have this thing on a computer used by strangers.</li>
<li>getpwuid() gets the real user ID (not the effective one, as set by setuid), so the call to &#8220;su&#8221; is made with the original user (even if it&#8217;s root, of course). It will fail if that user doesn&#8217;t exist.</li>
<li>&#8230; but note that the user in the chroot system is then the one having the same <strong>user name</strong> as in the original one, not the same uid. There should be no difference, but watch it if there is (security holes&#8230;?)</li>
<li>I used &#8220;su -&#8221; and not just executing bash for the sake of su&#8217;s &#8220;-&#8221; flag, which sets up the environment. Otherwise, it&#8217;s a mess.</li>
</ul>
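<p>For completeness, this is how a binary like this can be installed with the setuid bit, and what calling it looks like. The rpm command in the last line is just a hypothetical example of something that only makes sense inside the old Fedora root:</p>

```shell
# Install as root, owned by root, with the setuid bit set (mode 4755)
install -o root -g root -m 4755 oldy /usr/local/bin/oldy

# No arguments: a login shell inside the chroot, as the calling user
oldy

# One argument: run a single command inside the chroot
oldy "rpm -q kernel"
```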
<p>It&#8217;s perfectly OK to run GUI programs with this trick. However it becomes extremely confusing on the command line: Is this shell prompt on the old or the new system? To fix this, edit /etc/bashrc <strong>in the chroot system only</strong> to change the prompt. I went for changing the line saying</p>
<pre>[ "$PS1" = "\\s-\\v\\\$ " ] &amp;&amp; PS1="[\u@\h \W]\\$ "</pre>
<p>to</p>
<pre>[ "$PS1" = "\\s-\\v\\\$ " ] &amp;&amp; PS1="\[\e[44m\][\u@chroot \W]\[\e[m\]\\$ "</pre>
<p>so the &#8220;\h&#8221; part, which turns into the host&#8217;s name, now appears as &#8220;chroot&#8221;. But more importantly, the text background of the shell prompt is changed to blue (as opposed to none), so it&#8217;s easy to tell where I am.</p>
<p>If you&#8217;re into playing with the colors, I warmly recommend <a title="Setting PS1 with color codes properly with gnome-terminal" href="https://billauer.se/blog/2018/12/ansi-escape-color-bash-prompt/" target="_blank">looking at this</a>.</p>
<h3>Lifting the user processes limit</h3>
<p>At some point (it took a few months), I started to have failures of this sort:</p>
<pre>$ oldy
oldyshell: /bin/bash: Resource temporarily unavailable</pre>
<p>and even worse, some of the chroot-based utilities also failed sporadically.</p>
<p>Checking with ulimit -a, it turned out that the number of processes owned by my &#8220;regular&#8221; user was limited to 1024. Checking with ps, I had only about 510 processes belonging to that UID, so it&#8217;s not clear why I hit the limit. In the non-chroot environment, the limit is significantly higher.</p>
<p>So edit /etc/security/limits.d/90-nproc.conf (the one inside the jail), changing the line saying</p>
<pre>*          soft    nproc     <strong>1024</strong>
</pre>
<p>to</p>
<pre>*          soft    nproc     <strong>65536</strong></pre>
<p>There&#8217;s no need for a reboot or anything of that sort, but processes that are already running remain under the old limit.</p>
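<p>Whether the new limit has taken effect can be checked from a fresh shell inside the chroot (ulimit is a shell builtin):</p>

```shell
# Current soft limit on the number of user processes
ulimit -Su

# And the hard limit, for comparison
ulimit -Hu
```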
<h3>Desktop icons and wallpaper messup</h3>
<p>This is a seemingly small but annoying thing: When Nautilus is launched from within the old system, it restores the old wallpaper and sets all icons on the desktop. There are suggestions on how to fix this, but they rely on gsettings, which came after Fedora 12. I haven&#8217;t tested it, but the common suggestion is:</p>
<pre><code>$ gsettings set org.gnome.desktop.background show-desktop-icons false</code></pre>
<p>So for old systems like mine, first check the current value:</p>
<pre>$ gconftool-2 --get /apps/nautilus/preferences/show_desktop</pre>
<p>and if it&#8217;s &#8220;true&#8221;, fix it:</p>
<pre>$ gconftool-2 --type bool --set /apps/nautilus/preferences/show_desktop false</pre>
<p>The settings are stored in ~/.gconf/apps/nautilus/preferences/%gconf.xml.</p>
<h3>Setting title in gnome-terminal</h3>
<p>So someone thought that the possibility to set the title of the Terminal window, directly from the GUI, is unnecessary. That happens to be one of the most useful features, if you ask me. I&#8217;d really like to know why they dropped it. Or maybe not.</p>
<p>After some wandering around, and reading suggestions on how to do it in <a href="https://askubuntu.com/questions/22413/how-to-change-gnome-terminal-title" target="_blank">various other ways</a>, I went for the old-new solution: Run the old executable in the new system. Namely:</p>
<pre># cd /usr/bin
# mv gnome-terminal new-gnome-terminal
# ln -s /oldy-root/usr/bin/gnome-terminal</pre>
<p>It was also necessary to install some library stuff:</p>
<pre># apt install libvte9</pre>
<p>But then it complained that it couldn&#8217;t find some terminal.xml file. So</p>
<pre># cd /usr/share/
# ln -s /oldy-root/usr/share/gnome-terminal</pre>
<p>And then I needed to set up the keystroke shortcuts (Copy, Paste, New Tab etc.) but that&#8217;s really no bother.</p>
<h3>Other things to keep in mind</h3>
<ul>
<li>Some users and groups must be migrated manually from the old system to the new one. I always do this when installing a new computer, to make NFS work properly etc., but in this case, some service-related users and groups need to be in sync as well.</li>
<li>Not directly related, but if the IP address of the host changes (which it usually does), set the updated IP address in /etc/sendmail.mc and recompile. Otherwise, expect an error saying &#8220;opendaemonsocket: daemon MTA: cannot bind: Cannot assign requested address&#8221;.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2018/11/linux-chroot-system-in-parallel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>VMplayer: Silencing excessive hard disk activity + getting rid of freezes</title>
		<link>https://billauer.se/blog/2017/05/vmplayer-disk-calm/</link>
		<comments>https://billauer.se/blog/2017/05/vmplayer-disk-calm/#comments</comments>
		<pubDate>Tue, 30 May 2017 14:18:29 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=5231</guid>
<description><![CDATA[The disk is hammering For some unknown reason, possibly after a VMplayer upgrade, running any Windows virtual machine on my Linux machine with VMware Player caused some non-stop heavy hard disk activity, even when the guest machine was effectively idle and had no I/O activity of its own. Except for being surprisingly annoying, it [...]]]></description>
			<content:encoded><![CDATA[<h3>The disk is hammering</h3>
<p>For some unknown reason, possibly after a VMplayer upgrade, running any Windows virtual machine on my Linux machine with VMware Player caused some non-stop heavy hard disk activity, even when the guest machine was effectively idle and had no I/O activity of its own.</p>
<p>Besides being surprisingly annoying, it also made the mouse pointer unresponsive, and the adverse effect extended to the host machine as well.</p>
<div>
<p>So eventually I managed to get things back to normal by editing the virtual machine&#8217;s .vmx file as described below.</p>
<p>I have VMplayer 6.0.2 on Fedora 12 (I suppose both are considered quite old).</p>
<p>Following <a href="https://artykul8.com/2012/06/vmware-performance-enhancing/" target="_blank">this post</a>, add</p>
<pre>isolation.tools.unity.disable = "TRUE"
unity.allowCompositingInGuest = "FALSE"
unity.enableLaunchMenu = "FALSE"
unity.showBadges = "FALSE"
unity.showBorders = "FALSE"
unity.wasCapable = "FALSE"</pre>
<p>(unity.wasCapable was already in the file, so remove it first)</p>
<p>That appeared to help somewhat. But what really gave the punch was also adding</p>
<pre>MemTrimRate = "0"
sched.mem.pshare.enable = "FALSE"
MemAllowAutoScaleDown = "FALSE"</pre>
<p>Don&#8217;t ask me what it means. Your guess is as good as mine.</p>
<h3>The Linux desktop freezes</h3>
<p>Freezes = Cinnamon&#8217;s clock stops advancing for a minute or so. Apparently, it&#8217;s the graphics that doesn&#8217;t update for about 1.5 seconds each time the mouse pointer moves on or off the area belonging to the guest&#8217;s display. But it accumulates, so moving the mouse all over the place while trying to figure out what&#8217;s going on easily stretches this freeze into a whole minute.</p>
<p><del>Just turn off the display&#8217;s hardware acceleration. That is, enter the virtual machine settings the GUI menus, pick the display, and uncheck &#8220;Accelerate 3D graphics&#8221;. Bliss.</del></p>
<p>Nope, it didn&#8217;t help. :(</p>
<p><em><span style="text-decoration: underline;">November 2023 update</span>: Could this be related to <a title="Remapping keyboard keys to allow å, ä and ö" href="https://billauer.se/blog/2023/11/xmodmap-mapping-swedish/" target="_blank">keyboard mapping</a>? I had a similar issue when playing with xmodmap.</em></p>
<p>Also tried to turn off the usage of OpenGL with</p>
<pre>mks.noGL = "TRUE"</pre>
<p>and indeed there was nothing OpenGL related in the log file (vmware.log), but the problem remained.</p>
<p>This command was taken from a <a href="https://www.basvanbeek.nl/linux/undocumented-vmware-vmx-parameters/" target="_blank">list of undocumented parameters</a> (there also <a href="http://www.sanbarrow.com/vmx/vmx-advanced.html" target="_blank">this one</a>).</p>
<p>Upgrading to VMPlayer 15.5.6 didn&#8217;t help. Neither did adding vmmouse.present = &#8220;FALSE&#8221;.</p>
<p>But after the upgrade, my Windows XP guest got horribly slow, and it seemed to have problems accessing the disk as well (upgrading is always a good idea, as we all know). Programs didn&#8217;t seem to launch properly and such. I may have worked around that by setting the VM&#8217;s type to &#8220;Other&#8221; (i.e. not something Windows related). That turns VMware Tools off, and maybe that&#8217;s actually a good idea.</p>
<p>The solution I eventually adopted was to use VMPlayer as a VNC server. So I ignore the emulated display window that VMPlayer opens directly, and connect with a VNC viewer on the local machine instead. Rather odd, but it works. The only annoying thing is that Alt-Tab, Alt-Shift and similar keystrokes aren&#8217;t captured by the guest. To set this up, go to the virtual machine settings &gt; Options &gt; VNC Connections and set it to enabled. If the port number is set to 5901 (i.e. 5900 with an offset of 1), the connection is made with</p>
<pre>$ vncviewer :1 &amp;</pre>
<p>(or pick your other favorite viewer).</p>
<h3>The computer is a slug</h3>
<p>On a newer machine, with 64 GiB of RAM and a more recent version of VMPlayer, it took a few seconds to switch back and forth between the VMPlayer window and anything else. The fix, as root, is:</p>
<pre># echo never &gt; /sys/kernel/mm/transparent_hugepage/defrag
# echo never &gt; /sys/kernel/mm/transparent_hugepage/enabled</pre>
<p>taken from <a href="https://unix.stackexchange.com/questions/161858/arch-linux-becomes-unresponsive-from-khugepaged/185172#185172" target="_blank">here</a>. There&#8217;s still some slight freezes when working on a window that overlaps the VMPlayer window (and other kinds of backs and forths with VMPlayer), but it&#8217;s significantly better this way.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2017/05/vmplayer-disk-calm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
