OpenRC and fsck
Introduction
OpenRC provides a script, /etc/init.d/fsck, that checks the drives and partitions to be mounted.
There is currently no official documentation on the fsck script, on its behaviour, or on how to configure and control it. This page is a placeholder for information that could contribute to comprehensive documentation on OpenRC and fsck.
The script /etc/init.d/fsck is run in all situations: both the root and localmount services in /etc/init.d declare that they need it.
/etc/init.d:
root: need fsck
localmount: need fsck
Troubleshooting
fsck.ext4: Unable to resolve UUID...
Depending on your /etc/fstab, you may see a long list of "Unable to resolve UUID" messages:
* Executing: /lib/rc/sh/openrc-run.sh /lib/rc/sh/openrc-run.sh /etc/init.d/fsck start
* Checking local filesystems ...
fsck.ext4: Unable to resolve 'UUID=...'
fsck.ext4: Unable to resolve 'UUID=...'
fsck.ext4: Unable to resolve 'UUID=...'
...
The cause is straightforward: your fstab contains entries for removable media that are not plugged in at boot time. This is not a problem in itself.
The questions are:
- Should OpenRC attempt to check disks that are not plugged in?
- Should it report failure to resolve such disks, especially when the fstab entry specifies user,noauto?
- Does this slow down the boot procedure?
- How can the automatic check of unplugged removable devices be prevented?
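One documented way to approach the last question is the sixth field of /etc/fstab (fs_passno): according to fstab(5), a missing or zero value tells fsck that the filesystem does not need to be checked. The sketch below lists the entries that carry a non-zero value and flags those that are also marked noauto. It assumes that the OpenRC script follows the usual fsck -A semantics, which this page has not verified; list_checked_entries is a hypothetical helper name.

```shell
#!/bin/sh
# List fstab entries whose sixth field (fs_passno) is non-zero, i.e. the
# entries that "fsck -A" would normally consider according to fstab(5).
# Entries that also carry "noauto" are flagged, since they are likely to be
# removable media. list_checked_entries is a hypothetical helper name.
list_checked_entries() {
    fstab="${1:-/etc/fstab}"
    awk '
        /^[ \t]*(#|$)/ { next }              # skip comments and blank lines
        NF >= 6 && $6 != 0 {                 # missing or zero passno: not checked
            flag = ($4 ~ /(^|,)noauto(,|$)/) ? " [noauto]" : ""
            printf "%s %s passno=%s%s\n", $1, $2, $6, flag
        }
    ' "$fstab"
}
```

Calling list_checked_entries without an argument reads /etc/fstab. Under the assumption above, setting the sixth field of a flagged entry to 0 should keep fsck from trying to check it.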
fsck fails with a non-system critical partition
Bug report: https://bugs.gentoo.org/698072
The main problem here is that OpenRC can fail critically when it cannot check a drive or partition that is not critical for the system to run. The root partition may be fine and mountable, and possibly even the /home/ partition, but if fsck fails on removable media, file storage media or any other non-critical media (i.e. a partition without which the system can run fine), the whole boot process may fail, with OpenRC refusing to properly mount the root filesystem.
In such a situation, the only solution is to boot the machine with a live CD or a rescue bootable removable device, manually mount the system's root device, and manually edit /etc/fstab to remove the offending entry, before rebooting the system normally, and only then deal with the failing media.
The above scenario may happen when fsck fails to repair a given partition.
The way OpenRC handles such non-system-critical problems is not ideal. It makes diagnosis and recovery more difficult than they should be.
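The manual edit of /etc/fstab from a rescue environment, as described above, can be sketched as a small helper that comments out the offending entry instead of deleting it, and keeps a backup copy. This is only an illustration: disable_fstab_entry is a hypothetical name, and the device and mount paths in the usage note are placeholders.

```shell
#!/bin/sh
# Comment out the fstab entry for one mount point, keeping a backup copy
# next to the file. disable_fstab_entry is a hypothetical helper name.
disable_fstab_entry() {
    fstab="$1"
    target="$2"
    cp -p "$fstab" "$fstab.bak" || return 1
    awk -v mp="$target" '
        # Prefix the matching, not yet commented entry with "#"
        $0 !~ /^[ \t]*#/ && $2 == mp { print "#" $0; next }
        { print }
    ' "$fstab.bak" > "$fstab"
}
```

From a live CD, after mounting the system's root device somewhere such as /mnt/rescue (a placeholder path), the call would look like: disable_fstab_entry /mnt/rescue/etc/fstab /media/my_usb_drive. The entry can be restored later by removing the leading "#".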
The most critical problems currently are:
- When fsck fails, the user is dropped to a login shell. The user may have turned the computer on, gone away for a while, and come back to a login shell where they expected to find the usual desktop environment login screen. There is no indication on the screen of why the regular boot process failed.
- The root partition was never mounted in write mode, so nothing was logged and the user cannot investigate anything. They have to reboot and try as best they can to follow the boot output and figure out where it fails.
- The root partition is mounted read-only. The user is expected to know how to remount the root partition in write mode in order to modify /etc/fstab, fix the problem (at least temporarily), and boot the system. That assumes, of course, that the user has already identified what the problem is.
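For the last point, the usual way to switch a read-only root to write mode is mount -o remount,rw /. The sketch below only detects the read-only state by reading a /proc/mounts-style table; root_is_readonly is a hypothetical helper name, and the remount itself is left as a comment because it requires root privileges.

```shell
#!/bin/sh
# Report whether "/" is mounted read-only, reading a /proc/mounts-style
# table (device, mount point, type, options, ...).
# root_is_readonly is a hypothetical helper name.
root_is_readonly() {
    mounts="${1:-/proc/mounts}"
    awk '
        $2 == "/" { opts = $4 }              # the last entry for "/" wins
        END { exit !(opts ~ /(^|,)ro(,|$)/) }  # exit 0 when "ro" is set
    ' "$mounts"
}

# Typical use from the rescue shell (needs root, so it is left commented):
# if root_is_readonly; then mount -o remount,rw /; fi
```

Once the root filesystem is writable, /etc/fstab can be edited in place and the system rebooted.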
There should be two important goals:
- Make sure that the user/administrator is properly aware that a partition failed fsck.
- Allow diagnosis and recovery to be as painless as possible.
The partition may not be system critical, but the data within is probably important to the user. Thus, it is not enough to rely on logging only. At the same time, the system should be able to continue with the boot process, so that the user can straight away access the critical system files like /etc/fstab and deal with the problem.
The best solution would thus probably be to drop directly into interactive mode whenever fsck encounters a critical problem with any drive.
OpenRC should indicate which drive failed fsck and what the error is, and then wait for user input to continue the boot process, leaving the failing drive unmounted but mounting and starting the system itself.
The user is thus properly informed, but can also easily complete the boot sequence and use a fully operational system to deal with the actual problem.
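The proposal above can be illustrated with a minimal sketch. It is not what OpenRC currently does: check_all and the CHECK_CMD variable are hypothetical, the step of waiting for user input is omitted, and the threshold relies on the exit codes documented in fsck(8), where values of 4 and above mean that errors were left uncorrected or that the check could not be carried out.

```shell
#!/bin/sh
# Sketch of the proposed behaviour, NOT of the current OpenRC script:
# check each filesystem separately, report failures, and keep going.
# check_all and CHECK_CMD are hypothetical; CHECK_CMD defaults to
# "fsck -p" and can be replaced by a stub for dry runs.
check_all() {
    failed=""
    for dev in "$@"; do
        rc=0
        ${CHECK_CMD:-fsck -p} "$dev" || rc=$?
        # fsck(8): 0 = clean, 1 = errors corrected, 2 = reboot advised,
        # 4 and above = uncorrected errors or operational failure.
        if [ "$rc" -ge 4 ]; then
            echo "fsck failed for $dev (exit code $rc); leaving it unmounted" >&2
            failed="$failed $dev"
        fi
    done
    # Print the devices that need attention; the boot would continue anyway.
    echo "${failed# }"
}
```

The design choice is that a failure is recorded and reported per device, so one bad or missing filesystem does not decide the outcome for the others.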
fsck fails with unplugged devices
Bug report: https://bugs.gentoo.org/698072
The problem described in this section is the same as the one described in the previous section, but the conditions that trigger it make it all the more unacceptable that the boot process was not allowed to complete.
In this situation, a removable device that was not plugged in, that the user did not intend to plug in, and that is not needed for the system to complete the boot process and run normally, left the system unusable: the boot process was interrupted, the root partition was mounted read-only, and the user was dropped to a root login shell without any indication of what went wrong. Again, the solution was to reboot the system with a live CD or a rescue bootable removable device and manually edit /etc/fstab, which again requires that the user has correctly guessed the root cause of the problem.
The situation is thus: the user has an entry in /etc/fstab for a removable vfat USB stick:
UUID=AAAA-BBBB /media/my_usb_drive vfat user,noauto 0
At the time of booting, the given drive was not plugged in, was not intended to be plugged in, and is not necessary at all for the system to run.
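On a running system, entries like this one can be spotted ahead of time. The sketch below prints the UUID= entries of an fstab file that have no matching symlink under /dev/disk/by-uuid, the directory that udev normally populates for attached block devices. It assumes unquoted UUID= values, and it is not necessarily how fsck itself resolves UUIDs; unresolved_uuids is a hypothetical helper name.

```shell
#!/bin/sh
# Print the UUID= entries of an fstab file that have no matching symlink
# under /dev/disk/by-uuid, i.e. devices that are probably not plugged in.
# unresolved_uuids is a hypothetical helper; the second argument exists
# only so that the lookup directory can be overridden.
unresolved_uuids() {
    fstab="${1:-/etc/fstab}"
    uuid_dir="${2:-/dev/disk/by-uuid}"
    # "UUID=" is five characters long, so the value starts at position 6.
    awk '$1 ~ /^UUID=/ { print substr($1, 6), $2 }' "$fstab" |
    while read -r uuid mnt; do
        [ -e "$uuid_dir/$uuid" ] || echo "$uuid $mnt"
    done
}
```

Each output line gives the UUID and the mount point of an entry whose device is currently absent, which makes it a candidate for the failure described in this section.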
Arriving at a proper diagnosis was difficult because, as in the previous section, nothing is logged: the root filesystem never gets mounted in a way that allows logs to be written. The error message only flashes briefly on the screen before the user is dropped to a login shell. The user was only able to figure out what the problem was after several boot attempts, by making a video recording of the screen output and analysing it. (Thankfully, nowadays, smartphones with video recording capabilities are commonly available!)
fsck failed with the following error message (manually copied from the video recording, since no logs exist):
fsck.fat 4.1 (2017-01-24)
open: No such file or directory.
* Filesystems couldn't be fixed.
* rc: Aborting!
* fsck:caught SIGTERM, aborting.
INIT: Entering runlevel 3