Brad Huntting and David Mertz
Professional Neophyte
August, 2005
Welcome to "Troubleshooting", the eighth of eight tutorials designed to prepare you for LPI exam 201. In large part this tutorial reiterates material already covered in more detail in earlier tutorials, but with an eye to "what to do when things go wrong?"
The Linux Professional Institute (LPI) certifies Linux system administrators at junior and intermediate levels. There are two exams at each certification level. This series of eight tutorials helps you prepare for the first of the two LPI intermediate level system administrator exams--LPI exam 201. A companion series of tutorials is available for the other intermediate level exam--LPI exam 202. Both exam 201 and exam 202 are required for intermediate level certification. Intermediate level certification is also known as certification level 2.
Each exam covers several or topics and each topic has a weight. The weight indicate the relative importance of each topic. Very roughly, expect more questions on the exam for topics with higher weight. The topics and their weights for LPI exam 201 are:
Topic 201: Linux Kernel (5) Topic 202: System Startup (5) Topic 203: Filesystems (10) Topic 204: Hardware (8) Topic 209: File Sharing Servers (8) Topic 211: System Maintenance (4) Topic 213: System Customization and Automation (3) Topic 214: Troubleshooting (6)
Welcome to "Troubleshooting", the eighth of eight tutorials designed to prepare you for LPI exam 201. In large part this tutorial reiterates material already covered in more detail in earlier tutorials, but with an eye to "what to do when things go wrong?"
To get the most from this tutorial, you should already have a basic knowledge of Linux and a working Linux system on which you can practice the commands covered in this tutorial.
Unfortunately, troubleshooting a Linux system that is not working--or more especially, is not entirely working--is probably the most frustrating parts of system administration. Linux is not special in this regard, the same can be said of any operating system or other complex software system. It is relatively straightforward, most of the time, to add new hardware or software, tweak the behavior of installed software, or script and schedule routine operations. But when a running system suddenly doesn't, a system administrator is presented with a dizzying collection of things that might be wrong.
This tutorial cannot solve many real world problems. That takes a good understanding of Linux overall, combined with even more patience and elbow grease. But we will point to tools that are helpful in the
troubleshooting process (most touched on in other tutorials), and
remind readers of some of the best places to look for sources of trouble.
When, in course of system events, a Linux installation gets so messed up that it just cannot boot normally, a good first course of action is to attempt to "boot single-user" to fix the problem. Single user mode (runlevel 1) allows you to access and modify many problems without worrying about permission issues or files being locked by other users and processes. Moreover, single user mode runs with a minimum of services, daemons and tasks that might confound--or even be the cause of--your underlying problem.
In some cases, however, even files needed for single user mode such as
/etc/passwd
, /etc/fstab
, /etc/inittab
, /sbin/init
,
/dev/console
, etc. can become corrupted, and the system will not
boot, even into single-user mode. In times like this, a recovery disk
(a bootable disk which contains the essentials needed to repair a
broken partition, sometimes called a repair disk
) can be used to get
the system back on its feet.
Nowadays, standalone CD (or DVD) distributions such as Knoppix can be used as "Cadillac" recovery disks, sporting a vast array of drivers, debugging tools, and even a web browser. Even if you have special needs such as unusual drivers (e.g. kernel modules) or recovery software, you are usually best off downloading a Linux Live-CD like Knoppix, and customizing the contents of the ISO image before burning a recovery disk. See the tutorial for Topic 203 for information on loop mounting an ISO image. You may obtain and read about Knoppix at http://www.knoppix.org/.
For very old systems without bootable CD-ROMs, floppies containing
only the bare essentials have to be used for recovery. Niceties like
emacs
and vim
are typically too bulky to fit on a single recovery
floppy, so an experienced system administrator should be prepaired to
use a striped down editor like ed
to fix corrupted configuration
files.
To create custom boot floppies, reader should read the Linux Documentation Project's "Linux-Bootdisk HOWTO." In many cases, however, Linux distributions will offer to create a recovery diskettes at installation time, or later using included tools. Be sure to create and store recover disks before you really need them!
The tutorial on Topic 202 contains much more extensive information about Linux boot sequences. We will review the stages in quick summary for this tutorial
The first stage of bootup has changed little since IBM compatible PC's with harddisks where first introduced. The BIOS reads the first sector of the boot disk into memory and runs it. This 512 byte Master Boot Record (MBR), which also stores the fdisk label, turns around and loads the "OS loader" (either GRUB or LILO) from the "active" partition.
Once the system BIOS (on an x86 system, other architectures vary slightly) runs the MBR, several steps happen for the Linux kernel to load. If boot does not succeed, determining where it fails is the first step in figuring out what needs to be fixed.
* Load Boot Loader (LILO/Grub)
* Boot loader starts and hands off control to kernel
* Kernel: Load base kernel and other essential kernel modules.
* Hardware initializiation and setup
Kernel starts Initialize kernel infrastructure (VM, scheduler, etc) Device Drivers Probe and Attach Root filesystem mounted
Assuming a base kernel, kernel modules, and root filesystem start successfully (or at least successfully enough not to freeze completely), the system initialization process starts:
* daemon initialization and setup
/sbin/init started as process 1. /sbin/init reads /etc/inittab /etc/rc<n>.d/S* scripts are run by /etc/init.d/rc disks fsck'ed network interfaces configured daemons started - getty(s) started on console and serial ports
When Grub starts up it flashes "GRUB loading, please wait..." on the
screen and then jumps into its boot menu. LILO usually displays a
LILO:
prompt (but some versions jump directly to a menu). If you do
not reach a LILO or GRUB screen, it is probably time for a separate
recovery disk. Generally, you will know you have this problem with
some kind of BIOS message about not finding a boot sector, or maybe
even not finding a harddisk at all.
If your system cannot locate a usable boot sector, it is possible that
your LILO or GRUB installation became corrupted. Sometimes other
operating systems step on boot sectors (or overzealous "virus
protection programs" on Windows), and occasionally other problems
arise. A recovery disk will let you reinstall LILO or GRUB. Typically
you will want to chroot
to your old root filesystem to utilize your
prior LILO or GRUB configuration files.
In the hardisk-not-found case, hardware failure is likely (and you will hope you made a backup); often in the no-boot-sector case too. See the tutorial on Topic 202 for more information on LILO and GRUB.
LILO is configured with the file /etc/lilo.conf
, or by another
configuration file specified at the command line. To reinstall LILO,
first make sure your lilo.conf
file really matches your current
system configuration (the right partition and disk number information
especially; this can become mixed up if you swap harddisk and/or
change partitions). Once you have verified your configuration, simply
run /sbin/lilo
as root (or in single user mode).
Configuring and reinstalling GRUB is essentially the same process as
for LILO. GRUB's configuration file is /boot/grub/menu.lst
, but it
is read each time the system boots. Likewise, the GRUB shell live at
/boot/grub/
, but you can choose which harddisk the basic GRUB MBR
lives at with the groot=
option in /boot/grub/menu.lst
. To
install GRUB to an MBR, run a command like grub-install /dev/hda
,
which will check the currently mounted /boot/
for its configuration.
Linux distributions vary a bit in where they put files. The Linux Filesystem Hierarchy Standard is making progress in furthering standardization of this. A few standard directories are already well standardized, and are particularly important to look at for startup and runtime problems:
* /proc/ - This is the virtual filesystem where information on
processes and system status. All the guts of a running system live here, in a sense. See Topic 201 for more information.
* /var/log/ - Log files live here. If something goes wrong, changes
are good that helpful information is in some log file here.
* / - Generally the root of the filesystem under Linux simply contains
other nested directories. On some systems bootup files likevmlinuz
andinitrd.img
might be here rather than in/boot/
.
* /boot/ - Files directly used in the kernel boot process.
* /lib/modules/ - Kernel modules live nested under this directory, in
subdirectories named for the current kernel version (if you boot multiple kernel versions, multiple directories should exist). For example:
% ls /lib/modules/2.6.10-5-386/kernel/ arch crypto drivers fs lib net security sound
During actual system boot, messages can scroll by quickly, and you may
not have time to identify problems or unexpected initialization
activities. Some information of interest is likely to be logged with
syslog
, but the basic kernel and kernel module messages can be
examined with the tool dmesg
.
The tutorial for Topic 203 (Hardware) contains more information on hardware identification. Generally, you should keep in mind these system tools:
lspci: List all PCI devices lsmod: List loaded kernel modules lsusb: List USB devices lspnp: List Plug-and-Play devices * lshw: List hardware
Slightly farther from hardware, but still useful:
lsof: List open files insmod: Load kernel module rmmod: Remove kernel module modprobe: Higher level insmod/rmmod/lsmod wrapper uname: Print system information (kernel version, etc) strace: Trace system calls
When you get desperate in trying to use binaries, whether libraries or
applications, the tool strings
can really rescue you (but expect to
work at it). Crucial information like hardcoded paths are sometimes
buried in binaries, and with a good measure of trial-and-error, you
can find them by looking for the strings
within binaries.
As Topic 202 addresses in more detail, the startup of Linux, after the
kernel loads, is controlled by the init
process. The main
controlling configuration for init
is /etc/inittab
. The
/etc/inittab
contains details on what steps to peform at each
runlevel. But perhaps the most crucial thing it does is actually set
the runlevel for later actions. If your system is having troubles
booting, a setting a different runlevel may help the process. The key
line is something like:
# The default runlevel. (in /etc/inittab) id:2:initdefault:
The rc scripts run at boot time, shutdown time, and whenever the
system changes runlevels and are responsible for starting and stoping
the most system daemons. In most (i.e. modern) Linux distributions,
the rc scripts live in the directory /etc/init.d/
and are linked to
the directories /etc/rc<N>.d/
(for N=runlevel) where they are named
S*
for startup scripts and K*
for shutdown scripts. The system
never runs the scripts from /etc/init.d
directly, but looks for them
in /etc/rc<N>.d/[SK]*
.
Rarely, but conceivably, a system administrator may wish to change the
system wide shell startup script /etc/profile
. Such a change affects
every interactive shell (except for /bin/tcsh
users or other
non-'/bin/sh' compatible shells). Corrupting this file can easily lead
to a situation where no one can login, and a recovery disk may be
needed to rectify the situation. The normal method of changing shell
behavior on an individual basis is by modifying
/home/$USER/.bash_profile
and /home/$USER/.bashrc
.
The sysctl
system (see the manpage for sysctl
) was taken from BSD
Unix and is be used to tune some system resources. Run sysctl -a
to
see what variables can be controlled by sysctl
and what they are set
to. The sysctl
utility is most useful to tuning networking
parameters as well as some kernel parameters. Use the file
/etc/sysctl.conf
set sysctl
parameters at boot time.
On most systems, dynamic libraries are constantly being added,
updated, replaced and removed. Since almost every program on the
system needs to find and load dynamic libraries, the names version
numbers and locations of most dynamic libraries are cached by the
ldconfig
program. Normally only the dynamic libraries in the system
directories /lib/
and /usr/lib/
are cached. To add more
directories to the system wide default search path, add the directory
names (such as /usr/X11R6/lib
) to the /etc/ld.so.conf
and run
ldconfig
as root.
Topic 211 discusses syslog
in detail. The key file to keep in mind
if you have problems (and especially if you want to make sure you can
analyze them later), is /etc/syslog.conf
. By changing the contents
of this configuration, you have fine-tuned control over what events
are logged, and where they are logged--including potentially via mail
or remote machines. If problems occur, make sure the subsystems where
you think they arise log information in a manner you can readily
examine forensically.
On almost every Linux system, the cron
daemon is running. See Topic
213 for more details on cron
and crontab
. In general a source of
potential problems to keep in mind is scripts that are run
intermittently. It is possible that what you believe to be a kernel
or application issue results from unrelated scripts that are run
behind the scenes by cron
.
An extreme measure is to kill cron
altogether. You can find its
process nubmer with ps ax | grep cron
, and kill
can terminate it.
A less drastic measure is to edit /etc/crontab
to run a more
conservative collection of tasks; also be sure to contrain the user
tasks that are scheduled by editing /etc/cron.allow
and
/etc/cron.deny
. Even though users usually should not have enough
permissions to cause system-wide problems, a good first step is to
temporarily disable user crontab
configurations, and see if that
resolves problems.