Tutorial For Lpi Exam 202: Part 4

Topic 208: Web Services


David Mertz, Ph.D.
Professional Neophyte
November, 2005

Welcome to "Web Services", the fourth of seven tutorials covering intermediate network administration on Linux. In this tutorial, we discuss how to configure and run the Apache HTTPd server, and some ancillary servers like the Squid Web Proxy Cache.

Before You Start

About this series

The Linux Professional Institute (LPI) certifies Linux system administrators at junior and intermediate levels. There are two exams at each certification level. This series of seven tutorials helps you prepare for the second of the two LPI intermediate level system administrator exams--LPI exam 202. A companion series of tutorials is available for the other intermediate level exam--LPI exam 201. Both exam 201 and exam 202 are required for intermediate level certification. Intermediate level certification is also known as certification level 2.

Each exam covers several or topics and each topic has a weight. The weight indicate the relative importance of each topic. Very roughly, expect more questions on the exam for topics with higher weight. The topics and their weights for LPI exam 202 are:

Topic 205: Network Configuration (8) Topic 206: Mail and News (9) Topic 207: Domain Name System (DNS) (8) Topic 208: Web Services (6) Topic 210: Network Client Management (6) Topic 212: System Security (10) * Topic 214: Network Troubleshooting (1)

About this tutorial

Welcome to "Web Services", the fourth of seven tutorials covering intermediate network administration on Linux. In this tutorial, we discuss how to configure and run the Apache HTTPd server, and some ancillary servers like the Squid Web Proxy Cache. It is worth noting some of what is not covered in this tutorial: designing and modifying HTML pages; writing CGI scripts; analyzing security issues (beyond some very basics); accessing backend databases; and generally everything about "web programming". For this tutorial, we just want you to learn how to get a web server running, not how to provide useful content on that web server.

Prerequisites

To get the most from this tutorial, you should already have a basic knowledge of Linux and a working Linux system on which you can practice the commands covered in this tutorial.

About Apache

Apache is the predominant web server on the internet as a whole, and is even more predominant when only Linux servers are under discussion. A few more special purpose--and occassionally higher performance for specific tasks--web servers are available, but Apache is nonetheless always your default choice. Apache comes pre-installed by most Linux distributions, and in fact is often already running after being launched during initialization, even if you have not specifically configured a web server. If Apache is not installed, use the normal installation system of your distribution to install it; or download the latest source from <http://httpd.apache.org/download.cgi>. Many extra capabilities are provided by modules, many distributed with Apache itself, others available from third parties.

The latest Apache has been at the 2.x level since 2001, however, Apache 1.3.x is still in widespread usage, and the 1.3.x series continues to be maintained for bugfixes and security updates. Some minor configuration difference exist between 1.3 and 2.x version, and a few modules are available for 1.3 that are not available for 2.x. The latest releases as of this writing are 1.3.34 (stable), 2.0.55 (stable) and 2.1.9 (beta).

As a rule, a new server should use the latest stable version in the 2.x series. Unless you have a specific need for an unusual older module, 2.x provides good stability, more capabilities, and overall mostly better peformance (in some tasks, such as in PHP support, 1.3 still peforms better). Moving forward, new features will certainly be better supported in 2.x than in 1.3.x.

About Sqid

Squid is a proxy caching server for web clients, supporting the protocols HTTP, FTP, TLS, SSL, and HTTPS. By running a cache on a local network, or at least closer to your network than the resources queries, speed can be improved and network bandwidth reduced. When the same resource is requested multiple times by machines served by the same Squid server, the resources is delivered from a server-local copy rather than requiring the request go out over multiple network routers, and to potentially slow or overloaded destination servers.

It is possible to configure Squid either as an explicit proxy that must be configured in each web client (browser), or to intercept all web requests out of a LAN and cache all such traffic. Squid may be configured with various options about how long, and under what conditions, to keep pages cached.

Other resources

As with most Linux tools, it is always useful to examine the manpages for any utilities discussed. Versions and switches might change between utility or kernel version, or with different Linux distributions. For more in depth information, the Linux Documentation Project has a variety of useful documents, especially its HOWTOs. See http://www.tldp.org/. A variety of books on Linux networking have been published; I have found O'Reilly's TCP/IP Network Administration, by Craig Hunt to be quite helpful (find whatever edition is most current when you read this).

A large number of good books have been written on working with Apache. Some are concerned with general administration, while others cover particular modules or special configurations of Apache. Check your local bookseller for a range of available titles. Out of the dozens of titles available, none stand out to this writer for special mention, though many of them are quite excellent.

Running Apache

A swarm of daemons

Launching Apache is similar to launching any other daemon. Usually you wish to put its launch in your system initialization scripts, but in principle you may launch Apache at any time. On most systems, the Apache server is called httpd, though it may be called apache2 instead. The server is probably installed in /usr/sbin/, but other locations are possible, depending on distribution and how you installed the server.

Most of the time you will launch Apache with no options, but the -d serverroot and -f config options are worth keeping in mind. The first lets you specify a location on the local disks where content will be served from; the second lets you specify a non-default configuration file. A configuration file may override the -f option using the ServerRoot directive (most do). By default, configuration files are either apache2.conf or httpd.conf, depending on compilation options; these files might live at /etc/apache2/, /etc/apache/, /etc/httpd/conf/, /etc/httpd/apache/conf or a few other locations, depending on version, Linux distribution, and how you installed of compiled Apache. Checking man apache2 or man httpd should give you system-specific details.

The Apache daemon is unusual when compared with other servers in that it usually creates several running copies of itself. The primary copy simply spawns the others, while these secondary copies service the actual incoming requests. The goal in having multiple running copies is to act as a "pool" for requests that may arrive in bundles; additional copies of the daemon are launched as needed, according to several configuration parameters. The primary copy usually runs as root, but the other copies as a more restricted user for security reasons. E.g.:

# ps axu | grep apache2 root 6620 Ss Nov12 0:00 /usr/sbin/apache2 -k start -DSSL www-data 6621 S Nov12 0:00 /usr/sbin/apache2 -k start -DSSL www-data 6622 Sl Nov12 0:00 /usr/sbin/apache2 -k start -DSSL www-data 6624 Sl Nov12 0:00 /usr/sbin/apache2 -k start -DSSL dqm 313 S+ 03:44 0:00 man apache2 root 637 S+ 03:59 0:00 grep apache2

On many systems, the restricted user will be nobody. In the example above it is www-data.

Including configuration files

As mentioned in the last panel, the behavior of Apache is affected by directives in its configuration file. For Apache2 systems, the main configuration file is likely to reside at /etc/apache2/apache2.conf; but often this file will contain multiple "Include" statements to add configuration information from other files, possibly by wildcard pattern. Overall, an Apache configuration is likely to contain hundreds of directives and options (most not specifically documented in this tutorial).

I few files are particularly like to be included. You might see httpd.conf for "user" settings, and to utilize prior Apache 1.3 configuration files that use that name. Virtual hosts are typically specified in separate configuration files, matched on a wildcard, e.g.:

# Include the virtual host configurations:
Include /etc/apache2/sites-enabled/[^.#]*

With Apache 2.x, modules are typically specified in separate configuration files as well (more often in the same file in 1.3.x). For example, a system of mine includes:

From '/etc/apache2/apache2.conf'

# Include module configuration:
Include /etc/apache2/mods-enabled/*.load
Include /etc/apache2/mods-enabled/*.conf

Actually using a module in a running Apache requires two steps in the configuration file, both loading it and enabling it, e.g.:

Loading an option Apache module

# cat /etc/apache2/mods-enabled/userdir.load
LoadModule userdir_module /usr/lib/apache2/modules/mod_userdir.so
# cat /etc/apache2/mods-enabled/userdir.conf
<IfModule mod_userdir.c>
    UserDir public_html
    UserDir disabled root

    <Directory /home/*/public_html>
        AllowOverride FileInfo AuthConfig Limit
        Options MultiViews Indexes SymLinksIfOwnerMatch IncludesNoExec
    </Directory>
</IfModule>

The wildcards in the Include lines will insert all the .load and .conf files in the /etc/apache2/mods-enabled/ directory.

One thing to notice here is a general pattern: basic directives are one line commands with some options; more complex directives nest commands using an XML-like open/close tag. You just have to know for each directive whether it is one-line or open/close style, you cannot choose among styles at will.

Log files

An important class of configuration directives concern logging of Apache operations. A lot of different types of information, and degree of detail, can be maintained for Apache operations. An error log is always a good thing to keep, and you can specify it with the single directive:

# Global error log.
ErrorLog /var/log/apache2/error.log

Other logs of server access, referrers, of other information, can be customized to fit your individual setup. A logging operation is configured with two directives. First a LogFormat directive uses a set of special variables to specify what goes in the log file; second a CustomLog directive tells Apache to actually record events in the specified format. An unlimited number of formats may be specified, whether or not each one is actually used. This allows you to switch on and off logging details based on evolving needs.

Variables in a LogFormat are similar to shell variables, but with a leading %. Some variables have single-letter while others have long names surrounded by brackets. For example:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog /var/log/apache2/referer_log combined

Consult a book or full Apache documentation for the list of variables. Commonly used ones include %h for IP address of requesting client, %t for datetime of the request, %>s for HTTP status code, and the mispelled %{Referer} for the referring site that led to the served page.

The name used in the LogFormat and CustomLog directives is arbitrary. In the example, the name combined was used, but it could just as well be myfoobarlog. However, a few names are commonly used and come with sample configuration files, such as combined, common, referer, agent. These specific formats are typically supported directly by log-analyzer tools.

Virtual hosts, multi-homing, and per-directory options

Individual directories served by an Apache server may have their own configuration options. However, the main configuration may limit which options can be configured locally. If per-directory configuration is desired, use the AccessFileName directive, and typically specify the local configuration filename of .htaccess. The limitations of local configuration are specified within a <Directory> directive. For example:

Example of Directory directive

#Let's have some Icons, shall we?
Alias /icons/ "/usr/share/apache2/icons/"
<Directory "/usr/share/apache2/icons">
      Options Indexes MultiViews
      AllowOverride None
      Order allow,deny
      Allow from all
</Directory>

Often working in conjunction with per-directory options, Apache can service "virtual hosts". What this means is that multiple domain names may be served from the same Apache process, each accessing an appropriate directory. Defining virtual hosts is done with the <VirtualHost> directive. This may be done by placing configuration files in an included directory, such as /etc/apache2/sites-enabled/, or it may be contained directly in a main configuration file. In use, you might specify, e.g.:

Configuring virtual hosts

<VirtualHost "foo.example.com">
    ServerAdmin [email protected]
    DocumentRoot /var/www/foo
    ServerName foo.example.com
    <Directory /var/www/foo>
          Options Indexes FollowSymLinks MultiViews
          AllowOverride None
          Order allow,deny
          allow from all
    </Directory>
    ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
    <Directory "/usr/lib/cgi-bin">
          AllowOverride None
          Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
          Order allow,deny
          Allow from all
    </Directory>
    CustomLog /var/log/apache2/foo_access.log combined
</VirtualHost>
<VirtualHost "bar.example.org">
    DocumentRoot /var/www/bar
    ServerName bar.example.org
</VirtualHost>
<VirtualHost *>
    DocumentRoot /var/www
</VirtualHost>

The final * option picks up any HTTP requests that are not directed to one of the explicitly specified names (e.g. addressed by IP address, or as an unspecified symbolic domain which also resolves to the server machine). For virtual domains to work, DNS must define each alias with a CNAME record.

Multi-homed servers sound similar to virtual hosting, but the concept is different. Using multi-homing, you may configure which IP addresses that a machine is connected to allow web requests. For example, you might provide HTTP access only to the local LAN, but not to the outside world. If you specify an address to listen on, you may also indicate a non-default port. The default value for BindAddress is *, which means to accept requests on every IP address under which the server may be reched. A mixed example might look like:

Configuring multi-homing

BindAddress 192.168.2.2
Listen 192.168.2.2:8000
Listen 64.41.64.172:8080

In this case, we will accept all client requests from the local LAN (i.e. that use the 192.168.2.2 address) on the default port 80, and on the special port 8000. This Apache installation will also monitor client HTTP requests from the WAN address, but only on port 8080.

Limiting web access

You may enable per-directory server access with the Order, Allow from and Deny from commands, within a <Directory> directive. Denied or allowed addresses may be specified by full or partial hostnames or IP addresses. Order lets you give precedence between the accept list and the deny list.

In many cases, you would like more fine-tuned control than simply allowing particular hosts to access your web server. To enable user login requirements, you may use the Auth* family of commands, again within the <Directory> directive. For example, to setup Basic Authentication, you might use a directive like:

Configuring Basic Authentication

<Directory "/var/www/baz">
    AuthName "Baz"
    AuthType Basic
    AuthUserFile /etc/apache2/http.passwords
    AuthGroupFile /etc/apache2/http.groups
    Require john jill sally bob
</Directory>

You may also specify Basic Authentication within a per-directory .htaccess file. Digest Authentication is more secure than Basic, but is less widely implemented in browsers. However, the weakness of Basic (that it transmits passwords in clear-text) is better solved with an SSL layer anyway.

Support for SSL encryption of web traffic is provided by the module mod_ssl. When SSL is used, data transmitted between server and client is encrypted with a dynamically negotiated password that is resistent to interception. All major browsers support SSL. For more information on configuring Apache 2.x with mod_ssl see <http://httpd.apache.org/docs/2.0/mod/mod_ssl.html>.

Running Squid

Installing and running Squid

In most distributions, you should be able to install Squid using the normal installation procedures. You may obtain the source verion of Squid from <http://www.squid-cache.org/>. Building from source uses the basic ./configure; make; make install sequence.

Once installed, you may simply run, as root, /usr/sbin/squid (or whatever location your distribution uses, perhaps /usr/local/sbin/). Of course, to do much useful, you will want to configure the Squid configuration file at /etc/squid/squid.conf, /usr/local/squid/etc/squid.conf, or wherever precisely your system locates squid.conf. As with almost all daemons, you may use a different configuration file, in this case with the -f option.

Ports, IP addresses, http_access and ACLs

The most important configuration option for Squid is the http_port options you select. You may monitor whichever ports you wish, optionally attaching each one to a particular IP address or hostname. The default port for Squid is 3128, allowing any IP address that connects to the Squid server. To cache only for a LAN, you want to specify the local IP address instead, e.g.:

# default (disabled)
# http_port 3128
# LAN only
http_port 192.168.2.2:3128

You may also enable caching via other Squid servers using the icp_port and htcp_port. The IPC and HTCP protocols are used for caches to communicate between themselves, rather than by web servers and clients themselves. To cache multicasts, use mcast_groups.

To let clients connect to your Squid server, you need to give them permission to do so. Unlike a Web server, Squid is typically not entirely generous with its resources. In the simple case, we can just use a couple subnet/netmask or CIDR (Classless Internet Domain Routing) patterns to control permissions, e.g.:

Simple Squid access permissions

http_access deny 10.0.1.0/255.255.255.0
http_access allow 10.0.0.0/8
icp_access allow 10.0.0.0/8

The acl directive can be used to name access control lists (ACLs). You can name src ACLs that simply specify address ranges as in the above example; but you can also create other types of ACLs. For example:

Fine tuned access permissions

acl mynetwork     src             192.168/16
acl asp           urlpath_regex   \.asp$
acl bad_ports     port            70 873
acl javascript    rep_mime_type -i ^application/x-javascript$
# what HTTP access to allow classes
http_access deny asp          # don't cache active server pages
http_access deny bad_ports    # don't cache gopher or rsync
http_access deny javascript   # don't cache javascript content
http_access allow mynetwork   # allow LAN everything not denied

This example gives only a small subset of the available ACL types. See a sample squid.conf for example of many others. Or take a look at the documentation at <http://squid-docs.sourceforge.net/latest/html/c1389.html>. In this case, we decide not to cache URLs that end with .asp (probably dynamic content), not to cache ports 70 and 873, and not to cache returned javascript objects. Other than what is denied, machines on the LAN (the /16 range given) will have all their requests cached. Notice that each ACL defined has a unique, but arbitrary, name (use names that make sense, but the names are not reserved).

Caching modes

The simplest way to run Sqid is in proxy mode. If you do this, clients will need to be explicitly configured to use the cache. Web browser clients have configuration screens that allow them to specify a proxy address and port rather than a direct HTTP connection. This setup makes configuring Squid very simple, but makes clients do some setup work if they want to benefit from the Squid cache.

You may also configure Squid to run as a transparent cache. To do this you need to either configure policy based routing (outside of Squid itself, using ipchains or ipfilter) or use your Squid server as a gateway. Assuming you can direct external requests via the Squid server, Squid needs to be configured as follows. You may need to recompile Squid with the --enable-ipf-transparent option; however, in most Linux installations this should already be fine. To configure the server for transparent caching (once it gets the redirected packets), add something like the below to your squid.conf:

httpd_accel_host virtual
httpd_accel_port 80
httpd_accel_with_proxy on
httpd_accel_uses_host_header on