TUTORIAL FOR LPI EXAM 202: part 4
Topic 208: Web Services
David Mertz, Ph.D.
Professional Neophyte
November, 2005

Welcome to "Web Services", the fourth of seven tutorials covering
intermediate network administration on Linux. In this tutorial, we discuss
how to configure and run the Apache HTTPd server, and some ancillary
servers like the Squid Web Proxy Cache.

BEFORE YOU START
------------------------------------------------------------------------

About this series

The Linux Professional Institute (LPI) certifies Linux system
administrators at junior and intermediate levels. There are two exams at
each certification level. This series of seven tutorials helps you prepare
for the second of the two LPI intermediate level system administrator
exams--LPI exam 202. A companion series of tutorials is available for the
other intermediate level exam--LPI exam 201. Both exam 201 and exam 202
are required for intermediate level certification. Intermediate level
certification is also known as certification level 2.

Each exam covers several topics, and each topic has a weight. The weights
indicate the relative importance of each topic. Very roughly, expect more
questions on the exam for topics with higher weight. The topics and their
weights for LPI exam 202 are:

* Topic 205: Network Configuration (8)
* Topic 206: Mail and News (9)
* Topic 207: Domain Name System (DNS) (8)
* Topic 208: Web Services (6)
* Topic 210: Network Client Management (6)
* Topic 212: System Security (10)
* Topic 214: Network Troubleshooting (1)

About this tutorial

Welcome to "Web Services", the fourth of seven tutorials covering
intermediate network administration on Linux. In this tutorial, we discuss
how to configure and run the Apache HTTPd server, and some ancillary
servers like the Squid Web Proxy Cache.

It is worth noting some of what is -not- covered in this tutorial:
designing and modifying HTML pages; writing CGI scripts; analyzing
security issues (beyond some very basics); accessing backend databases;
and generally everything about "web programming". For this tutorial, we
just want you to learn how to get a web server running, not how to provide
useful content on that web server.

Prerequisites

To get the most from this tutorial, you should already have a basic
knowledge of Linux and a working Linux system on which you can practice
the commands covered in this tutorial.

About Apache

Apache is the predominant web server on the internet as a whole, and is
even more predominant when only Linux servers are under discussion. A few
more special-purpose--and occasionally higher performance for specific
tasks--web servers are available, but Apache is nonetheless almost always
the default choice. Apache comes pre-installed on most Linux
distributions, and in fact is often already running after being launched
during initialization, even if you have not specifically configured a web
server. If Apache is not installed, use the normal installation system of
your distribution to install it, or download the latest source from
http://httpd.apache.org/. Many extra capabilities are provided by modules:
many distributed with Apache itself, others available from third parties.

Apache has been at the 2.x level since 2001; however, Apache 1.3.x is
still in widespread use, and the 1.3.x series continues to be maintained
for bugfixes and security updates. Some minor configuration differences
exist between the 1.3 and 2.x versions, and a few modules are available
for 1.3 that are not available for 2.x.
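If you are not sure what is already installed, the server binary itself
reports its version and its statically compiled modules. A quick check (a
sketch only: it assumes the binary is named 'apache2', as on Debian-style
systems; substitute 'httpd' where that is the name used):

  # Report the server version and build date
  apache2 -v
  # List the modules compiled statically into the server binary
  apache2 -l

Dynamically loaded modules do not appear in the '-l' listing; those are
pulled in with 'LoadModule' directives, discussed later in this tutorial.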
The latest releases as of this writing are 1.3.34 (stable), 2.0.55
(stable) and 2.1.9 (beta). As a rule, a new server should use the latest
stable version in the 2.x series. Unless you have a specific need for an
unusual older module, 2.x provides good stability, more capabilities, and
better overall performance (for a few tasks, such as PHP support, 1.3
still performs better). Moving forward, new features will certainly be
better supported in 2.x than in 1.3.x.

About Squid

Squid is a proxy caching server for web clients, supporting the protocols
HTTP, FTP, TLS, SSL, and HTTPS. By running a cache on a local network, or
at least closer to your network than the resources being queried, speed
can be improved and network bandwidth reduced. When the same resource is
requested multiple times by machines served by the same Squid server, the
resource is delivered from a server-local copy rather than requiring the
request to go out over multiple network routers to potentially slow or
overloaded destination servers. It is possible to configure Squid either
as an explicit proxy that must be configured in each web client (browser),
or to intercept all web requests leaving a LAN and cache all such traffic.
Squid may be configured with various options about how long, and under
what conditions, to keep pages cached.

Other resources

As with most Linux tools, it is always useful to examine the manpages for
any utilities discussed. Versions and switches might change between
utility or kernel versions, or with different Linux distributions. For
more in-depth information, the Linux Documentation Project has a variety
of useful documents, especially its HOWTOs. See http://www.tldp.org/. A
variety of books on Linux networking have been published; I have found
O'Reilly's _TCP/IP Network Administration_, by Craig Hunt, to be quite
helpful (find whatever edition is most current when you read this).

A large number of good books have been written on working with Apache.
Some are concerned with general administration, while others cover
particular modules or special configurations of Apache. Check your local
bookseller for a range of available titles. Out of the dozens of titles
available, none stands out to this writer for special mention, though many
of them are quite excellent.

RUNNING APACHE
------------------------------------------------------------------------

A swarm of daemons

Launching Apache is similar to launching any other daemon. Usually you
will want to put its launch in your system initialization scripts, but in
principle you may launch Apache at any time. On most systems, the Apache
server is called 'httpd', though it may be called 'apache2' instead. The
server is probably installed in '/usr/sbin/', but other locations are
possible, depending on distribution and how you installed the server.

Most of the time you will launch Apache with no options, but the
'-d serverroot' and '-f config' options are worth keeping in mind. The
first lets you specify the server root, the directory on the local disks
under which the server keeps its configuration and log files; the second
lets you specify a non-default configuration file. A configuration file
may override the '-d' option using the 'ServerRoot' directive (most do).
By default, the configuration file is either 'apache2.conf' or
'httpd.conf', depending on compilation options; these files might live at
'/etc/apache2/', '/etc/apache/', '/etc/httpd/conf/',
'/etc/httpd/apache/conf', or a few other locations, depending on version,
Linux distribution, and how you installed or compiled Apache.
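As a concrete sketch (the paths and the 'apache2ctl' name below are
assumptions based on a Debian-style layout; substitute 'apachectl' and
your own paths where they differ), launching the server by hand against an
explicit configuration file might look like:

  # Start the server, naming the configuration file explicitly
  /usr/sbin/apache2 -f /etc/apache2/apache2.conf
  # Most installations also provide a control script for the same job
  apache2ctl configtest    # check configuration syntax without starting
  apache2ctl start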
Checking 'man apache2' or 'man httpd' should give you system-specific
details.

The Apache daemon is unusual compared with other servers in that it
usually creates several running copies of itself. The primary copy simply
spawns the others, while these secondary copies service the actual
incoming requests. The goal in having multiple running copies is to act as
a "pool" for requests that may arrive in bundles; additional copies of the
daemon are launched as needed, according to several configuration
parameters. The primary copy usually runs as 'root', but the other copies
run as a more restricted user for security reasons. E.g.:

  # ps axu | grep apache2
  root      6620  Ss  Nov12  0:00  /usr/sbin/apache2 -k start -DSSL
  www-data  6621  S   Nov12  0:00  /usr/sbin/apache2 -k start -DSSL
  www-data  6622  Sl  Nov12  0:00  /usr/sbin/apache2 -k start -DSSL
  www-data  6624  Sl  Nov12  0:00  /usr/sbin/apache2 -k start -DSSL
  dqm        313  S+  03:44  0:00  man apache2
  root       637  S+  03:59  0:00  grep apache2

On many systems, the restricted user will be 'nobody'. In the example
above it is 'www-data'.

Including configuration files

As mentioned in the last panel, the behavior of Apache is affected by
directives in its configuration file. For Apache 2 systems, the main
configuration file is likely to reside at '/etc/apache2/apache2.conf';
but often this file will contain multiple "Include" statements that add
configuration information from other files, possibly by wildcard pattern.
Overall, an Apache configuration is likely to contain hundreds of
directives and options (most not specifically documented in this
tutorial).

A few files are particularly likely to be included. You might see
'httpd.conf' for "user" settings, and to utilize prior Apache 1.3
configuration files that use that name. Virtual hosts are typically
specified in separate configuration files, matched on a wildcard, e.g.:

  # Include the virtual host configurations:
  Include /etc/apache2/sites-enabled/[^.#]*

With Apache 2.x, modules are typically specified in separate configuration
files as well (more often in the same file in 1.3.x). For example, a
system of mine includes:

  #------------- From '/etc/apache2/apache2.conf' -----------------#
  # Include module configuration:
  Include /etc/apache2/mods-enabled/*.load
  Include /etc/apache2/mods-enabled/*.conf

Actually using a module in a running Apache requires two steps in the
configuration, both loading it and enabling it, e.g.:

  #--------------- Loading an optional Apache module --------------#
  # cat /etc/apache2/mods-enabled/userdir.load
  LoadModule userdir_module /usr/lib/apache2/modules/mod_userdir.so
  # cat /etc/apache2/mods-enabled/userdir.conf
  UserDir public_html
  UserDir disabled root
  <Directory /home/*/public_html>
      AllowOverride FileInfo AuthConfig Limit
      Options MultiViews Indexes SymLinksIfOwnerMatch IncludesNoExec
  </Directory>

The wildcards in the 'Include' lines will insert all the '.load' and
'.conf' files in the '/etc/apache2/mods-enabled/' directory. One thing to
notice here is a general pattern: basic directives are one-line commands
with some options; more complex directives nest commands inside an
XML-like open/close tag, such as the '<Directory>' block above. You just
have to know for each directive whether it is one-line or open/close
style; you cannot choose between the styles at will.

Log files

An important class of configuration directives concerns logging of Apache
operations. Many different types of information, at varying degrees of
detail, can be maintained about Apache operations. An error log is always
a good thing to keep, and you can specify it with a single directive:

  # Global error log.
  ErrorLog /var/log/apache2/error.log

Other logs of server accesses, referrers, or other information can be
customized to fit your individual setup. A logging operation is configured
with two directives. First, a 'LogFormat' directive uses a set of special
variables to specify what goes into the log file; second, a 'CustomLog'
directive tells Apache to actually record events in the specified format.
An unlimited number of formats may be specified, whether or not each one
is actually used. This allows you to switch logging details on and off as
needs evolve.

Variables in a 'LogFormat' are similar to shell variables, but with a
leading '%'. Some variables have single-letter names, while others have
long names enclosed in braces. For example:

  LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
  CustomLog /var/log/apache2/referer_log combined

Consult a book or the full Apache documentation for the list of variables.
Commonly used ones include '%h' for the IP address of the requesting
client, '%t' for the date and time of the request, '%>s' for the HTTP
status code, and the misspelled '%{Referer}i' for the referring site that
led to the served page.

The name used in the 'LogFormat' and 'CustomLog' directives is arbitrary.
In the example, the name 'combined' was used, but it could just as well be
'myfoobarlog'. However, a few names are commonly used and come with sample
configuration files, such as 'combined', 'common', 'referer', and 'agent'.
These specific formats are typically supported directly by log-analyzer
tools.

Virtual hosts, multi-homing, and per-directory options

Individual directories served by an Apache server may have their own
configuration options. However, the main configuration may limit which
options can be configured locally. If per-directory configuration is
desired, use the 'AccessFileName' directive, typically specifying the
local configuration filename '.htaccess'. The limitations on local
configuration are specified within a '<Directory>' directive. For example:

  #----------------- Example of Directory directive ---------------#
  # Let's have some Icons, shall we?
  Alias /icons/ "/usr/share/apache2/icons/"
  <Directory "/usr/share/apache2/icons">
      Options Indexes MultiViews
      AllowOverride None
      Order allow,deny
      Allow from all
  </Directory>

Often working in conjunction with per-directory options, Apache can serve
"virtual hosts". What this means is that multiple domain names may be
served by the same Apache process, each accessing an appropriate
directory. Defining virtual hosts is done with the '<VirtualHost>'
directive. This may be done by placing configuration files in an included
directory, such as '/etc/apache2/sites-enabled/', or it may be contained
directly in a main configuration file. In use, you might specify, e.g.:

  #------------------ Configuring virtual hosts -------------------#
  <VirtualHost foo.example.com>
      ServerAdmin webmaster@foo.example.com
      DocumentRoot /var/www/foo
      ServerName foo.example.com
      <Directory /var/www/foo/>
          Options Indexes FollowSymLinks MultiViews
          AllowOverride None
          Order allow,deny
          allow from all
      </Directory>
      ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
      <Directory "/usr/lib/cgi-bin">
          AllowOverride None
          Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
          Order allow,deny
          Allow from all
      </Directory>
      CustomLog /var/log/apache2/foo_access.log combined
  </VirtualHost>

  <VirtualHost bar.example.org>
      DocumentRoot /var/www/bar
      ServerName bar.example.org
  </VirtualHost>

  <VirtualHost *>
      DocumentRoot /var/www
  </VirtualHost>

The final '*' option picks up any HTTP requests that are not directed to
one of the explicitly specified names (e.g. addressed by IP address, or by
an unspecified symbolic domain that also resolves to the server machine).
For virtual domains to work, DNS must define each alias with a CNAME
record.
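As a sketch of that DNS side (a hypothetical zone-file fragment, reusing
the names from the example above; Apache itself never reads this), the
extra name simply has to resolve to the web server's address:

  ; Hypothetical fragment of the example.com zone
  www   IN  A      64.41.64.172
  foo   IN  CNAME  www.example.com.

A name in a different domain, such as 'bar.example.org' above, needs an
equivalent record in its own zone.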
Multi-homed servers sound similar to virtual hosting, but the concept is
different. Using multi-homing, you configure which of the IP addresses a
machine is connected to will accept web requests. For example, you might
provide HTTP access only to the local LAN, but not to the outside world.
If you specify an address to listen on, you may also indicate a
non-default port. The default value for 'BindAddress' is '*', which means
to accept requests on every IP address under which the server may be
reached. A mixed example might look like:

  #------------------- Configuring multi-homing -------------------#
  BindAddress 192.168.2.2
  Listen 192.168.2.2:8000
  Listen 64.41.64.172:8080

In this case, we will accept all client requests from the local LAN (i.e.
those addressed to 192.168.2.2) on the default port 80, and on the special
port 8000. This Apache installation will also honor client HTTP requests
addressed to the WAN address, but only on port 8080.

Limiting web access

You may enable per-directory access control with the 'Order', 'Allow from'
and 'Deny from' commands, within a '<Directory>' directive. Denied or
allowed addresses may be specified by full or partial hostnames or IP
addresses. 'Order' lets you set the precedence between the allow list and
the deny list.

In many cases, you would like more fine-tuned control than simply allowing
particular hosts to access your web server. To enable user login
requirements, you may use the 'Auth*' family of commands, again within a
'<Directory>' directive. For example, to set up Basic Authentication, you
might use directives like:

  #--------------- Configuring Basic Authentication ---------------#
  AuthName "Baz"
  AuthType Basic
  AuthUserFile /etc/apache2/http.passwords
  AuthGroupFile /etc/apache2/http.groups
  Require user john jill sally bob

You may also specify Basic Authentication within a per-directory
'.htaccess' file. Digest Authentication is more secure than Basic, but it
is less widely implemented in browsers. However, the weakness of Basic
Authentication (that it transmits passwords in clear text) is better
addressed with an SSL layer anyway. Support for SSL encryption of web
traffic is provided by the module 'mod_ssl'. When SSL is used, data
transmitted between server and client is encrypted with a dynamically
negotiated key that is resistant to interception. All major browsers
support SSL. For more information on configuring Apache 2.x with
'mod_ssl', see the mod_ssl documentation.

RUNNING SQUID
------------------------------------------------------------------------

Installing and running Squid

In most distributions, you should be able to install Squid using the
normal installation procedures. You may obtain the source version of Squid
from http://www.squid-cache.org/. Building from source uses the basic
'./configure; make; make install' sequence.

Once installed, you may simply run, as 'root', '/usr/sbin/squid' (or
whatever location your distribution uses, perhaps '/usr/local/sbin/'). Of
course, to do much that is useful, you will want to edit the Squid
configuration file at '/etc/squid/squid.conf',
'/usr/local/squid/etc/squid.conf', or wherever precisely your system
locates 'squid.conf'. As with almost all daemons, you may use a different
configuration file, in this case with the '-f' option.

Ports, IP addresses, http_access and ACLs

The most important configuration option for Squid is the 'http_port'
option (or options) you select. You may monitor whichever ports you wish,
optionally attaching each one to a particular IP address or hostname. The
default is port 3128, listening on every IP address the Squid server has.
To cache only for a LAN, you want to specify the local IP address instead,
e.g.:

  # default (disabled)
  # http_port 3128
  # LAN only
  http_port 192.168.2.2:3128

You may also enable cache-to-cache communication with other Squid servers
using the 'icp_port' and 'htcp_port' options. The ICP and HTCP protocols
are used by caches to talk among themselves, rather than by web servers
and clients. To cache multicasts, use 'mcast_groups'.

To let clients connect to your Squid server, you need to give them
permission to do so. Unlike a web server, Squid is typically not entirely
generous with its resources. In the simple case, we can just use a couple
of 'subnet/netmask' or CIDR (Classless Inter-Domain Routing) patterns to
control permissions, e.g.:

  #---------------- Simple Squid access permissions ---------------#
  http_access deny 10.0.1.0/255.255.255.0
  http_access allow 10.0.0.0/8
  icp_access allow 10.0.0.0/8

The 'acl' directive can be used to name access control lists (ACLs). You
can name 'src' ACLs that simply specify address ranges, as in the above
example; but you can also create other types of ACLs. For example:

  #----------------- Fine-tuned access permissions ----------------#
  acl mynetwork src 192.168/16
  acl asp urlpath_regex \.asp$
  acl bad_ports port 70 873
  acl javascript rep_mime_type -i ^application/x-javascript$
  # what HTTP access to allow the classes
  http_access deny asp          # don't cache active server pages
  http_access deny bad_ports    # don't cache gopher or rsync
  http_access deny javascript   # don't cache javascript content
  http_access allow mynetwork   # allow the LAN everything not denied

This example gives only a small subset of the available ACL types. See a
sample 'squid.conf' for examples of many others, or take a look at the
documentation on the Squid website. In this case, we decide not to cache
URLs that end with '.asp' (probably dynamic content), not to cache ports
70 and 873, and not to cache returned JavaScript objects. Other than what
is denied, machines on the LAN (the /16 range given) will have all their
requests cached. Notice that each ACL defined has a unique, but arbitrary,
name (use names that make sense, but the names are not reserved).

Caching modes

The simplest way to run Squid is in proxy mode. If you do this, clients
will need to be explicitly configured to use the cache. Web browser
clients have configuration screens that allow them to specify a proxy
address and port rather than a direct HTTP connection. This setup makes
configuring Squid very simple, but makes clients do some setup work if
they want to benefit from the Squid cache.

You may also configure Squid to run as a transparent cache. To do this you
need either to configure policy-based routing (outside of Squid itself,
using 'ipchains' or 'ipfilter') or to use your Squid server as a gateway.
Assuming you can direct external requests via the Squid server, Squid
itself needs to be configured as follows. You may need to recompile Squid
with the '--enable-ipf-transparent' option; however, in most Linux
installations this should already be fine. To configure the server for
transparent caching (once it gets the redirected packets), add something
like the following to your 'squid.conf':

  httpd_accel_host virtual
  httpd_accel_port 80
  httpd_accel_with_proxy on
  httpd_accel_uses_host_header on
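The redirection itself happens outside Squid. On a current Linux kernel
that usually means netfilter rather than the older 'ipchains'; the rule
below is a minimal sketch only, assuming the Squid box is also the LAN's
gateway, that 'eth0' is the LAN-facing interface, and that Squid listens
on its default port 3128:

  # Hand all outbound web traffic arriving from the LAN to the local Squid
  iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3128

With a rule like this in place, LAN clients need no proxy settings in
their browsers at all; their port-80 requests are silently diverted to the
cache.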