by David Mertz, Ph.D. <[email protected]>
In Unix-inspired operating systems such as Linux, FreeBSD, MacOSX, Solaris, AIX, and so on, a common philosophy underlies the development environment, and even just the shells and working environment. The main gist of this philosophy is using small component utilities to do each small task well (and no other thing badly), then combining utilities to perform compound tasks. Most of what has been produced by the GNU project falls under this component philosophy--and indeed the specific GNU implementations have been ported to many platforms, even ones not traditionally thought of as Unix-like. The Linux kernel, however, is of necessity a more monolithic bit of software--though even there kernel modules, filesystems, video drivers, and so on, are largely componentized.
For this column, readers should be generally familiar with some Unix-like environment, and especially with a command-line shell. Readers need not be programmers per se; in fact, the techniques described will be most useful to system administrators and users who process ad hoc reports, log files, project documentation, and the like (and less so for formal programming code processing).
If the Unix philosophy has a deontological aspect in advocating minimal modular components and cooperation, it also has an ontological aspect: "everything is a file." Abstractly, a file is simply an object that supports a few operations: first, reading and writing bytes, but also supporting operations like indicating its current position and knowing when it has reached its end. The Unix permission model is also oriented around this idea of a file.
 Concretely, a file might be an actual region on recordable
          media (with appropriate tagging of its name, size, position on
          disk, and so on, supplied by the filesystem). But a file might
          also be a virtual device in the  /dev/  hierarchy, or a
          remote stream coming over a TCP/IP socket or via a higher-level
          protocol like NFS. Importantly, the special files STDIN and STDOUT
          and STDERR can be used to read or write to the user console and/or
          to pass data between utilities. These special files can be
          indicated by virtual filenames, along with using special syntax:
          STDIN is  /dev/stdin  and/or  /dev/fd/0 ;
          STDOUT is  /dev/stdout  and/or  /dev/fd/1 ;
          STDERR is  /dev/stderr  and/or  /dev/fd/2 .
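As a quick illustrative sketch, these virtual names can stand in wherever a utility expects a filename:

           $ echo "hello" | cat /dev/stdin      # read STDIN by name
           hello
           $ echo "oops" > /dev/stderr          # write to STDERR by name
           oops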
           
The advantage and principle of Unix's file ontology is that most of the utilities discussed here will handle various data sources uniformly and neutrally, regardless of what storage or transmission mechanisms actually underlie the delivery of bytes.
The way that Unix/Linux utilities are typically combined is via piping and redirection. Many utilities either automatically or optionally take their input from STDIN, and send their output to STDOUT (with special messages sent to STDERR). A pipe sends the STDOUT of one utility to the STDIN of another utility (or to a new invocation of the same utility). A redirect either reads the content of a file as STDIN, or sends the STDOUT and/or STDERR output to a named file. Redirects are often used to save data for later or repeated processing (with the later utility runs using STDIN redirection).
 In almost all shells, piping is performed with the vertical-bar
           |  symbol, and redirection with the greater-than and
          less-than symbols:  >  and  < . To
          redirect STDERR, use  2> , or  &> 
          to redirect both STDOUT and STDERR to the same place. You may also
          use a doubled greater-than (>>) to append to the end of an
          existing file. For example: 
$ foo fname | bar - > myout 2> myerr
 The utility  foo  probably processes the file named
 fname , and outputs to STDOUT. The utility  bar  uses a
           common convention of specifying a dash when input is to be taken
           from STDIN rather than a named file (some other utilities only
          take STDIN).  The STDOUT from  bar  is saved in
           myout , and its STDERR in  myerr . 
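Continuing with the hypothetical foo, the append and combined-redirection forms look like this:

           $ foo fname >> myout       # append STDOUT to the end of myout
           $ foo fname &> combined    # send both STDOUT and STDERR to one file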
The GNU Text Utilities is a collection of some of the tools for processing and manipulating text files and streams that have proven most useful, and been refined, over the evolution of Unix-like operating systems. Most of them have been part of Unix from the earliest implementations, though many have grown additional options over time.
 The suite of utilities collected in the archive  textutils-2.1  includes twenty-seven tools; however, the GNU
          project maintainers have more recently decided to bundle these
          tools instead as part of the larger collection  coreutils-5.0  (and presumably likewise for later versions).
          On systems derived from BSD rather than GNU tools, the same
          utilities might be bundled a bit differently, but most of the same
          utilities will still be present. This tutorial will focus on the
          twenty-seven utilities traditionally included in
            textutils , with some occasional mention and use of
          related tools that are generally available on Unix-like systems.
          However, I will skip the utility  ptx  (permuted
          indexes) which is both too narrow in purpose and too difficult to
          understand for inclusion here.
           
 One tool that is not  per se  part of  textutils  still deserves special mention. The utility
           grep  is one of the most widely used Unix utilities,
          and will very often be used in pipes to or from the text
          utilities. 
 What  grep  does is in one sense very simple, in
          another sense quite complex to understand. Basically,
           grep  identifies lines in a file that match a regular
          expression. Some switches let you massage the output in various
          ways, such as printing surrounding context lines, numbering the
          matching lines, or identifying only the files in which the matches
          occur rather than individual lines. But at heart,
           grep  is just a (very powerful) filter on the lines in
          a file. The complex part of  grep  is the regular
          expressions you can specify to describe matches of interest. But
          that's another tutorial (see Resources). A number of other
          utilities also support regular expression patterns, but
           grep  is the most general such tool, and hence it is
          often easier to put  grep  in your pipeline than to use
          the weaker filters other tools provide. A quick grep example:  
 
          $ grep -c [Ss]ystem$ * 2> /dev/null | grep :[^0]$
          INSTALL:1
          aclocal.m4:2
          config.log:1
          configure:1
           
The example lists files that contain lines ending with the word "system" (perhaps with an initial capital), and also shows the number of such occurrences when it is non-zero. (Actually, the example does not handle counts greater than 9 properly.)
 While the text utilities are designed to produce outputs in
          various useful formats--often modified by command-line
          switches--there are still times when being able to explicitly
          branch and loop is useful.  Shells like  bash  let you
          combine utilities with flow control to perform more complex
          chores.  Shell scripts are especially useful to encapsulate
          compound tasks that you will perform multiple times, especially
          those involving some parameterization of the task. 
 Explaining  bash  scripting is certainly outside the
          scope of this tutorial. See Resources for an introduction to
           bash . Once you understand the text utilities, it is
          fairly simple to combine them into saved shell scripts. Just for
          illustration, here is a quick (albeit somewhat contrived) example
          of flow control with  bash : 
 
          [~/bacchus/articles/scratch/tu]$ cat flow
          #!/bin/bash
          for fname in `ls $1`; do
            if (grep $2 $fname > /dev/null); then
              echo "Creating: $fname.new" ;
              tr "abc" "ABC" < $fname > $fname.new
            fi
          done
          [~/bacchus/articles/scratch/tu]$ ./flow '*' bash
          Creating: flow.new
          Creating: test1.new
          [~/bacchus/articles/scratch/tu]$ cat flow.new
          #!/Bin/BAsh
          for fnAme in `ls $1`; do
            if (grep $2 $fnAme > /dev/null); then
              eCho "CreAting: $fnAme.new" ;
              tr "ABC" "ABC" < $fnAme > $fnAme.new
            fi
          done
           
The simplest text utilities simply output the exact contents of a file or stream to STDOUT, or perhaps a portion or simple rearrangement of those contents.
 The utility  cat  outputs a file's lines from the first to
           the last. The utility  tac  outputs lines in
           reverse order.  Both utilities will read every file specified as an
          argument, but default to STDIN if none is specified.  As with many
          utilities, you may explicitly specify STDIN using the special name
           - .  Some examples: 
 
          $ cat test2
          Alice
          Bob
          Carol
          $ tac < test3
          Zeke
          Yolanda
          Xavier
          $ cat test2 test3
          Alice
          Bob
          Carol
          Xavier
          Yolanda
          Zeke
          $ cat test2 | tac - test3
          Carol
          Bob
          Alice
          Zeke
          Yolanda
          Xavier
           
 The utilities  head  and  tail  output
          only an initial or final portion of a file or stream,
           respectively. The GNU versions of both utilities support the switch
            -c  to output a number of bytes; most often both
           utilities are used in their line-oriented mode, which outputs a
           number of lines (whatever the actual line lengths).  Both
           head  and  tail  default to outputting ten
          lines.  As with  cat  or  tac ,
           head  and  tail  default to STDIN if files
          are not specified. 
 
          $ head -c 8 test2 && echo # push prompt to new line
          Alice
          Bo
          $ /usr/local/bin/head -2 test2
          Alice
          Bob
          $ cat test3 | tail -n 2
          Yolanda
          Zeke
          $ tail -r -n 2 test3 # reverse
          Zeke
          Yolanda
           
By the way, the GNU versions of these utilities (and many others) have more flexible switches than do the BSD versions.
 The  tail  utility has a special mode indicated with
          the switches  -f  and  -F  that continues to
          display new lines written to the end of a "followed" file. The
           capitalized switch watches for truncation and renaming of the
           file as well as the simple appends that the lower-case switch
           monitors.  Follow mode is particularly useful for watching
           changes to a log file that another process might perform
           periodically. 
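For instance, assuming a typical system log path (interrupt with Ctrl-C when done watching):

           $ tail -f /var/log/messages    # print new lines as they are appended
           $ tail -F /var/log/messages    # additionally survive rotation or truncation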
 The utilities  od  and  hexdump  output
          octal, hex, or otherwise encoded bytes from a file or stream.
          These are useful for access to or visual examination of characters
           in a file that are not directly displayable on your terminal. For
           example,  cat  or  tail  do not directly
           disambiguate between tabs, spaces, or other whitespace--you can
           check which characters are used with  hexdump .
           Depending on your system type, either or both of these two
          utilities will be available--BSD systems deprecate  od 
          for  hexdump , GNU systems the reverse. The two
          utilities, however, have exactly the same purpose, just slightly
          different switches. 
 
          $ od test3 # default output format
          0000000 054141 073151 062562 005131 067554 060556 062141 005132
          0000020 062553 062412
          0000024
           $ od -w8 -x test3 # 8 bytes per line, as hex
          0000000 5861 7669 6572 0a59
          0000010 6f6c 616e 6461 0a5a
          0000020 656b 650a
          0000024
           $ od -c test3 # as escaped ASCII chars
          0000000   X   a   v   i   e   r  \n   Y   o   l   a   n   d   a  \n   Z
          0000020   e   k   e  \n
          0000024
           
 As with other utilities,  od  and  hexdump  accept input from STDIN or from one or more named files.
          As well, the  od  switches  -j  and
           -N  let you skip initial bytes and limit the number
           read, respectively. You may customize output formats even further
           than the standard switches allow; for instance,  hexdump 
           accepts  fprintf() -like formatting strings via its
            -e  switch. 
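As a quick sketch of skipping and limiting, using the test3 file from earlier (its first seven bytes are "Xavier" plus a newline; the offset column continues to show positions within the original file):

           $ od -c -j 7 -N 8 test3    # skip 7 bytes, read at most 8
           0000007   Y   o   l   a   n   d   a  \n
           0000017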
 There is a special kind of redirection that is worth noting in
          this tutorial.  While HERE documents are, strictly speaking, a
          feature of shells like  bash  rather than anything to
          do with the text utilities, they provide a useful way of sending
          ad hoc data to the text utilities (or to other applications). 
 Redirection with a double less-than can be used to take
           pseudo-file contents from the terminal.  A HERE document must
           specify a terminating delimiter immediately after its
           << .  For example: 
 
          $ od -c <<END
          > Alice
          > Bob
          > END
          0000000   A   l   i   c   e  \n   B   o   b  \n
          0000012
           
Any string may be used as a delimiter; input is terminated when the string occurs on a line by itself. This gives us a quick way to create a persistent file:
 
          $ cat > myfile <<EOF
          > Dave
          > Edna
          > EOF
          $ hexdump -C myfile
          00000000  44 61 76 65 0a 45 64 6e  61 0a            |Dave.Edna.|
          0000000a
           
Many Linux utilities view files as a line-oriented collection of records or data. This has proved a very convenient way of aggregating data collections in ways that are both readable to people and easy to process with tools. The simple trick is to treat each newline as a delimiter between records, where each record has a similar format.
 As a practical matter, line-oriented records usually should
           have a relatively limited length--perhaps up to a few
           hundred characters.  While none of the text utilities have such a
           limit built into them, human eyes have trouble working with
          extremely long lines, even if auto-wrapping or horizontal
          scrolling is used.  Either a more complex structured data format
          might be used in such cases, or records might be broken into
          multiple lines (perhaps flagged for type in a way that
           grep  can sort out).  As a simple example, you might
          preserve a hierarchical multi-line data format using prefix
          characters: 
 
          $ cat multiline
          A Alice Aaronson
          B System Administrator
          C 99 9th Street
          A Bob Babuk
          B Programmer
          C 7 77th Avenue
          $ grep '^A ' multiline # names only
          A Alice Aaronson
          A Bob Babuk
          $ grep '^C ' multiline # address only
          C 99 9th Street
          C 7 77th Avenue
           
 The output from one of these  grep  filters is a
           usable newline-delimited collection of partial records with the
          field(s) of interest. 
 The utility  cut  writes fields from a file to the
          standard output, where each line is treated as a delimited
          collection of fields. The default delimiting character is a tab,
          but this can be changed with the short form option  -d <DELIM>  or the long form option  --delimiter=<DELIM> . 
 You may select one or more fields with the  -f 
          switch. The  -c  switch selects specific character
          positions from each line instead. Either switch will accept comma
          separated numbers or ranges as parameters (including open ranges).
          For example, we can see that the file  employees  is
          tab delimited: 
 
          $ cat employees
          Alice Aaronson  System Administrator    99 9th Street
          Bob Babuk       Programmer      7 77th Avenue
          Carol Cavo      Manager 111 West 1st Blvd.
          $ hexdump -n 50 -c employees
          0000000   A   l   i   c   e       A   a   r   o   n   s   o   n  \t   S
          0000010   y   s   t   e   m       A   d   m   i   n   i   s   t   r   a
          0000020   t   o   r  \t   9   9       9   t   h       S   t   r   e   e
          0000030   t  \n
          0000032
          $ cut -f 1,3 employees
          Alice Aaronson  99 9th Street
          Bob Babuk       7 77th Avenue
          Carol Cavo      111 West 1st Blvd.
          $ cut -c 1-3,20,25- employees
          Alieministrator 99 9th Street
          Bobr7th Avenue
          Car1est 1st Blvd.
           
Later examples will utilize custom delimiters other than tabs.
 The utilities  expand  and  unexpand 
          convert tabs to spaces and vice-versa. A tab is considered to
          align at specific columns, by default every eight columns, so the
          specific number of spaces that correspond to a tab depends on
           where those spaces or tabs occur. Unless you specify the
           -a  option,  unexpand  will only entab the
          initial whitespace (the default is useful for reformatting source
          code).  
 Continuing with the  employees  file of the last
           panel, we can perform some substitutions.  Notice that after you
           run  unexpand , tabs in the output may be followed by
          some spaces in order to produce the needed overall alignment. 
 
          $ cat -T employees  # show tabs explicitly
          Alice Aaronson^ISystem Administrator^I99 9th Street
          Bob Babuk^IProgrammer^I7 77th Avenue
          Carol Cavo^IManager^I111 West 1st Blvd.
          $ expand -25 employees
          Alice Aaronson           System Administrator     99 9th Street
          Bob Babuk                Programmer               7 77th Avenue
          Carol Cavo               Manager                  111 West 1st Blvd.
          $ expand -25 employees | unexpand -a | hexdump -n 50 -c
          0000000   A   l   i   c   e       A   a   r   o   n   s   o   n  \t  \t
          0000010       S   y   s   t   e   m       A   d   m   i   n   i   s   t
          0000020   r   a   t   o   r  \t           9   9       9   t   h       S
          0000030   t   r
          0000032
           
 The  fold  utility simply forces lines in a file to
          wrap. By default, wrapping is to 80 columns, but you may specify
          other widths. You get a limited sort of word-wrap formatting with
           fold , but it will not fully rewrap paragraphs. The
          option  -s  is useful for at least forcing new line
          breaks to occur on whitespace. Using a recent article of mine as a
          source (and clipping an example portion using tools we've seen
          earlier): 
 
          $ tail -4 rexx.txt | cut -c 3-
          David Mertz' fondness for IBM dates back embarrassingly many decades.
          David may be reached at [email protected]; his life pored over at
          http://gnosis.cx/publish/. And buy his book: _Text Processing in
          Python_ (http://gnosis.cx/TPiP/).
          $ tail -4 rexx.txt | cut -c 3- | fold -w 50
          David Mertz' fondness for IBM dates back embarrass
          ingly many decades.
          David may be reached at [email protected]; his life
          pored over at
          http://gnosis.cx/publish/. And buy his book: _Text
           Processing in
          Python_ (http://gnosis.cx/TPiP/).
          $ tail -4 rexx.txt | cut -c 3- | fold -w 50 -s
          David Mertz' fondness for IBM dates back
          embarrassingly many decades.
          David may be reached at [email protected]; his life
          pored over at
          http://gnosis.cx/publish/. And buy his book:
          _Text Processing in
          Python_ (http://gnosis.cx/TPiP/).
           
 For most purposes,  fmt  is a more useful tool for
          wrapping lines than is  fold . The utility
           fmt  will wrap lines, while both preserving initial
          indentation and aggregating lines for paragraph balance (as
          needed).  fmt  is useful for formatting documents such
          as email messages before transmission or final storage.  
 
          $ tail -4 rexx.txt  | fmt -40 -w50 # goal 40, max 50
            David Mertz' fondness for IBM dates back
            embarrassingly many decades.  David may be
            reached at [email protected]; his life pored
            over at http://gnosis.cx/publish/. And
            buy his book: _Text Processing in Python_
            (http://gnosis.cx/TPiP/).
          $ tail -4 rexx.txt  | fold -40
            David Mertz' fondness for IBM dates ba
          ck embarrassingly many decades.
            David may be reached at [email protected]
          x; his life pored over at
            http://gnosis.cx/publish/. And buy his
           book: _Text Processing in
            Python_ (http://gnosis.cx/TPiP/).
           
 The GNU version of  fmt  provides several options
           controlling how the indentation of first and subsequent lines is
           used to determine paragraph style.  An option particularly likely
          to be useful is  -u  which normalizes word and sentence
          spaces. 
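A quick illustration of -u collapsing irregular runs of spaces between words:

           $ echo "one    two  three" | fmt -u
           one two three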
 The utility  nl  numbers the lines in a file, with a
           variety of options for how numbers appear.  For most purposes,
            cat  contains the line numbering options you will
           need--choose the more general tool,  cat ,
           when it does what you need.  Only in special cases such as
          controlling display of leading zeros is  nl  needed
          (historically,  cat  did not always include line
          numbering). 
 
          $ nl -w4 -nrz -ba rexx.txt | head -6  # width 4, zero padded
          0001    LINUX ZONE FEATURE: Regina and NetRexx
          0002    Scripting with Free Software Rexx Implementations
          0003
          0004    David Mertz, Ph.D.
          0005    Text Processor, Gnosis Software, Inc.
          0006    January, 2004
          $ cat -b rexx.txt | head -6   # don't number bare lines
               1  LINUX ZONE FEATURE: Regina and NetRexx
               2  Scripting with Free Software Rexx Implementations
               3  David Mertz, Ph.D.
               4  Text Processor, Gnosis Software, Inc.
               5  January, 2004
           
Aside from making discussions of lines within files easier, line numbers potentially provide sort or filter criteria for downstream processes.
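For example, here is a small sketch of the decorate/sort/undecorate pattern using the test3 file from earlier: number the lines, scramble them (here simply with tac), then restore the original order numerically and strip the numbers:

           $ nl test3 | tac | sort -n | cut -f 2-
           Xavier
           Yolanda
           Zeke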
 The utility  tr  is a powerful tool for transforming
          the characters that occur within a file--or rather, within STDIN,
          since  tr  operates exclusively on STDIN and writes
          exclusively to STDOUT (redirection and piping is allowed, of
          course). 
 tr  is somewhat more limited in capability than is
          its big sibling  sed , which is not included in the
          text utilities (nor in this tutorial) but is still almost always
          available on Unix-like systems. Where  sed  can perform
          general replacements of regular expressions,  tr  is
          limited to replacing and deleting single characters (it has no
          real concept of context). At its most basic,  tr 
          replaces the characters of STDIN that are contained in a source
          string with those in a target string. 
 A simple example helps illustrate  tr . We might
           have a file with variable numbers of tabs and spaces, and wish to
           normalize these separators, replacing them with a new
          delimiter.  The trick is to use the  -s  (squeeze) flag
          to eliminate runs of the same character:
           
 
          $ expand -26 employees | unexpand -a > empl.multitab
          $ cat -T empl.multitab
          Alice Aaronson^I^I  System Administrator^I    99 9th Street
          Bob Babuk^I^I  Programmer^I^I    7 77th Avenue
          Carol Cavo^I^I  Manager^I^I    111 West 1st Blvd.
          $ tr -s "\t " "| " < empl.multitab | /usr/local/bin/cat -T
          Alice Aaronson| System Administrator| 99 9th Street
          Bob Babuk| Programmer| 7 77th Avenue
          Carol Cavo| Manager| 111 West 1st Blvd.
           
 As well as translating explicitly listed characters,
           tr  supports ranges and several named character
          classes.  For example, to translate lower-case characters to
          upper-case, you may use either of: 
 
          $ tr "a-z" "A-Z" < employees
          ALICE AARONSON  SYSTEM ADMINISTRATOR    99 9TH STREET
          BOB BABUK       PROGRAMMER      7 77TH AVENUE
          CAROL CAVO      MANAGER 111 WEST 1ST BLVD.
          $ tr [:lower:] [:upper:] < employees
          ALICE AARONSON  SYSTEM ADMINISTRATOR    99 9TH STREET
          BOB BABUK       PROGRAMMER      7 77TH AVENUE
          CAROL CAVO      MANAGER 111 WEST 1ST BLVD.
           
If the second range is not as long as the first, the second is padded with occurrences of its last character:
 
          $ tr [:upper:] "a-l#" < employees
          alice aaronson  #ystem administrator    99 9th #treet
          bob babuk       #rogrammer      7 77th avenue
          carol cavo      #anager 111 #est 1st blvd.
           
You may also delete characters from the STDIN stream. Typically you might delete special characters like formfeeds or high-bit characters that you want to filter out. But here, let us continue with the prior example:
 
          $ tr -d [:lower:] < employees
          A A     S A     99 9 S
          B B     P       7 77 A
          C C     M       111 W 1 B.
           
The tools we have seen so far operate on each line individually. Another subset of the text utilities treats files as collections of lines, and performs some kind of global manipulation on those lines.
Pipes under Unix-like operating systems can operate very efficiently in terms of memory and latency. When a process earlier in a pipe produces a line to STDOUT, that line is immediately available to the next stage. However, the utilities below will not produce output until they have (mostly) completed their processing. For large files, some of these utilities can take a while to complete (but they are nonetheless all well optimized for the tasks they perform).
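A quick, admittedly contrived way to see the difference (some.log is a hypothetical log file that is still being written to; interrupt each pipeline with Ctrl-C):

           $ tail -f some.log | head -3    # head prints as soon as lines arrive
           $ tail -f some.log | sort       # sort emits nothing until its input ends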
 The utility  sort  does just what the name
          suggests: it sorts the lines within a file or files.  A variety of
          options exist to allow sorting on fields or character positions
          within the file, and to modify the comparison operation (numeric,
          date, case-insensitive, etc). 
A common use of sort is in combining multiple files. Building on our earlier example:
 
          $ cat employees2
          Doug Dobrovsky  Accountant      333 Tri-State Road
          Adam Aman       Technician      4 Fourth Street
          $ sort employees employees2
          Adam Aman       Technician      4 Fourth Street
          Alice Aaronson  System Administrator    99 9th Street
          Bob Babuk       Programmer      7 77th Avenue
          Carol Cavo      Manager 111 West 1st Blvd.
          Doug Dobrovsky  Accountant      333 Tri-State Road
           
Field and character position within a field may be specified as sort criteria, as may use of numeric sorting:
 
          $ cat namenums
          Alice   123
          Bob     45
          Carol   6
          $ sort -k 2.1 -n namenums
          Carol   6
          Bob     45
          Alice   123
           
 The utility  uniq  removes adjacent lines which are
          identical to each other--or if some switches are used, close
          enough to count as identical (you may skip fields, character
           positions, or compare as case-insensitive).  Most often, the
          input to  uniq  is the output from  sort ,
          though GNU  sort  itself contains a limited ability to
          eliminate duplicate lines with the  -u  switch. 
 The most typical use of  uniq  is in the expression
           sort list_of_things | uniq , producing a list with
          just one of each item (one per line).  But some fancier uses let
          you analyze duplicates or use different duplication criteria: 
 
          $ uniq -d test5 # identify duplicates
          Bob
          $ uniq -c test5 # count occurrences
          1 Alice
          2 Bob
          1 Carol
          $ cat test4
          1       Alice
          2       Bob
          3       Bob
          4       Carol
          $ uniq -f 1 test4  # skip first field in comparisons
          1       Alice
          2       Bob
          4       Carol
           
 The utility  tsort  is a bit of an oddity in the
          text utilities collection.  The utility itself is quite useful in
          a limited context, but what it does is not something you would
           typically think of as text processing-- tsort  performs
          a  topological  sort on a directed graph.  Don't panic just
          yet if this concept is not familiar to you: in simple terms,
           tsort  is good for finding a suitable order among
          dependencies.  For example, installing packages might need to
          occur with certain order constraints, or some system daemons
          might need to be initialized before others. 
 Using  tsort  is quite simple, really.  Just create
          a file (or stream) that lists each known dependency (space
          separated).  The utility will produce a suitable (not necessarily
          uniquely so) order for the whole collection.  E.g.: 
 
          $ cat dependencies # not necessarily exhaustive, but realistic
          libpng XFree86
          FreeType XFree86
          Fontconfig XFree86
          FreeType Fontconfig
          expat Fontconfig
          Zlib libpng
          Binutils Zlib
          Coreutils Zlib
          GCC Zlib
          Glibc Zlib
          Sed Zlibc
          $ tsort dependencies
          Sed
          Glibc
          GCC
          Coreutils
          Binutils
          Zlib
          expat
          FreeType
          libpng
          Zlibc
          Fontconfig
          XFree86
           
 The  pr  utility is a general page formatter for
          text files that provides facilities such as page headers,
           formfeeds, columnization of source texts, indentation margins, and
          configurable page and line width.  However,  pr  does
          not itself rewrap paragraphs, and so might often be used in
          conjunction with  fmt . 
 
          $ tail -5 rexx.txt | pr -w 60 -f | head
          2004-01-31 03:22                                      Page 1
            {Picture of Author: http://gnosis.cx/cgi/img_dqm.cgi}
            David Mertz' fondness for IBM dates back embarrassingly many decades.
            David may be reached at [email protected]; his life pored over at
            http://gnosis.cx/publish/. And buy his book: _Text Processing in
            Python_ (http://gnosis.cx/TPiP/).
           
 
          $ tail -5 rexx.txt | fmt -30 > blurb.txt
          $ pr blurb.txt -2 -w 65 -f | head
          2004-01-31 03:24                 blurb.txt                 Page 1
            {Picture of Author:              at [email protected]; his life
            http://gnosis.cx/cgi-bin/img_d   pored over at http://gnosis.cx
            David Mertz' fondness for IBM    And buy his book: _Text
            dates back embarrassingly many   Processing in Python_
            decades.  David may be reached   (http://gnosis.cx/TPiP/).
           
 The utility  comm  is used to compare the contents
           of already (alphabetically) sorted files.  This is useful when the
          lines of files are considered as unordered collections of
           items .  The  diff  utility, though not included
          in the text utilities, is a more general way of comparing files
          that might have isolated modifications--but that are treated in an
          ordered manner (such as source code files or documents).  On the
          other hand, files that are considered as fields of records do not
          have any inherent order, and sorting does not change the
          information content. 
Let us look at the difference between two sorted lists of names; the columns displayed are those in the first file only, those in the second only, and those in common:
 
          $ comm test2b test2c
                          Alice
          Betsy
                          Bob
          Brian
                  Cal
                          Carol
           
 Introducing an out-of-order name, we see that  diff 
          compares happily, while  comm  fails to identify
          overlaps anymore: 
 
          $ cat test2d
          Alice
          Zack
          Betsy
          Bob
          Carol
          $ diff -U 2 test2d test2c
          --- test2d      Sun Feb  1 18:18:26 2004
          +++ test2c      Sun Feb  1 18:01:49 2004
          @@ -1,5 +1,4 @@
           Alice
          -Zack
          -Betsy
           Bob
          +Cal
           Carol
          $ comm test2d test2c
                          Alice
                  Bob
                  Cal
                  Carol
          Zack
          Betsy
          Bob
          Carol
           
 The utility  join  is quite interesting; it performs
          some basic relational calculus (as will be familiar to readers who
          know relational database theory). In short,  join  lets
          you find records that share fields between (sorted) record
          collections. For example, you might be interested in which IP
           addresses have visited both your web site and your FTP site,
          along with information on these visits (resources requested,
          times, etc., which will be in your logs).  
 To present a simple example, suppose you issue color-coded
          access badges to various people: vendors, partners, employees.
          You'd like information on which badge types have been issued to
          employees.  Notice that names are the first field in
           employees , but second in  badges , all tab
          separated: 
 
          $ cat employees
          Alice Aaronson  System Administrator    99 9th Street
          Bob Babuk       Programmer      7 77th Avenue
          Carol Cavo      Manager 111 West 1st Blvd.
          $ cat badges
          Red     Alice Aaronson
          Green   Alice Aaronson
          Red     David Decker
          Blue    Ernestine Eckel
          Green   Francis Fu
          $ join -1 2 -2 1 -t $'\t' badges employees
          Alice Aaronson  Red     System Administrator    99 9th Street
          Alice Aaronson  Green   System Administrator    99 9th Street
           
 The utility  paste  is approximately the reverse
          operation of that performed by  cut .  That is,
           paste  combines multiple files into columns, e.g.
          fields.  By default, the corresponding lines between files are tab
          separated, but you may use a different delimiter by specifying a
           -d  option. 
 While  paste  can combine unrelated files (leaving
          empty fields if one input is longer), it generally makes the most
          sense to  paste  synchronized data sources.  One
          example of this is in reorganizing the fields of an existing data
          file, e.g.:  
 
          $ cut -f 1 employees > names
          $ cut -f 2 employees > titles
          $ paste -d "," titles names
          System Administrator,Alice Aaronson
          Programmer,Bob Babuk
          Manager,Carol Cavo
           
 The flag  -s  lets you reverse the use of rows and
          columns, which amounts to converting successive lines in a file
          into delimited fields: 
 
          $ paste -s titles | cat -T
          System Administrator^IProgrammer^IManager
           
 The utility  split  simply divides a file into
          multiple parts, each one of a specified number of lines or bytes
          (the last one perhaps smaller). The parts are written to files
          whose names are sequenced with two suffix letters (by default
           xaa ,  xab , ...  xzz ) . 
 While  split  can be useful just in managing the
           size of large files or data sets, it is more interesting in
           processing more structured data. For example, in the panel "Lines
           as Records" we saw an example of splitting fields across
           lines--what if we want to assemble those back into
            employees -style tab-separated fields, one record per line?
           Here is a way to do it: 
 
          $ cut -b 3- multiline | split -l 3 - employee
          $ cat employeeab
          Bob Babuk
          Programmer
          7 77th Avenue
          $ paste -s employeea*
          Alice Aaronson  System Administrator    99 9th Street
          Bob Babuk       Programmer      7 77th Avenue
           
 The utility  csplit  is similar to
           split , but it divides files based on context lines
          within them, rather than on simple line/byte counts.  You may
          divide on one or more different criteria within a command, and may
          repeat each criterion however many times you wish.  The most
          interesting criterion-type is regular expressions to match against
          lines.  For example, as an odd cut-up of  multiline :
           
 
          $ csplit multiline -zq 2 /99/ /Progr/ # line 2, find 99, find Progr
          $ cat xx00
          A Alice Aaronson
          $ cat xx01
          B System Administrator
          $ cat xx02
          C 99 9th Street
          A Bob Babuk
          $ cat xx03
          B Programmer
          C 7 77th Avenue
           
The above division is a bit perverse in that it does not correspond with the data structure. A more usual approach might be to arrange to have delimiter lines, and split on those throughout:
 
          $ head -5 multiline2
          Alice Aaronson
          System Administrator
          99 9th Street
          -----
          Bob Babuk
          $ csplit -zq multiline2 /-----/+1 {*} # incl dashes at end, per chunk
          $ cat xx01
          Bob Babuk
          Programmer
          7 77th Avenue
          -----
           
Most of the tools we have seen before produce output that is largely reversible to the original form--or at the least, each line of input contributes in some straightforward way to the output. A number of tools in the GNU Text Utilities are instead best described as producing a summary of a file. Specifically, the output of these utilities is generally much shorter than their input, and the utilities all discard most of the information in their input (technically, you could describe them as one-way functions).
 About the simplest one-way function on an input file is to
          count its lines, words, and/or bytes, which is what
           wc  does.  These are interesting things to know about
          a file, but are clearly non-unique among distinct files.  For
          example: 
 
          $ wc rexx.txt # lines, words, chars, name
               402    2585   18231 rexx.txt
          $ wc -w < rexx.txt # bare word count
              2585
          $ wc -lc rexx.txt # lines, chars, name
               402   18231 rexx.txt
           
 Put to a bit of use: suppose I wonder which of the developerWorks
           articles I have written are the wordiest. I might use the
           following (note the inclusion of a total; another pipe to
            tail  could remove that): 
 
          $ wc -w *.txt | sort -nr | head -4
              55190 total
               3905 quantum_computer.txt
               3785 filtering-spam.txt
               3098 linuxppc.txt
           
 The utilities  cksum  and  sum  produce
          checksums and block counts of files.  The latter exists for
          historical reasons only, and implements a less robust method.
          Either utility produces a calculated value that is unlikely to be
          the same between randomly chosen files.  In particular, a checksum
          lets you establish to a reasonable degree of certainty that a file
          has not become corrupted in transmission or accidentally modified.
           cksum  implements four successively more robust
          techniques, where  -o 1  is the behavior of
           sum , and the default (no switch) is best. 
 
          $ cksum rexx.txt
          937454632 18231 rexx.txt
          $ cksum -o 3 < rexx.txt
          4101119105 18231
          $ cat rexx.txt | cksum -o 2
          47555 36
          $ cksum -o 1 rexx.txt
          10915 18 rexx.txt
           
 The utilities  md5sum  and  sha1sum  are
           similar in concept to  cksum .  Note, by the
          way, that in BSD-derived systems, the former command goes by the
          name  md5 .  However,  md5sum  and
           sha1sum  produce 128-bit and 160-bit checksums,
          respectively, rather than the 16-bit or 32-bit outputs of
           cksum .  Checksums are also called hashes. 
The difference in checksum lengths gives a hint as to a difference in purpose. In truth, comparing a 32-bit hash value is quite unlikely to falsely indicate that a file was transmitted correctly and left unchanged. But protection against accidents is a much weaker standard than protection against malicious tamperers. An MD5 or SHA hash is a value that is computationally infeasible to spoof. The hash length of a cryptographic hash like MD5 or SHA is necessary for its strength, but a lot more than just the length went into their design.
 The scenario to imagine is where you are sent a file on an
          insecure channel.  In order to make sure that you receive the real
          data rather than some malicious substitute, the sender publishes
          (through a different channel) an MD5 or SHA hash for the file.  An
          adversary cannot create a false file with the published MD5/SHA
          hash--the checksum, for practical purposes, uniquely identifies
          the desired file.  While  sha1sum  is actually a bit
          better cryptographically, for historical reasons,
           md5sum  is in more widespread use. 
 
          $ md5sum rexx.txt
          2cbdbc5bc401b6eb70a0d127609d8772  rexx.txt
          $ cat md5s
          2cbdbc5bc401b6eb70a0d127609d8772  rexx.txt
          c8d19740349f9cd9776827a0135444d5  metaclass.txt
          $ md5sum -cv md5s
          rexx.txt       OK
          metaclass.txt  FAILED
          md5sum: 1 of 2 file(s) failed MD5 check
           
A weblog file provides a good data source for demonstrating a variety of real-world uses of the text utilities. Standard Apache log files contain a variety of space-separated fields per line, with each line describing one access to a web resource. Unfortunately for us, spaces also occur at times inside quoted fields, so processing is often not quite as simple as we might hope (or as it might be if the delimiter were excluded from the fields). Oh well, we must work with what we are given.
Let us take a look at a line from one of my weblogs before we perform some tasks with it:
 
          $ wc access-log
             24422  448497 5075558 access-log
          $ head -1 access-log | fmt -25
          62.3.46.183 - -
          [28/Dec/2003:00:00:16 -0600]
          "GET /TPiP/cover-small.jpg
          HTTP/1.1" 200 10146
          "http://gnosis.cx/publish/programming/regular_expressions.html"
          "Mozilla/4.0 (compatible;
          MSIE 6.0; Windows NT 5.1)"
           
 We can see that the original file is pretty large--24k records.
          Wrapping the fields with  fmt  does not always
          wrap on field boundaries, but the quotes let you see what the
          fields are. 
A very simple task to perform on a weblog file is to extract all the IP addresses of visitors to the site. This combines a few of our utilities in a common pipe pattern (let us look at only the first few):
 
          $ cut -f 1 -d " " access-log | sort | uniq | head -5
          12.0.36.77
          12.110.136.90
          12.110.238.249
          12.111.153.49
          12.13.161.243
           
We might wonder, as well, just how many such distinct visitors have visited in total:
 
          $ cut -f 1 -d " " access-log | sort | uniq | wc -l
              2820 
           
In the last panel, we determined how many visitors our website got, but perhaps we are also interested in how much each of those 2820 visitors contributes to the overall 24422 hits. Or specifically, who are the most frequent visitors? In one line we can run:
 
          $ cut -f 1 -d " " access-log | sort | uniq -c | sort -nr | head -5 
          1264 131.111.210.195
           524 213.76.135.14
           307 200.164.28.3
           285 160.79.236.146
           284 128.248.170.115
           
While this approach works, it might be nice to pull out the histogram part into a reusable shell script:
 
          $ cat histogram 
          #!/bin/sh
          sort | uniq -c | sort -nr | head -n $1
          $ cut -f 1 -d " " access-log | ./histogram 3
          1264 131.111.210.195
           524 213.76.135.14
           307 200.164.28.3
           
 Now we can pipe any line-oriented list of items to our
           histogram  shell script.  The number of most frequent
          items we want to display is a parameter passed to the script. 
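For instance (output omitted here), the same pattern ranks the most frequent request lines or user-agent strings, cutting on the quoted fields of the log:

           $ cut -f 2 -d \" access-log | ./histogram 5    # most requested lines
           $ cut -f 6 -d \" access-log | ./histogram 5    # most common user agents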
Sometimes existing data files contain information we need, but not necessarily in the arrangement needed by a downstream process. As a basic example, suppose you want to pull several fields out of the weblog shown above, and combine them in different order (and skipping unneeded fields):
 
          $ cut -f 6 -d \" access-log > browsers
          $ cut -f 1 -d " " access-log > ips
           $ cut -f 2 -d \" access-log | cut -f 2 -d " " \
                  | tr "/" ":" > resources
          $ paste resources browsers ips > new.report
          $ head -2 new.report | tr "\t" "\n"
          :TPiP:cover-small.jpg
          Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
          62.3.46.183
          :publish:programming:regular_expressions.html
          Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
          62.3.46.183
           
 The line that produces  resources  uses two passes 
          through  cut , with different delimiters.  That is 
          because what  access-log  thinks of as a REQUEST 
          contains more information than we want as a RESOURCE, i.e.: 
 
          $ cut -f 2 -d \" access-log | head -1
          GET /TPiP/cover-small.jpg HTTP/1.1
           
We also decide to massage the path delimiter in the Apache log to use a colon path separator (which is the old MacOS format, but we really just do it here to show a type of operation).
This next script combines much of what we have seen in this tutorial into a rather complex pipeline. Suppose that we know we want to cut a field from a data file, but we do not know its field position. Obviously, visual examination could provide the answer, but to automate processing different types of data files, we can cut whichever field matches a regular expression:
 
          $ cat cut_by_regex
          #!/bin/sh
          # USAGE: cut_by_regex <pattern> <file> <delim> 
          cut -d "$3" -f \
            `head -n 1 $2 | tr "$3" "\n" | nl | \
             egrep $1 | cut -f 1 | head -1` \
            $2 
           
In practice, we might use this as, e.g.:
 
          $ ./cut_by_regex "([0-9]+\.){3}" access-log " " | ./histogram 3
          1264 131.111.210.195
           524 213.76.135.14
           307 200.164.28.3
           
 Several parts of this could use further explanation.  The 
           backtick is a special syntax in  bash  to treat the 
           result of a command as an argument to another command. 
           Specifically, the pipe in the backticks produces the  first 
           field number that matches the regular expression given as the
           first argument.  How does it manage this? First we pull off only
           the first line of the data file; then we transform the specified
           delimiter to a newline (one field per line now); then we number
           the resulting lines/fields; then we search for a line with the 
           desired pattern; then we cut just the field number from 
           the line; and finally, we only take the first match, even if 
           several fields match.  It takes a bit of thought to put a good
           pipeline together, but a lot can be done this way. 
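To see the inner backtick pipeline in isolation, here is roughly what it produces for access-log with a space delimiter and the IP-address pattern; the IP address is the first field, so the field number found is 1:

           $ head -n 1 access-log | tr " " "\n" | nl | \
               egrep "([0-9]+\.){3}" | cut -f 1 | head -1
                1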
This tutorial only directly presents a small portion of what you can achieve with creative use of the GNU Text Utilities. The final few examples start to give a good sense of just how powerful they can be with creative use of pipes and redirection. The key is to break an overall transformation down into useful intermediate data, either saving that intermediary to another file or piping it to a utility that deals with that data format.
I wish to thank my colleague Andrew Blais for assistance in the preparation of this tutorial.
You can download the 27 GNU Text Utilities from their FTP site.
The most current utilities have been incorporated into the GNU Core Utilities (88 in all).
Peter Seebach's The art of writing Linux utilities: Developing small, useful command-line tools
 David Mertz's  Regular Expression Tutorial  is a
          good starting point for understanding the tools like
           grep  and  csplit  that utilize regular
          expressions. 
David's book Text Processing in Python (Addison Wesley, 2003; ISBN: 0-321-11254-7) also contains an introduction to regular expressions, as well as extensive discussion of performing many of the techniques in this tutorial using Python.
The Linux Zone article Scripting with Free Software Rexx Implementations , written by David Mertz, might be useful for an alternative approach to simple text processing tasks. The scope of the text utilities is nearly identical to the core purpose of the Rexx programming language.
Please let us know whether this tutorial was helpful to you and how we could make it better. We'd also like to hear about other tutorial topics you'd like to see covered.
For questions about the content of this tutorial, contact the author, David Mertz, at [email protected] .