CESearch Technical Documentation

(This document lives at http://www.gnosis.cx/docs/cestatus_doc.html)

If you have received a printed version, consult the URL for the most current version.

Introduction to CESearch

[Description of what CESearch is]

Regular expressions in Course Comments

To facilitate automated searches for courses matching given criteria, cesearch.html is generated with embedded HTML comments that concisely describe the features of each course. A TARGET tag surrounds each such course comment, and the line containing this target and comment is always placed at the very beginning of the course description. A structured search over the course summary comments therefore yields the list of targets that match the criteria, and that list can in turn serve as the basis for a dynamic list of hyperlinks. In particular, the structure of these course comments is designed to allow easy regular expression pattern matching for specified criteria.

Although the course summary comments will be generated automatically by an HTML pre-processor, their form will remain easy for a person to read and parse. Should the need arise to create these summary comments manually, anyone with a basic familiarity with the structure should be able to do so.

The specific structure of course summary comments is documented below. For presentation purposes, lines of example HTML appear wrapped. However, to accommodate line-oriented regular expression tools, the actual cesearch.html file will contain a single line for each comment/target. A typical comment/target line might look like the following:

    <A TARGET="prv_fmi_annuities">
    <!-- Description: FMI Annuities;
         Provider: FMI;
         Lines: Life Health;
         Cost: $XII;
         Media: CBT;
         Credits: AL_X CA_VII CO_XII WY_III;
         Topics: Investment Retirement Income; --> </A>

The structure here is perhaps slightly cryptic, but on examination the form lends itself well to regular expression pattern matching (and thereby avoids requiring more complex programmatic analysis). The main "trick" is the use of Roman numerals to code cost and credit values. Written additively (for example, XXXX rather than XL for 40), Roman numerals have the nice quality that their symbols occur in strictly descending order of significance, which makes the comparisons "at least" and "no more than" much easier to formulate in regular expressions than such comparisons on Arabic numerals. Subtractive forms would break this property: XC begins with "X" yet denotes 90.
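The leading-symbol property can be checked with a short Python sketch. The function name and the purely additive encoding are assumptions made for illustration; this document does not specify how the pre-processor actually produces its numerals.

```python
# Additive Roman numerals: symbols only ever descend, with no subtractive
# pairs such as IV, IX, XL or XC. This encoding is an assumption.
SYMBOLS = [(1000, "M"), (500, "D"), (100, "C"), (50, "L"),
           (10, "X"), (5, "V"), (1, "I")]


def to_additive_roman(n: int) -> str:
    """Encode a positive integer using only descending, additive symbols."""
    if n <= 0:
        raise ValueError("Roman numerals cannot represent zero or negatives")
    parts = []
    for value, symbol in SYMBOLS:
        count, n = divmod(n, value)  # how many of this symbol, and what is left
        parts.append(symbol * count)
    return "".join(parts)


# Every value below 50 starts with X, V or I; every value from 50 up
# starts with L, C, D or M.
below_fifty = {to_additive_roman(n)[0] for n in range(1, 50)}
fifty_and_up = {to_additive_roman(n)[0] for n in range(50, 2000)}
```

With this encoding, `to_additive_roman(12)` gives `"XII"` and `to_additive_roman(90)` gives `"LXXXX"`, so a threshold that falls on a symbol boundary can be tested by looking at the first character alone.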

Within the information provided, the order of listing is strictly required. Each keyword occurs in the specified order, followed by a colon-space, followed by a space-separated list of values, terminated by a semicolon. A search simply needs to look for the relevant values after each keyword. An example helps to clarify. Suppose we wish to find a course that applies to a Life license, costs less than $50, is offered as a CBT, and provides at least 10 Alabama credits. This search could be formulated as the following regular expression:

    \(Lines: .*Life.*Cost: \$[XVI].*Media: .*CBT.*Credits: .*AL_[XLCDM]\)

Like most moderately complex regular expressions, the one above takes a moment to understand, but it is really quite simple. We require each of the keywords to occur, followed by a colon-space, then any number of other characters, and finally the relevant keyword value. The whole regular expression is just a concatenation of the several keyword/value pairs. In the case of the Cost and Credits keywords, we use the Roman numerals to match across a set of values. For example, if we want to pay less than $50, we know that the first Roman numeral "digit" after the dollar sign cannot be "L" or greater. Therefore, if the first Roman numeral "digit" after the dollar sign is an "X", "V" or "I", the total is guaranteed to be less than $50. Similarly, to ensure that the number of credits is sufficient, we look for "sufficiently large" initial "digits" ("C", "D", and "M" are probably unnecessary, since no course will offer 100 or more credits; but for completeness, we can include them in the list).
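As a rough illustration, the same search can be run with Python's `re` module. In grep and sed basic syntax `\(` and `\)` group, but in Python they would match literal parentheses, so this sketch drops them. The first sample line is the example comment joined onto one line; the other two are hypothetical entries invented to show non-matches.

```python
import re

# The example expression, rewritten in Python's regular expression syntax.
PATTERN = re.compile(
    r"Lines: .*Life.*Cost: \$[XVI].*Media: .*CBT.*Credits: .*AL_[XLCDM]"
)

SAMPLE_LINES = [
    # The documented example, on a single line as in cesearch.html.
    '<A TARGET="prv_fmi_annuities"><!-- Description: FMI Annuities; '
    'Provider: FMI; Lines: Life Health; Cost: $XII; Media: CBT; '
    'Credits: AL_X CA_VII CO_XII WY_III; '
    'Topics: Investment Retirement Income; --> </A>',
    # Hypothetical: costs $75 and offers only 5 Alabama credits.
    '<A TARGET="hyp_course_two"><!-- Description: Hypothetical Two; '
    'Provider: FMI; Lines: Life; Cost: $LXXV; Media: CBT; '
    'Credits: AL_V; Topics: Income; --> </A>',
    # Hypothetical: does not apply to a Life license.
    '<A TARGET="hyp_course_three"><!-- Description: Hypothetical Three; '
    'Provider: FMI; Lines: Property; Cost: $X; Media: CBT; '
    'Credits: AL_XX; Topics: Income; --> </A>',
]

# Keep only the lines that satisfy every criterion in the expression.
matches = [line for line in SAMPLE_LINES if PATTERN.search(line)]
```

Only the first sample line ends up in `matches`. The second fails on both cost and credits, and the third fails on the Lines keyword.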

If one is a little bit paranoid about avoiding a mismatch, it would be possible to include the "<!-- Description: " string that begins the comment. However, it is extremely unlikely that any general HTML will accidentally match the above regular expression.

The purpose of regular expressions like the example described is not, of course, that end users will compose these expressions themselves. Rather, end users will be given a form/dialog interface offering choices of minimum, maximum, or listed values for different course features (a maximum for cost, a minimum for credits needed, specific values for other criteria). From there, it is fairly easy to generate a regular expression like the one above programmatically, and to feed that expression to tools such as grep, sed, awk, Perl, or Python (among others).
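One way such generation might look is sketched below in Python. The function `build_pattern` and its parameter names are hypothetical. For simplicity, thresholds are limited to values that fall on a Roman symbol boundary (such as 10 or 50), because only the leading symbol of each numeral is examined.

```python
import re

# Roman symbols in descending order of value.
SYMBOLS = [("M", 1000), ("D", 500), ("C", 100), ("L", 50),
           ("X", 10), ("V", 5), ("I", 1)]
BOUNDARIES = {value for _, value in SYMBOLS}


def leading_class(keep):
    """Build a character class from the symbols whose value passes `keep`."""
    allowed = "".join(sym for sym, value in SYMBOLS if keep(value))
    if not allowed:
        raise ValueError("no Roman symbol satisfies this threshold")
    return "[" + allowed + "]"


def build_pattern(line=None, cost_below=None, media=None,
                  state=None, credits_at_least=None):
    """Join keyword/value fragments in the required keyword order."""
    for limit in (cost_below, credits_at_least):
        if limit is not None and limit not in BOUNDARIES:
            raise ValueError("threshold must be a Roman symbol value")
    parts = []
    if line is not None:
        parts.append("Lines: .*" + re.escape(line))
    if cost_below is not None:
        parts.append(r"Cost: \$" + leading_class(lambda v: v < cost_below))
    if media is not None:
        parts.append("Media: .*" + re.escape(media))
    if state is not None and credits_at_least is not None:
        parts.append("Credits: .*" + re.escape(state) + "_"
                     + leading_class(lambda v: v >= credits_at_least))
    return ".*".join(parts)


pattern = build_pattern(line="Life", cost_below=50, media="CBT",
                        state="AL", credits_at_least=10)
```

For the criteria in the worked example this produces an expression equivalent to the one shown earlier; the credits class comes out as `[MDCLX]`, which matches the same characters as `[XLCDM]`. Criteria that are left unset are simply omitted from the expression.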

Most likely, the predominant method of pattern matching will be a user filling out a form on a web page. After the user presses the "Submit" button, a CGI script creates and runs a regular expression based on the values specified, then returns a list of targets (using the "Description:" keyword value to present each one to the user). However, the overall design also allows the same cesearch.html source file to be used in a standalone (non-networked) application, or in batch processes. We might later decide to perform automatic filtering, sorting, or other such batch processes on the collection of courses, all using the regular expression patterns and rules discussed here.
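The filtering step shared by a CGI script and a batch process might be sketched as follows in Python. The function `matching_targets` is a hypothetical name, and the sample lines stand in for the contents of cesearch.html; a real script would iterate over the lines of that file instead.

```python
import re

# Pulls the target name and the Description value out of a comment/target line.
ENTRY = re.compile(r'<A TARGET="([^"]+)">\s*<!-- Description: ([^;]*);')


def matching_targets(lines, pattern):
    """Return (target, description) pairs for lines matching `pattern`."""
    criteria = re.compile(pattern)
    results = []
    for text in lines:
        if not criteria.search(text):
            continue  # the course does not meet the search criteria
        entry = ENTRY.search(text)
        if entry:
            results.append((entry.group(1), entry.group(2)))
    return results


# Stand-in data: the documented example plus one hypothetical course.
LINES = [
    '<A TARGET="prv_fmi_annuities"><!-- Description: FMI Annuities; '
    'Provider: FMI; Lines: Life Health; Cost: $XII; Media: CBT; '
    'Credits: AL_X CA_VII CO_XII WY_III; '
    'Topics: Investment Retirement Income; --> </A>',
    '<A TARGET="hyp_course_two"><!-- Description: Hypothetical Two; '
    'Provider: FMI; Lines: Life; Cost: $LXXV; Media: CBT; '
    'Credits: AL_V; Topics: Income; --> </A>',
]

found = matching_targets(
    LINES,
    r"Lines: .*Life.*Cost: \$[XVI].*Media: .*CBT.*Credits: .*AL_[XLCDM]",
)
```

Here `found` holds the single pair `("prv_fmi_annuities", "FMI Annuities")`. The target can be used to build a hyperlink and the description supplies its visible text.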