The Puzzling Quirks of Regular Expressions

Endpoint Classes

This puzzle continues the word-matching theme of the last two puzzles, with a new wrinkle. We would like to identify words that start with ‘x’ and end with ‘y’, and also words that start with ‘y’ and end with ‘x’.

Recalling the zero-width word-boundary pattern (\b) we already saw, a first try at this task might be:

>>> import re
>>> txt = """
expurgatory xylometer yex xenomorphically exquisitely
xylology xiphosurans xenophile yunx oxytocin xylogen
xeriscapes xerochasy inexplicably yonderly inexpressibly
extremity xerox xylographic complexly vexillology
xanthenes xylenol xylol yexing xylenes coextensively
"""

>>> pat6 = re.compile(r'\b[xy][a-z]*[xy]\b')

>>> re.findall(pat6, txt)
['yex', 'xenomorphically', 'xylology', 'yunx', 'xerochasy',
'yonderly', 'xerox']

What went wrong there? Clearly we matched some words we do not want, even though all of them began with ‘x’ or ‘y’ and ended with ‘x’ or ‘y’.

Before you turn the page…

Try to refine the regular expression to match what we want.

The first pattern shown allows for either ‘x’ or ‘y’ to occur at either the beginning or the end of a word. The word boundaries are handled fine, but this allows words both beginning and ending with ‘x’, and likewise beginning and ending with ‘y’. The character classes at each end of the overall pattern are independent.
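To make that independence concrete, here is a minimal sketch that splits the matches of the first pattern by whether their endpoints differ. The shortened `sample` string and the names `wanted` and `unwanted` are illustrative assumptions, not part of the puzzle; the words themselves are drawn from the text above.

```python
import re

# Shortened, hypothetical sample; the words come from the puzzle text.
sample = "yex xenomorphically yonderly xerox yunx oxytocin"

# The first-try pattern: either letter may appear at either end.
pat6 = re.compile(r'\b[xy][a-z]*[xy]\b')
matches = pat6.findall(sample)

# Words whose first and last letters differ are the ones we want;
# the pattern itself has no way to express that constraint.
wanted = [w for w in matches if w[0] != w[-1]]
unwanted = [w for w in matches if w[0] == w[-1]]

print(wanted)
print(unwanted)
```

Filtering after the fact like this is only a diagnostic; the goal of the puzzle is still to express the constraint in the pattern itself.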

This may seem obvious on reflection, but it is very much like errors I myself have made embarrassingly many times in real code. A robust approach is simply to list everything you want as alternatives in a pattern.

>>> pat7 = re.compile(r'\b((x[a-z]*y)|(y[a-z]*x))\b')
>>> [m[0] for m in re.findall(pat7, txt)]
['yex', 'xenomorphically', 'xylology', 'yunx', 'xerochasy']

In this solution, there is a little bit of Python-specific detail in the function API. The function re.findall() returns a list of tuples when a pattern contains more than one group, with one tuple element per group. Group 1 will be the whole word, but exactly one of groups 2 and 3 will be an empty string, i.e.:

>>> re.findall(pat7, txt)
[('yex', '', 'yex'),
('xenomorphically', 'xenomorphically', ''),
('xylology', 'xylology', ''),
('yunx', '', 'yunx'),
('xerochasy', 'xerochasy', '')]
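If the tuples are unwanted, two possible ways around them are sketched below: non-capturing groups, or re.finditer() with the match object's group(0). The shortened `sample` string and the names `pat_nc`, `words`, and `words_iter` are illustrative assumptions, not something the puzzle itself uses.

```python
import re

# Shortened, hypothetical sample; the words come from the puzzle text.
sample = "yex xenomorphically yonderly xerox yunx xerochasy"

# (?:...) groups without capturing, so findall() returns plain strings.
pat_nc = re.compile(r'\b(?:x[a-z]*y|y[a-z]*x)\b')
words = pat_nc.findall(sample)

# Alternatively, keep the capturing pattern and take the whole match
# (group 0) from each match object produced by finditer().
pat7 = re.compile(r'\b((x[a-z]*y)|(y[a-z]*x))\b')
words_iter = [m.group(0) for m in pat7.finditer(sample)]

print(words)
print(words_iter)
```

Both variants give the same list of whole words; which one reads better is largely a matter of taste, and the capturing version remains useful when you want to know which alternative matched.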