The Puzzling Quirks of Regular Expressions

  1. Acknowledgments
  2. Rights of (Wo)Man
  3. Credits
  4. Preface
  5. Quantifiers and Special Sub-Patterns
    1. Wildcard Scope
    2. Words and Sequences
    3. Endpoint Classes
    4. A Configuration Format
    5. The Human Genome
  6. Pitfalls and Sand in the Gears
    1. Catastrophic Backtracking
    2. Playing Dominoes
    3. Advanced Dominoes
    4. Sensor Art
  7. Creating Functions using Regexen
    1. Reimplementing str.count()
    2. Reimplementing str.count() (stricter)
    3. Finding a Name for a Function
    4. Playing Poker (Part 1)
    5. Playing Poker (Part 2)
    6. Playing Poker (Part 3)
    7. Playing Poker (Part 4)
    8. Playing Poker (Part 5)
  8. Easy, Difficult, and Impossible Tasks
    1. Identifying Equal Counts
    2. Matching Before Duplicate Words
    3. Testing an IPv4 Address
    4. Matching a Numeric Sequence
    5. Matching the Fibonacci Sequence
    6. Matching the Prime Numbers
    7. Matching Relative Prime Numbers
Basket_Recto

Support the author!
Lulu Editions
Paypal Donation
Other Publications

Words and Sequences

In the previous problem, we identified words that started with ‘x’ and ended with ‘y’. You may have noticed, however, that we had already included the assumption that all the words started with ‘x’. Perhaps your solution was clever enough not to fall for the danger shown in this puzzle. Namely, perhaps not all words will actually start with ‘x’ to begin with; i.e. if we try to apply our previous regex to such text.

>>> txt = """
expurgatory xylometer xenotime xenomorphically exquisitely
xylology xiphosurans xenophile oxytocin xylogen
xeriscapes xerochasy inexplicably exabyte inexpressibly
extremity xiphophyllous xylographic complexly vexillology
xanthenes xylenol xylol xylenes coextensively
"""
>>> pat3 = re.compile(r'x[a-z]*y')
>>> re.findall(pat3, txt)
['xpurgatory', 'xy', 'xenomorphically', 'xquisitely',
'xylology', 'xy', 'xy', 'xerochasy', 'xplicably', 'xaby',
'xpressibly', 'xtremity', 'xiphophy', 'xy', 'xly',
'xillology', 'xy', 'xy', 'xy', 'xtensively']

As you can see, we matched a number of substrings within words, not only whole words. What pattern can you use to actually match only words that start with ‘x’ and end with ‘y’?

Before you turn the page…

Think about what defines word boundaries.

There are a few ways you might approach this task. The easiest is to use the explicit “word boundary” special zero-width match pattern, spelled as \b in Python and many other regular expression engines.

>>> pat4 = re.compile(r'\bx[a-z]*y\b')
>>> re.findall(pat4, txt)
['xenomorphically', 'xylology', 'xerochasy']

Less easy ways to accomplish this include using lookahead and lookbehind to find non-matching characters that must “bracket” the actual match. For example:

>>> pat5 = r'(?<=^|(?<=[^a-z]))x[a-z]+y(?=$|[^a-z])')
>>> re.findall(pat5, txt)
['xenomorphically', 'xylology', 'xerochasy']

One trick here is that when we perform a lookbehind assertion, it must have a fixed width of the match. However, words in our list might either follow spaces or occur at the start of a line. So we need to create an alternation between the zero-width lookbehind and the one non-letter character lookbehind. For the lookahead element, it is enough to say it is either end-of-line ($) or is a non-letter ([^a-z]).