
Words and Sequences

In the previous problem, we identified words that started with ‘x’ and ended with ‘y’. You may have noticed, however, that we had quietly assumed every word in the text started with ‘x’. Perhaps your solution was clever enough not to fall into the trap this puzzle exposes: not all words will necessarily start with ‘x’. Here is what happens when we apply our previous regex to such text.

>>> import re
>>> txt = """
expurgatory xylometer xenotime xenomorphically exquisitely
xylology xiphosurans xenophile oxytocin xylogen
xeriscapes xerochasy inexplicably exabyte inexpressibly
extremity xiphophyllous xylographic complexly vexillology
xanthenes xylenol xylol xylenes coextensively
"""
>>> pat3 = re.compile(r'x[a-z]*y')
>>> re.findall(pat3, txt)
['xpurgatory', 'xy', 'xenomorphically', 'xquisitely',
'xylology', 'xy', 'xy', 'xerochasy', 'xplicably', 'xaby',
'xpressibly', 'xtremity', 'xiphophy', 'xy', 'xly',
'xillology', 'xy', 'xy', 'xy', 'xtensively']

As you can see, we matched a number of substrings within words, not only whole words. What pattern can you use to actually match only words that start with ‘x’ and end with ‘y’?

Before you turn the page…

Think about what defines word boundaries.

There are a few ways you might approach this task. The easiest is to use the explicit “word boundary” pattern, a special zero-width match spelled \b in Python and many other regular expression engines.

>>> pat4 = re.compile(r'\bx[a-z]*y\b')
>>> re.findall(pat4, txt)
['xenomorphically', 'xylology', 'xerochasy']
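
To see what \b contributes, it can help to compare match positions with and without it. The following is only a small sketch; the string `sample` and the names `unanchored` and `anchored` are made up for illustration and do not come from the puzzle text.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "xylology oxytocin complexly xy"

unanchored = re.compile(r'x[a-z]*y')
anchored = re.compile(r'\bx[a-z]*y\b')

# Without boundaries, a match may begin or end in the middle of a word.
loose = [(m.group(), m.span()) for m in unanchored.finditer(sample)]

# \b consumes no characters; it only asserts that one side of the
# position is a word character and the other side is not.
strict = [(m.group(), m.span()) for m in anchored.finditer(sample)]

print(loose)   # [('xylology', (0, 8)), ('xy', (10, 12)), ('xly', (24, 27)), ('xy', (28, 30))]
print(strict)  # [('xylology', (0, 8)), ('xy', (28, 30))]
```

Note that \b treats digits and underscores as word characters too, so it is not exactly the same test as “not next to a lowercase letter”; for text like ours, made only of lowercase words and whitespace, the difference does not matter.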

Less easy ways to accomplish this include using lookahead and lookbehind to find non-matching characters that must “bracket” the actual match. For example:

>>> pat5 = re.compile(r'(?<=^|(?<=[^a-z]))x[a-z]+y(?=$|[^a-z])')
>>> re.findall(pat5, txt)
['xenomorphically', 'xylology', 'xerochasy']

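One trick here is that a lookbehind assertion in Python must match a fixed width. However, words in our list might either follow a non-letter character or occur at the very start of the text. (A word at the start of a later line follows a newline, which is itself a non-letter.) So we need an alternation between the zero-width start anchor (^) and a nested lookbehind for one non-letter character; because the nested lookbehind is itself zero-width, both alternatives have the same width. The lookahead element has no such restriction, so it is enough to say that it is either the end of the text ($) or a non-letter ([^a-z]).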
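
As an aside, the fixed-width restriction can be sidestepped by using negative lookarounds instead of positive ones. This is an alternative sketch, not the solution given above; the name `pat6` is introduced here only for illustration. Rather than asking “is there a non-letter (or nothing) here?”, it asks “is there no letter here?”, which holds at the edges of the text as well.

```python
import re

txt = """
expurgatory xylometer xenotime xenomorphically exquisitely
xylology xiphosurans xenophile oxytocin xylogen
xeriscapes xerochasy inexplicably exabyte inexpressibly
extremity xiphophyllous xylographic complexly vexillology
xanthenes xylenol xylol xylenes coextensively
"""

# Hypothetical pattern: not preceded by a letter, not followed by a letter.
pat6 = re.compile(r'(?<![a-z])x[a-z]*y(?![a-z])')
found = pat6.findall(txt)
print(found)  # ['xenomorphically', 'xylology', 'xerochasy']

# A lookbehind whose alternatives differ in width (0 for ^, 1 for [^a-z])
# is rejected by Python's re module when the pattern is compiled.
try:
    re.compile(r'(?<=^|[^a-z])x[a-z]*y')
    rejected = False
except re.error as err:
    rejected = True
    print("rejected:", err)
```

The negative form needs no alternation at all, since “no letter before” and “no letter after” are each a single fixed-width test.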