Charming Python #b13: The itertools Module

Functional programming in Python becomes lazy


David Mertz, Ph.D.
Idempotentate, Gnosis Software, Inc.
April, 2003

Python 2.2 introduced simple generators to the Python language, and reconceived standard loops in terms of underlying iterators. With Python 2.3, generators become standard (no need for __future__), and the new module itertools is introduced to work flexibly with iterators. Over time, I think more capabilities will be added to this module, but it is already quite interesting. The itertools module is essentially a set of combinatorial higher-order functions, but ones that work with lazy iterators rather than with finite lists. This installment explores the new module, and gives readers a sense of the new expressive power available with combinatorial iterators.

Understanding A New Concept

The idea of iterators was introduced into Python with version 2.2. Well, that is not quite true: hints of the idea were already present in the older function xrange(), and the file method .xreadlines(). Python 2.2 generalized the notion in much of its internal implementation, and made programming custom iterators much more straightforward by introducing the yield keyword (the presence of yield turns a function into a generator, which in turn returns an iterator).

The motivation behind iterators is twofold: (1) Working with data as sequences is very frequently the most straightforward approach; (2) A sequence that is processed in linear order often does not need to actually exist all at once.

The x*() premonitions provide clear examples of these principles. If you want to do something a billion times, your program will probably take a while to execute, but there is no general need for it to claim a lot of memory. Likewise, for many types of files, processing can be a line-by-line matter, and there is no need to store the whole file in memory. All sorts of other sequences can best be treated lazily too--they might depend on data arriving incrementally over a channel, or upon a computation that proceeds step-by-step.

Most of the time, an iterator is utilized within a for loop, just like a true sequence is. Iterators provide a .next() method that can be invoked explicitly, but 99% of the time, what you will see is along the lines of:

for x in iterator:
    do_something_with(x)

The loop is terminated when the behind-the-scenes call to iterator.next() raises a StopIteration exception. By the way, a true sequence can be turned into an iterator by calling iter(seq)--this will not save any memory, but it can be useful in the functions discussed below.
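
To make the protocol concrete, here is a minimal interactive session (the particular list is arbitrary) showing exactly what a for loop does behind the scenes:

>>> it = iter([10, 20, 30])
>>> it.next()
10
>>> it.next()
20
>>> it.next()
30
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration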

Python's Evolving Split Personality

There is something schizophrenic in Python's attitudes towards functional programming (FP). On the one hand, many Python developers disparage the traditional FP functions map(), filter(), and reduce()--usually recommending list comprehensions in their place. But the whole of the itertools module is composed of functions of the very same sort, merely operating on "lazy sequences" (iterators) rather than on completed sequences (lists, tuples). Furthermore, there is no syntax in Python 2.3 for "iterator comprehensions," which would seem to have the same motivation as list comprehensions.

I suspect Python will eventually grow some form of iterator comprehension; but it depends on finding a suitably natural syntax for them. In the meanwhile, we have a number of useful combinatorial functions in the itertools module. Broadly, what every one of these functions does is accept some parameters--usually including some basis iterators--and return a new iterator. For example, the functions ifilter(), imap(), and izip() are directly equivalent to the respective builtins that lack the initial i.
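
As a minimal illustration (the predicate and values here are arbitrary), the i-versions accept and return lazy iterators, where the builtins consume and produce whole lists:

>>> from itertools import ifilter, count
>>> filter(lambda n: n % 2, range(10))        # eager: builds the whole list
[1, 3, 5, 7, 9]
>>> odds = ifilter(lambda n: n % 2, count())  # lazy: works even on an unbounded source
>>> odds.next(), odds.next(), odds.next()
(1, 3, 5)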

Missing Equivalents

There is no ireduce() in itertools, although it might seem natural; a possible Python implementation is:

Sample implementation of ireduce()

def ireduce(func, iterable, init=None):
    if init is None:
        # No explicit seed: consume the first element as the seed
        iterable = iter(iterable)
        curr = iterable.next()
    else:
        curr = init
    # Yield every intermediate accumulation, not just the final value
    for x in iterable:
        curr = func(curr, x)
        yield curr

The use case for ireduce() is similar to that for reduce(). For example, suppose you wanted to sum a list of numbers contained in a large file, but stop when a condition is met. You could monitor the ongoing total with:

from operator import add
from itertools import *
nums = open('numbers')
# 'condition' is some predicate on the running total, e.g. lambda t: t < 100
for tot in takewhile(condition, ireduce(add, imap(int, nums))):
    print "total =", tot

A more real-world example is probably something like applying a stream of events to a stateful object, such as to a GUI widget. But even the simple example above shows the FP flavor of iterator combinators.

Basic Iterator Factories

All the functions in itertools can easily be implemented in pure Python, as generators. The main point of including the module in Python 2.3+ is to provide standard behaviors and names for some useful functions. While programmers could write their own versions, each one would create a slightly incompatible variation in practice. The other point, however, is to implement iterator combinators in efficient C code. Using itertools functions will be a bit faster than writing your own combinators. The standard documentation shows equivalent pure-Python implementations for each itertools function, so there is no need to repeat those in this article.

The functions in itertools are basic enough--and distinctly named enough--that it probably makes sense to import all the names from the module. The function enumerate(), for example, might sensibly live in itertools, but is instead a builtin function in Python 2.3+. Notably, you can easily express enumerate() in terms of itertools functions:

Reimplementation of enumerate builtin

from itertools import *
enumerate = lambda iterable: izip(count(), iterable)
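
A quick sanity check that the lambda version behaves like the builtin:

>>> list(enumerate('abc'))
[(0, 'a'), (1, 'b'), (2, 'c')]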

Let us look first at the few itertools functions that do not use other iterators as a basis, but simply create iterators "from scratch." times() returns an iterator that yields an identical object multiple times; in itself, this capability is moderately useful, but it is really nice as a substitute for the superfluous use of xrange() and an index variable simply to repeat an action. I.e., rather than:

for i in xrange(1000):
    do_something()

You can now use the more neutral:

for _ in times(1000):
    do_something()

If no second argument is given to times(), it simply yields None repeatedly. The function repeat() is similar to times(), but yields the same object unboundedly. This iterator is useful either where a loop has an independent break condition, or within combinators like izip() and imap().

The function count() is a bit like a cross between repeat() and xrange(). count() returns consecutive integers unboundedly (starting at an optional argument). However, given that count() does not currently overflow to longs correctly, you might as well still use xrange(n, sys.maxint); it is not literally unbounded, but for most purposes it amounts to the same thing. Like repeat(), count() is particularly useful inside other iterator combinators.
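
For instance (a contrived but runnable sketch), repeat() and count() serve nicely as "extra" arguments inside other combinators; a finite iterable bounds the otherwise unbounded ones:

>>> from itertools import izip, imap, repeat, count
>>> list(izip(count(1), 'abc'))
[(1, 'a'), (2, 'b'), (3, 'c')]
>>> list(imap(pow, [1, 2, 3, 4], repeat(2)))   # a constant second argument
[1, 4, 9, 16]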

Combinatorial Functions

A few of the real combinatorial functions in itertools have been mentioned in passing. ifilter(), izip(), and imap() act just as you would expect from their corresponding sequence functions. ifilterfalse() is a convenience so that you do not need to negate a predicate function in a lambda or a def (and it saves significant function call overhead). But functionally, you could define ifilterfalse() (approximately, ignoring the None predicate) as:

Approximate implementation of ifilterfalse()

def ifilterfalse(predicate, iterable):
    # Yield only those elements for which the predicate is false
    return ifilter(lambda x: not predicate(x), iterable)

The functions dropwhile() and takewhile() partition an iterator by a predicate. The former ignores yielded elements until a predicate is fulfilled; the latter yields elements while a predicate holds. dropwhile() skips an indefinite number of initial elements of an iterator, so it might not begin iterating until after a delay. takewhile() starts right away, but terminates the iterator as soon as the passed-in predicate becomes false.
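
A small session (the cutoff value is arbitrary) makes the partition clear--note that dropwhile() only tests elements until the predicate first fails, and yields everything afterwards:

>>> from itertools import takewhile, dropwhile
>>> nums = [1, 2, 3, 33, 4, 5]
>>> list(takewhile(lambda n: n < 10, nums))
[1, 2, 3]
>>> list(dropwhile(lambda n: n < 10, nums))
[33, 4, 5]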

The function islice() is basically just the iterator version of list slicing. You can specify a start, stop, and step, just as with regular slices. If a start is given, a number of elements are dropped until the passed-in iterator reaches the element of interest. This is another case where I think refinement to Python is possible--the best thing would be for iterators to simply recognize slices, just as lists do (as a synonym for what islice() does).
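
For example (using count() as a stand-in for any iterator), islice() accepts the same start/stop/step values that a list slice does:

>>> from itertools import islice, count
>>> range(100)[10:20:3]
[10, 13, 16, 19]
>>> list(islice(count(), 10, 20, 3))    # the same selection, lazily
[10, 13, 16, 19]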

One final function, starmap(), is a slight variation on imap(). If the function passed in as an argument takes multiple arguments, the iterable passed should yield tuples of the right size. Basically this is the same as imap() with several iterables passed in, only with the collection of iterables previously combined with izip().
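
A brief illustration of the equivalence, with pow() as an arbitrary two-argument function:

>>> from itertools import imap, izip, starmap
>>> list(imap(pow, [2, 3, 10], [5, 2, 3]))
[32, 9, 1000]
>>> list(starmap(pow, izip([2, 3, 10], [5, 2, 3])))
[32, 9, 1000]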

Beyond The Basics

The functions included in itertools make for a good start. If nothing else, they encourage Python programmers to become more comfortable with utilizing and combining iterators. In a general way, widespread use of iterators is clearly important to the future of Python. But past what is included, there are some others that I would recommend for future updates to the module. Readers can easily use these immediately--of course, if they are later included, the names or interface details could differ.

One category that would seem to be of general use is functions that take multiple iterables as arguments, then yield individual elements from each iterable. In contrast to this, izip() yields tuples of elements, and imap() yields values computed based on the basis elements. The two obvious arrangements, to my mind, are chain() and weave(). The first is similar in effect to concatenation of sequences (but appropriately lazy). That is, where for plain sequences you might use, e.g.:

for x in ('a','b','c') + (1, 2, 3):
    do_something(x)

For general iterables, you could use:

for x in chain(iter1, iter2, iter3):
    do_something(x)

A Python implementation is:

Sample implementation of chain()

def chain(*iterables):
    for iterable in iterables:
        for item in iterable:
            yield item

With iterables, you might also combine several by interspersing them. There is no builtin syntax to do the same with sequences, but weave() itself works fine for completed sequences also. A possible implementation is below (Magnus Lie Hetland proposed a similar function on comp.lang.python):

Sample implementation of weave()

def weave(*iterables):
    "Intersperse several iterables, until all are exhausted"
    iterables = map(iter, iterables)
    while iterables:
        for i, it in enumerate(iterables):
            try:
                yield it.next()
            except StopIteration:
                # Drop the exhausted iterator; because the list shrinks
                # under enumerate(), the following iterator is skipped
                # for the rest of this round (visible in the sample below)
                del iterables[i]

Let me illustrate the behavior of weave() since it might not be immediately obvious from the implementation:

>>> for x in weave('abc', xrange(4), [10,11,12,13,14]):
...     print x,
...
a 0 10 b 1 11 c 2 12 13 3 14

Even after some iterators are exhausted, the remaining ones continue to yield values, until everything available is yielded at some point.

I will propose just one more possible itertools function. This one owes a lot to functional programming ways of conceiving problems. icompose() has a certain symmetry with the function ireduce() suggested above. But where ireduce() feeds a (lazy) sequence of values to a function, and yields each result, icompose() applies a sequence of functions to the return value of each predecessor function. A likely use of ireduce() is to feed a sequence of events to a long-lived object. icompose() might instead repeatedly pass an object to a mutator function that returns a new object. The first is a fairly traditional OOP way of thinking about events, while the second is more of an FP way of thinking.

Here is a possible implementation of icompose():

def icompose(functions, initval):
    currval = initval
    for f in functions:
        currval = f(currval)
        yield currval
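
A short demonstration (the particular pipeline of functions is arbitrary):

>>> for val in icompose([lambda n: n + 1, lambda n: n * 2, lambda n: n ** 2], 5):
...     print val,
...
6 12 144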

Conclusion

Iterators--conceived as lazy sequences--are a powerful concept that opens new styles of Python programming. There is a subtle difference, though, between thinking of an iterator as just a data source, and thinking of it in a sequential style. Neither way of thinking is more true per se, but the latter opens the path towards a combinatorial shorthand for manipulating programmatic events. The combinatorial functions in itertools--and especially some it might grow, like those I suggest--come close to a declarative style of programming. To my mind, these declarative styles are generally less error-prone, and more powerful.

Resources

Charming Python: Iterators and simple generators

http://www-106.ibm.com/developerworks/linux/library/l-pycon

Charming Python: Implementing "weightless threads" with Python

http://www-106.ibm.com/developerworks/linux/library/l-pythrd.html

Charming Python: Generator-based state machines

http://www-106.ibm.com/developerworks/linux/library/l-pygen.html

About The Author

While David Mertz also likes hubris and impatience, this installment is about laziness. David may be reached at [email protected]; his life is pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future columns are welcome.