Python Iterators

Part 1 : The Basics

Why study iterators?

The design of the Python language is closely tied to the ideas of iterators and iteration.

Other authors have noticed this too.

“The use of iterators pervades and unifies Python”

9. Classes — Python 2.7.13 documentation. (2017). Docs.python.org. Retrieved 6 July 2017, from https://docs.python.org/2/tutorial/classes.html#iterators

“Iterators are the “secret sauce” of Python 3. They’re everywhere, underlying everything, always just out of sight. Comprehensions are just a simple form of iterators. Generators are just a simple form of iterators. A function that yields values is a nice, compact way of building an iterator without building an iterator.”

Classes & Iterators - Dive Into Python 3. (2017). Diveintopython3.net. Retrieved 6 July 2017, from http://www.diveintopython3.net/iterators.html

Iterable types

Python has many standard data types that are iterable

And you, as the programmer, can make any of your classes iterable by defining a few special methods. We’ll show you how to do this later.

Definitions

First, what do we mean by the terms iterable and iterator?

The official Python Glossary says

iterable

An object capable of returning its members one at a time. Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict, file objects, and objects of any classes you define with an __iter__() or __getitem__() method. Iterables can be used in a for loop and in many other places where a sequence is needed (zip(), map(), …). When an iterable object is passed as an argument to the built-in function iter(), it returns an iterator for the object. This iterator is good for one pass over the set of values. When using iterables, it is usually not necessary to call iter() or deal with iterator objects yourself. The for statement does that automatically for you, creating a temporary unnamed variable to hold the iterator for the duration of the loop. See also iterator, sequence, and generator.

Docs.python.org. (2017). Glossary — Python 3.6.2rc1 documentation. [online] Available at: https://docs.python.org/3/glossary.html#term-iterator [Accessed 6 Jul. 2017].

iterator

An object representing a stream of data. Repeated calls to the iterator’s __next__() method (or passing it to the built-in function next()) return successive items in the stream. When no more data are available a StopIteration exception is raised instead. At this point, the iterator object is exhausted and any further calls to its __next__() method just raise StopIteration again. Iterators are required to have an __iter__() method that returns the iterator object itself so every iterator is also iterable and may be used in most places where other iterables are accepted. One notable exception is code which attempts multiple iteration passes. A container object (such as a list) produces a fresh new iterator each time you pass it to the iter() function or use it in a for loop. Attempting this with an iterator will just return the same exhausted iterator object used in the previous iteration pass, making it appear like an empty container.

Docs.python.org. (2017). Glossary — Python 3.6.2rc1 documentation. [online] Available at: https://docs.python.org/3/glossary.html#term-iterator [Accessed 6 Jul. 2017].

Explicit vs. implicit iteration

Your programs can perform iteration in one of two ways

Explicit iteration with loop statements

Let’s start by looking at Python’s two loop statements

These let us iterate explicitly.

Iterating with a while statement

There are 2 ways to iterate with a while loop

  1. using a loop counter variable
  2. using the iter and next functions

Iterating with a loop counter variable

This style of iteration is commonly found in languages like C or Java. It works like this

index = -1
while True:
    try:
        index += 1
        value = iterable[index]
    except IndexError:
        break
    else:
        # code block

This method only works if the iterable implements the subscript [ ] operator. Some iterables, like sets, do not.

The iter and next functions.

You might not be familiar with these two functions. Here’s what they do

iter(o[, sentinel])
    Return an iterator object.
next(iterator[, default])
    Retrieve the next item from the 
    iterator by calling its next() 
    method. If default is given, it is 
    returned if the iterator is 
    exhausted, otherwise StopIteration is 
    raised.

As you can see, they are designed to be used together.

Using iter and next

Here’s a while loop that works for any iterable object. You use iter to create an iterator object, then use the next function to read values from that iterator. The iterator raises a StopIteration exception when it is exhausted.

it = iter(iterable)
while True:
    try:
        value = next(it)
    except StopIteration:
        break
    else:
        # code block

Iterating with a for statement

The for statement greatly simplifies iteration by hiding all the try/except machinery. It’s still there but you don’t have to think about it.

Behind the scenes, the for statement calls iter() on the container object. The function returns an iterator object that defines the method next() which accesses elements in the container one at a time. When there are no more elements, next() raises a StopIteration exception which tells the for loop to terminate.

Classes — Python 2.7.13 documentation. (2017). Docs.python.org. Retrieved 6 July 2017, from https://docs.python.org/2/tutorial/classes.html#iterators

It’s important to remember that for loops do not count, they iterate.

Syntax

for value in iterable:
    # code block

A counting for loop

Use the enumerate function when you need to know how many times you’ve gone through a loop.

NOTE: by default, enumerate starts counting at zero. This is consistent with list and tuple indexes which also start at zero.

for count, value in enumerate(iterable):
    # code block

Summary

Python’s iter and next functions allow you to iterate over collections of data with a fine level of control. But you almost never need to use them. The simple for statement usually does what you need.

Part 2 : Generators

Generator expressions

Generator expressions were described in PEP 289 and accepted into Python version 2.4.

Syntax

Generator expressions mimic the syntax of list comprehensions. This is by design.

(output-expression for variable in input-expression)

The generator expression can include an optional if clause. The output-expression is evaluated only when the condition is True.

(output-expression for variable in input-expression if condition)

Lazy evaluation

Generator expressions are lazy. Values are produced one at a time when they are needed. Generating a million values takes no more memory than generating one or two.

Limitations

There are some statements that cannot appear inside a generator expression

Continue statements

There is no way to skip over an iteration. However, the output-expression is not evaluated when the if clause is False.

Break statements

A generator expression cannot contain a break statement. But breaking out of an enclosing for statement achieves the same effect

for value in generator-expression:
    if condition:
        break

Generator functions

Part 3 : itertools

The itertools module provides two types of resources

  1. a collection of functions that you can import and use directly
  2. a set of recipes that you can copy and paste into your programs

The itertools functions

The itertools recipes

The more-itertools package

Author: Erik Rose
Install from: https://pypi.python.org/pypi/more-itertools

This package contains the recipes from the itertools module plus some additional functions and recipes.

I strongly suggest installing this package so you have easy access to the itertools recipes.

Appendixes

PEP 289

PEP: 289
Title: Generator Expressions
Author: python@rcn.com (Raymond Hettinger)
Status: Final
Type: Standards Track
Created: 30-Jan-2002
Python-Version: 2.4
Post-History: 22-Oct-2003

Abstract

This PEP introduces generator expressions as a high performance,
memory efficient generalization of list comprehensions [1] and
generators [2]
.

Rationale

Experience with list comprehensions has shown their widespread
utility throughout Python. However, many of the use cases do
not need to have a full list created in memory. Instead, they
only need to iterate over the elements one at a time.

For instance, the following summation code will build a full list of
squares in memory, iterate over those values, and, when the reference
is no longer needed, delete the list::

sum([x*x for x in range(10)])

Memory is conserved by using a generator expression instead::

sum(x*x for x in range(10))

Similar benefits are conferred on constructors for container objects::

s = Set(word  for line in page  for word in line.split())
d = dict( (k, func(k)) for k in keylist)

Generator expressions are especially useful with functions like sum(),
min(), and max() that reduce an iterable input to a single value::

max(len(line)  for line in file  if line.strip())

Generator expressions also address some examples of functionals coded
with lambda::

reduce(lambda s, a: s + a.myattr, data, 0)
reduce(lambda s, a: s + a[3], data, 0)

These simplify to::

sum(a.myattr for a in data)
sum(a[3] for a in data)

List comprehensions greatly reduced the need for filter() and map().
Likewise, generator expressions are expected to minimize the need
for itertools.ifilter() and itertools.imap(). In contrast, the
utility of other itertools will be enhanced by generator expressions::

dotproduct = sum(x*y for x,y in itertools.izip(x_vector, y_vector))

Having a syntax similar to list comprehensions also makes it easy to
convert existing code into a generator expression when scaling up
application.

Early timings showed that generators had a significant performance
advantage over list comprehensions. However, the latter were highly
optimized for Py2.4 and now the performance is roughly comparable
for small to mid-sized data sets. As the data volumes grow larger,
generator expressions tend to perform better because they do not
exhaust cache memory and they allow Python to re-use objects between
iterations.

BDFL Pronouncements

This PEP is ACCEPTED for Py2.4.

The Details

(None of this is exact enough in the eye of a reader from Mars, but I
hope the examples convey the intention well enough for a discussion in
c.l.py. The Python Reference Manual should contain a 100% exact
semantic and syntactic specification.)

  1. The semantics of a generator expression are equivalent to creating
    an anonymous generator function and calling it. For example::

    g = (x**2 for x in range(10))
    print g.next()

    is equivalent to::

    def __gen(exp):
        for x in exp:
            yield x**2
    g = __gen(iter(range(10)))
    print g.next()

    Only the outermost for-expression is evaluated immediately, the other
    expressions are deferred until the generator is run::

   g = (tgtexp  for var1 in exp1 if exp2 for var2 in exp3 if exp4)

is equivalent to::

def __gen(bound_exp):
    for var1 in bound_exp:
        if exp2:
            for var2 in exp3:
                if exp4:
                    yield tgtexp
g = __gen(iter(exp1))
del __gen
  1. The syntax requires that a generator expression always needs to be
    directly inside a set of parentheses and cannot have a comma on
    either side. With reference to the file Grammar/Grammar in CVS,
    two rules change:

    a) The rule::

      atom: '(' [testlist] ')'

    changes to::

      atom: '(' [testlist_gexp] ')'

    where testlist_gexp is almost the same as listmaker, but only
    allows a single test after ‘for’ … ‘in’::

      testlist_gexp: test ( gen_for | (',' test)* [','] )

    b) The rule for arglist needs similar changes.

    This means that you can write::

    sum(x**2 for x in range(10))

    but you would have to write::

    reduce(operator.add, (x**2 for x in range(10)))

    and also::

    g = (x**2 for x in range(10))

    i.e. if a function call has a single positional argument, it can be
    a generator expression without extra parentheses, but in all other
    cases you have to parenthesize it.

    The exact details were checked in to Grammar/Grammar version 1.49.

  2. The loop variable (if it is a simple variable or a tuple of simple
    variables) is not exposed to the surrounding function. This
    facilitates the implementation and makes typical use cases more
    reliable. In some future version of Python, list comprehensions
    will also hide the induction variable from the surrounding code
    (and, in Py2.4, warnings will be issued for code accessing the
    induction variable).

    For example::

    x = "hello"
    y = list(x for x in "abc")
    print x    # prints "hello", not "c"
  3. List comprehensions will remain unchanged. For example::

    [x for x in S]    # This is a list comprehension.
    [(x for x in S)]  # This is a list containing one generator
                      # expression.

    Unfortunately, there is currently a slight syntactic difference.
    The expression::

    [x for x in 1, 2, 3]

    is legal, meaning::

    [x for x in (1, 2, 3)]

    But generator expressions will not allow the former version::

    (x for x in 1, 2, 3)

    is illegal.

    The former list comprehension syntax will become illegal in Python
    3.0, and should be deprecated in Python 2.4 and beyond.

    List comprehensions also “leak” their loop variable into the
    surrounding scope. This will also change in Python 3.0, so that
    the semantic definition of a list comprehension in Python 3.0 will
    be equivalent to list(). Python 2.4 and
    beyond should issue a deprecation warning if a list comprehension’s
    loop variable has the same name as a variable used in the
    immediately surrounding scope.

Early Binding versus Late Binding

After much discussion, it was decided that the first (outermost)
for-expression should be evaluated immediately and that the remaining
expressions be evaluated when the generator is executed.

Asked to summarize the reasoning for binding the first expression,
Guido offered [5]_::

Consider sum(x for x in foo()). Now suppose there's a bug in foo()
that raises an exception, and a bug in sum() that raises an
exception before it starts iterating over its argument. Which
exception would you expect to see? I'd be surprised if the one in
sum() was raised rather the one in foo(), since the call to foo()
is part of the argument to sum(), and I expect arguments to be
processed before the function is called.

OTOH, in sum(bar(x) for x in foo()), where sum() and foo()
are bugfree, but bar() raises an exception, we have no choice but
to delay the call to bar() until sum() starts iterating -- that's
part of the contract of generators. (They do nothing until their
next() method is first called.)

Various use cases were proposed for binding all free variables when
the generator is defined. And some proponents felt that the resulting
expressions would be easier to understand and debug if bound immediately.

However, Python takes a late binding approach to lambda expressions and
has no precedent for automatic, early binding. It was felt that
introducing a new paradigm would unnecessarily introduce complexity.

After exploring many possibilities, a consensus emerged that binding
issues were hard to understand and that users should be strongly
encouraged to use generator expressions inside functions that consume
their arguments immediately. For more complex applications, full
generator definitions are always superior in terms of being obvious
about scope, lifetime, and binding [6]_.

Reduction Functions

The utility of generator expressions is greatly enhanced when combined
with reduction functions like sum(), min(), and max(). The heapq
module in Python 2.4 includes two new reduction functions: nlargest()
and nsmallest(). Both work well with generator expressions and keep
no more than n items in memory at one time.

Acknowledgements

References

.. [1] PEP 202 List Comprehensions
http://www.python.org/dev/peps/pep-0202/

.. [2] PEP 255 Simple Generators
http://www.python.org/dev/peps/pep-0255/

.. [3] Peter Norvig’s Accumulation Display Proposal
http://www.norvig.com/pyacc.html

.. [4] Jeff Epler had worked up a patch demonstrating
the previously proposed bracket and yield syntax
http://python.org/sf/795947

.. [5] Discussion over the relative merits of early versus late binding
https://mail.python.org/pipermail/python-dev/2004-April/044555.html

.. [6] Patch discussion and alternative patches on Source Forge
http://www.python.org/sf/872326

Copyright

This document has been placed in the public domain.