
Generator Patterns and Scalable Composability

Here’s a little generator function:

Example
def matching_lines_from_file(path, pattern):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')

matching_lines_from_file() demonstrates several important practices for modern Python and is worth studying. It does simple substring matching on lines of a text file, yielding lines containing that substring.

The first line opens a read-only file object, called handle. If you haven’t been opening your file objects using with statements, start today. The main benefit is that once the with block is exited, the file object is automatically closed – even if an exception causes a premature exit. It’s similar to:

Ex.
try:
    handle = open(path)
    # read from handle
finally:
    handle.close()

(The try/finally is explained in the exceptions chapter.) Next we have for line in handle. This useful idiom, which not many people seem to know about, is a special case for text files. On each iteration of the for loop, a new line of text is read from the underlying text file and placed in the line variable.

Sometimes people foolishly take another approach, which I have to warn you about:

Ex.
# Don't do this!!
for line in handle.readlines():

.readlines() (plural) reads in the entire file, parses it into lines, and returns a list of strings – one string per line. By now, you realize how this destroys the generator function’s scalability.

Another approach you will sometimes see, which is scalable, is to use the file object’s .readline() method (singular), which manually returns lines one at a time:

Ex.
# .readline() reads and returns a single line of text,
# or returns the empty string at end-of-file.
line = handle.readline()
while line != '':
    # do something with the line
    # ...
    # At the end of the loop, read the next line.
    line = handle.readline()

But simply writing for line in handle is clearer and easier.

After that, it’s straightforward: matching lines have any trailing '\n' character stripped, and are yielded to the consumer. When writing generator functions, you want to ask yourself “what is the maximum memory footprint of this function, and how can I minimize it?” You can think of scalability as inversely proportional to this footprint. For matching_lines_from_file(), it will be about equal to the size of the longest line in the text file. So it’s appropriate for the typical human-readable text file, whose lines are small.

(It’s also possible to point it at, say, a ten-terabyte text file consisting of exactly one line. If you expect something like that, you’ll need a different approach.)
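What would that different approach look like? One possibility – a minimal sketch of my own, not part of the original example – is to read fixed-size chunks instead of lines, so the memory footprint is bounded by the chunk size rather than the line length. The name chunks_from_file() is hypothetical:

Ex.
# A sketch of a hypothetical chunk-based reader. Memory use is
# bounded by chunk_size, no matter how long the lines are.
def chunks_from_file(path, chunk_size=64 * 1024):
    with open(path) as handle:
        while True:
            chunk = handle.read(chunk_size)
            if chunk == '':
                # read() returns the empty string at end-of-file.
                return
            yield chunk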

Now, suppose a log file contains lines like this:

Ex.
WARNING: Disk usage exceeding 85%
DEBUG: User 'tinytim' upgraded to Pro version
INFO: Sent email campaign, completed normally
WARNING: Almost out of beer

… and you exercise matching_lines_from_file() like so:

Ex.
for line in matching_lines_from_file("log.txt", "WARNING:"):
    print(line)

That yields these records:

Ex.
WARNING: Disk usage exceeding 85%
WARNING: Almost out of beer

Suppose your application needs that data in dict form:

Ex.
{"level": "WARNING", "message": "Disk usage exceeding 85%"}
{"level": "DEBUG", "message":
 "User 'tinytim' upgraded to Pro version"}

We want to scalably transform the records from one form to another – from strings (lines of the log file), to Python dicts. So let’s make a new generator function to connect them:

Ex.
def parse_log_records(lines):
    for line in lines:
        level, message = line.split(": ", 1)
        yield {"level": level, "message": message}

Now we can connect the two:

Ex.
# log_lines is a generator object
log_lines = matching_lines_from_file("log.txt", "WARNING:")
for record in parse_log_records(log_lines):
    # record is a dict
    print(record)

Of course, parse_log_records() can be used on its own:

Ex.
with open("log.txt") as handle:
    for record in parse_log_records(handle):
        print(record)

matching_lines_from_file() and parse_log_records() are like building blocks. Properly designed, they can be used to build different data processing streams. I call this scalable composability. It goes beyond designing composable functions and types: ask yourself how you can make the components scalable, so that whatever is assembled out of them is scalable too.

Let’s discuss a particular design point. Both matching_lines_from_file() and parse_log_records() produce an iterator (or, more specifically, a generator object). But they have a discrepancy on the input side: parse_log_records() accepts an iterator, while matching_lines_from_file() requires a path to a file to read from. This means matching_lines_from_file() combines two operations: reading lines of text from a file, then filtering those lines based on some criterion.

Combining functions like this is often what you want in realistic code. But when designing components to flexibly compose together, inconsistent interfaces like this can be limiting. Let’s break up the services in matching_lines_from_file() into two generator functions:

Ex.
def lines_from_file(path):
    with open(path) as handle:
        for line in handle:
            yield line.rstrip('\n')

def matching_lines(lines, pattern):
    for line in lines:
        if pattern in line:
            yield line

You can compose these like so:

Ex.
lines = lines_from_file("log.txt")
matching = matching_lines(lines, 'WARNING:')
for line in matching:
    print(line)

Or even redefine matching_lines_from_file() in terms of them:

Ex.
def matching_lines_from_file(path, pattern):
    lines = lines_from_file(path)
    matching = matching_lines(lines, pattern)
    for line in matching:
        yield line

Conceptually, this factored-out matching_lines does a filtering operation; all lines are read in, and a subset are yielded. parse_log_records() is different. One input record (a str line) maps to exactly one output record (a dict). Mathematically, it’s a mapping operation. Think of it as a transformer or adapter. lines_from_file() is in a third category; instead of taking a stream as input, it takes a completely different parameter. Since it still returns an iterator of records, think of it as a source. And any real program will eventually want to do something with that stream, consuming it without producing another iterator; call that a sink.
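To make that last category concrete, here is a minimal sketch of a sink – a hypothetical count_records() function of my own, not from the examples above – along with a pipeline wiring a source, a filter, a mapper, and the sink together:

Ex.
# A sketch of a sink: it consumes the stream and produces a single
# value, rather than another iterator.
def count_records(records):
    total = 0
    for _ in records:
        total += 1
    return total

# Source -> filter -> mapping -> sink:
lines = lines_from_file("log.txt")            # source
warnings = matching_lines(lines, "WARNING:")  # filter
records = parse_log_records(warnings)         # mapping
print(count_records(records))                 # sink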

You need all these pieces to make a working program. When designing a chainable set of generator functions like this – or, even better, a toolkit for constructing internal data pipelines – ask yourself whether each component is a source or a sink, whether it does filtering or mapping, or whether it’s some combination of these. Just asking yourself this question leads to a more usable, readable, and maintainable codebase. And if you’re making a library which others will use, you’re more likely to end up with a toolkit so powerfully flexible, people use it to build programs you never imagined.

I want you to notice something about parse_log_records(). As I said, it fits in the “mapping” category. And notice its mapping is one-to-one: one line of text becomes one dictionary. In other words, each record in the input – a str – becomes exactly one record in the output – a dict.

That isn’t always the case. Sometimes, your generator function needs to consume several input records to create one output record. Or the opposite: one input record yields several output records.

Here’s an example of the latter. Imagine a text file containing the lines of a poem:

Ex.
all night our room was outer-walled with rain
drops fell and flattened on the tin roof
and rang like little disks of metal

Let’s create a generator function, words_in_text(), producing the words one at a time. First approach:

Ex.
# lines is an iterator of text file lines,
# e.g. returned by lines_from_file()
def words_in_text(lines):
    for line in lines:
        for word in line.split():
            yield word

This generator function takes a fan-out approach. No input records are dropped, which means it doesn’t do any filtering; it’s still purely in the “mapping” category of generator functions. But the mapping isn’t one-to-one. Rather, one input record produces one or more output records. So, when you run the following code:

Ex.
poem_lines = lines_from_file("poem.txt")
poem_words = words_in_text(poem_lines)
for word in poem_words:
    print(word)

… it produces this output:

Ex.
all
night
our
room
was
outer-walled

That first input record – “all night our room was outer-walled with rain” – yields eight words (output records). Ignoring any blank lines in the poem, every line will produce at least one word – probably several.

The idea of fanning out is interesting, but simple enough. It’s more complex when we go in the opposite direction: fanning in. That means the generator function consumes more than one input record to produce each output record. Doing this successfully requires an awareness of the input’s structure, and you’ll typically need to encode some simple parsing logic.

Imagine a text file containing residential house sale data. Each record is a set of key-value pairs, one pair per line, with records separated by blank lines:

Ex.
address: 1423 99th Ave
square_feet: 1705
price_usd: 340210

address: 24257 Pueblo Dr
square_feet: 2305
price_usd: 170210

address: 127 Cochran
square_feet: 2068
price_usd: 320500

To read this data into a form usable in our code, what we want is a generator function – let’s name it house_records() – which accepts a sequence of strings (lines) and parses them into convenient dictionaries:

Ex.
>>> lines_of_house_data = lines_from_file("housedata.txt")
>>> houses = house_records(lines_of_house_data)
>>> # Fetch the first record.
... house = next(houses)
>>> house['address']
'1423 99th Ave'
>>> house = next(houses)
>>> house['address']
'24257 Pueblo Dr'

How would you create this? If practical, pause here, open up a code editor, and see if you can implement it.

Okay, time’s up. Here is one approach:

Ex.
def house_records(lines):
    record = {}
    for line in lines:
        if line == '':
            yield record
            record = {}
            continue
        key, value = line.split(': ', 1)
        record[key] = value
    yield record

Notice where the yield keywords appear. The last two lines of the for loop read individual key-value pairs. Starting with an empty record dictionary, it’s populated with data until lines produces an empty line. That signals the current record is complete, so it’s yielded, and a new record dictionary is created. The end of the very last record in housedata.txt is signaled not by an empty line, but by the end of the file; that’s why we need the final yield statement.
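Because house_records() accepts any iterable of lines – not just lines from a file – you can see that final yield in action by feeding it a small list directly. A quick sketch, reusing a couple of records from housedata.txt:

Ex.
# The last record has no blank line after it, so only the final
# yield statement (after the for loop) emits it.
sample_lines = [
    'address: 1423 99th Ave',
    'square_feet: 1705',
    'price_usd: 340210',
    '',
    'address: 127 Cochran',
    'square_feet: 2068',
    'price_usd: 320500',
]
for house in house_records(sample_lines):
    print(house)
# {'address': '1423 99th Ave', 'square_feet': '1705', 'price_usd': '340210'}
# {'address': '127 Cochran', 'square_feet': '2068', 'price_usd': '320500'}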

As defined, house_records() is a bit clunky if we’re normally reading from a text file. It makes sense to define a new generator function accepting just the path to the file:

Ex.
def house_records_from_file(path):
    lines = lines_from_file(path)
    for house in house_records(lines):
        yield house

# Then in your program:
for house in house_records_from_file("housedata.txt"):
    print(house["address"])

You may have noticed many of these examples have a bit of boilerplate, when one generator function internally calls another. The last two lines of house_records_from_file say:

Ex.
for house in house_records(lines):
    yield house

Python 3 provides a shortcut which lets you accomplish this in one line, with the yield from statement:

Ex.
def house_records_from_file(path):
    lines = lines_from_file(path)
    yield from house_records(lines)

Even though “yield from” is two words, semantically it’s like a single keyword, and distinct from yield. The yield from statement is used specifically in generator functions, when they yield values directly from another generator object (or, equivalently, by calling another generator function). Using it often simplifies your code, as you see in house_records_from_file(). Going back a bit, here’s how it works with matching_lines_from_file():

Ex.
def matching_lines_from_file(path, pattern):
    lines = lines_from_file(path)
    yield from matching_lines(lines, pattern)

The formal name for what yield from does is “delegating to a sub-generator”, and it establishes a deeper connection between the containing and inner generator objects. In particular, generator objects have certain methods – send, throw and close – for passing information back into the context of the running generator function. I won’t cover them here, because they are not currently widely used; you can learn more by reading PEPs 342 and 380. If you do use them, yield from becomes necessary to enable the flow of information back into the scope of the running coroutine.
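As a small illustration of that last point – my own sketch, with hypothetical names echo() and outer() – a value passed to the outer generator object’s send() method travels through yield from into the delegated sub-generator:

Ex.
# send() on the outer generator is forwarded through yield from
# into the inner generator.
def echo():
    received = yield 'ready'
    yield 'echo: {}'.format(received)

def outer():
    yield from echo()

gen = outer()
print(next(gen))          # ready
print(gen.send('hello'))  # echo: hello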