This article is about lazy evaluation, an evaluation strategy, i.e., an operational semantics: it specifies the steps a computer takes to get things done. Lazy evaluation is a popular way to implement most of Haskell, but not the only way.
Haskell itself is not lazy; for most things it specifies only non-strictness, not what steps a computer takes. Non-strictness is a word from denotational semantics, which is only about relating expressions to values. Non-strictness means, for example,
const True ⊥ = True
(⊥ is a theoretical value to stand for “no information” or “no answer”. I cannot say “non-termination” because that would require talking about steps.) In other words, non-strictness means that an expression can have an answer even if some of its arguments have no answer.
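For a concrete check (my own illustration, not part of the original examples): in GHCi, undefined can play the role of ⊥, and
const True undefined
prints True, because the answer does not depend on the second argument.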
Denotational semantics says whether you get an answer and what answer, but does not explain or control the amount of time and memory used to get it. This is when you need the operational semantics given by compilers and interpreters; fortunately, most of them give lazy evaluation, coarsely speaking. Lazy evaluation suffices to explain asymptotic complexities such as why some code uses Θ(n) memory and why an alternative uses Θ(1) memory. Even such a coarse understanding without real cycle counts and byte counts already solves a lot of mysteries because most programmers do not even get this far.
If actual cycle counts and byte counts matter, if you need to explain why some code uses 40 MB memory and why an alternative uses just 10 MB, then lazy evaluation does not suffice, and you need the low-level operational semantics of your compiler or interpreter, including code optimizations. This is outside the scope of this article. But seeing that the low-level operational semantics is a refinement of lazy evaluation, you should probably know lazy evaluation first.
Knowing lazy evaluation is not the best way to explain what answer you get, if that is all you want regardless of complexity. It can be done, and most people do it (out of poor education), but it is too low-level and tedious, missing the forest for the trees, nay, tree leaves. Denotational semantics is better, but still not high-level enough. Axiomatic semantics combined with program derivation is the most suitable, and I have triumphant examples of the axiomatic, derivational style: scanl examples, fix examples, and a primitive recursion example.
The main points of lazy evaluation are pretty straightforward, just unfamiliar: short-circuit evaluation, and sharing the result (if the thing evaluated has multiple parents). This is of course vague; I illustrate below and fully clarify in subsequent sections.
To illustrate short-circuit evaluation, I evaluate (\c _ -> c) True (3*4+3*4):
(\c _ -> c) True (3*4+3*4)
True
In particular, the expression 3*4+3*4 is left alone because the rule for function application doesn't evaluate it.
A longer example, to show what pattern-matching evaluates:
case (\c _ -> c) (Just (3*4+3*4)) 4 of { Just _ -> True; Nothing -> False }
(\c _ -> c) (Just (3*4+3*4)) 4
Just (3*4+3*4)
case Just (3*4+3*4) of { Just _ -> True; Nothing -> False }
True
The sub-expression is evaluated only as far as needed to tell which pattern matches; 3*4+3*4 is again left unevaluated.
But sometimes we are also interested in evaluating 3*4+3*4, for example if we expressly want it done, or if the pattern above were Just 0. Here is how.
The operators (+) and (*) for built-in numeric types (Int, Integer, Float, Double) are eager, not lazy: to evaluate e+f (say), both e and f are evaluated first. Let us also pick the evaluation order so that we evaluate e, then f, then get back to finish e+f; similarly for e*f.
I now evaluate 3*4+3*4:
3*4+3*4
3*4
12
12+3*4
3*4
12
12+12
24
Now I illustrate sharing. I evaluate (\x->x+x) (3*4):
(\x->x+x) (3*4)
3*4
12
24
By mentioning the same formal argument x multiple times, lazy evaluation requires sharing the actual argument 3*4, as well as everything that happens to it subsequently, such as it becoming 12.
The same sharing applies to bindings in pattern-matching, let, where, and top-level definitions. (For example, let x=3*4 in x+x is almost identical to the above.)
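For concreteness, here is my own rendering of that trace, in the same style:
let x=3*4 in x+x
x+x    {both occurrences of x point to one shared 3*4 node}
3*4
12
24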
However, sharing does not apply to 3*4+3*4 per se, unless the compiler applies an optimization called common-subexpression elimination.
This whirlwind tour gives a glimpse of lazy evaluation but leaves the precise rules unsaid. These are spelled out in subsequent sections.
The notation I use to depict expressions is designed to make explicit what is shared (and what is not shared); this is very important because sharing makes a world of difference in lazy evaluation. The notation also happens to be close to how most compilers and interpreters represent expressions.
A typical expression consists mostly of function applications and constructors, like let x=blah in f x : g x. In the extreme, I may draw a graph in which a node stands for each function application or constructor, with explicit pointers to functions and arguments. This extreme takes a lot of space and a lot of work. More often, I like to merge most nodes into one, drawing separate nodes only for shared things (in this example blah).
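Roughly, in text form (my own rendering; the actual drawings use boxes and arrows, and the node names here are made up), the fully explicit graph names every application and constructor node:
root: a1 : a2
a1:   f x
a2:   g x
x:    blah
while the merged drawing keeps only the shared node separate:
root: f x : g x    {both occurrences of x point to the node below}
x:    blah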
Furthermore, I don't draw separate nodes for lambda-bound variables, though they may occur multiple times. So for example, I leave (\x -> f x : g x) as it is. Lambda-bound variables represent holes to be filled in, rather than pointers to more nodes.
I call these Inlined Expression Graphs: they are like expression graphs, but I inline (merge) unshared nodes to save space and work.
This notation is similar to that in the book Introduction to Functional Programming using Haskell, 2nd edition, by Richard Bird. I add boxes around nodes; I also have slightly more separate nodes.
There is a text-only notation in the paper A Natural Semantics for Lazy Evaluation, by John Launchbury. It is like using a huge let-expression to represent an expression graph, except that it does away with the keywords “let” and “in”. It binds names to nodes (except the root), so pointers can be serialized as target node names. It remains to write down the binding of names to node contents, and to distinguish the root. So for example, let x=3*4 in x+x is written as:
{x↦t*f, t↦3, f↦4} : x+x
(It uses the convention that every argument must be a separate node.)
Likewise, the book The Haskell School of Expression, by Paul Hudak, uses a text-only notation based on where-clauses and bindings when sharing is important, e.g., x+x where x=3*4. I used to use this notation when contributing to the Haskell Wiki a long time ago.
I now decide against a text-only notation because I think that most readers like to see pictures when sharing happens.
I write evaluation steps from start to end as a vertical sequence: the starting expression at the top, and below it the expression after each step.
Sometimes, evaluation requires zooming into a sub-expression and processing it, before zooming out and continuing with the parent expression. I indicate this by an indented, nested evaluation of the sub-expression.
This format is a fragment of Structured Derivation by Ralph-Johan Back. The most important feature I use here is nesting, which is what “structured” refers to. Most other presentations are unnested (including Bird's, Hudak's, all webpages before this one; fortunately, Launchbury's has nesting), i.e., they repeat the whole parent expression at every step.
A nested presentation is superior in several ways. It separates the two concerns—which rule triggers zooming into blah, which rule transforms blah—and explains them separately. It cuts the bloat and confusion of copying the parent expression over and over again when the sub-expression takes many steps to process. It is closer to what most compilers and interpreters do: when zooming in, one stack unit is reserved; when zooming out, one stack unit is freed. Thus, nesting depth is stack depth, and it completely explains stack overflows.
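To make the contrast concrete with the earlier 3*4+3*4 example (my own rendering): an unnested presentation repeats the parent expression at every step,
3*4+3*4
12+3*4
12+12
24
whereas a nested presentation zooms into 3*4 on an indented line, rewrites it to 12 there, and only then returns to the parent.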
The evaluation rule for function application: evaluate the function until it is a lambda term, then plug in the arguments. Multiple occurrences of a formal argument must all point to the same actual argument node (sharing).
Example: (\c -> (c, 0)) blah
(\c -> (c, 0)) blah
(blah, 0)
Example: (\c -> (c, c)) blah
(\c -> (c, c)) blah
(blah, blah)    {the two components point to one shared blah node}
The above examples are easy because of two simplifications: the function is already a lambda term, and there is only one argument. The following example shows the first complication, where the function needs its own processing first; it uses an evaluation rule covered later.
Example: (if True then (\c -> (c, 0)) else (\c -> (c, 1))) blah
(if True then (\c -> (c, 0)) else (\c -> (c, 1))) blah
(if True then (\c -> (c, 0)) else (\c -> (c, 1)))
(\c -> (c, 0))
(\c -> (c, 0)) blah
(blah, 0)
The second complication is about “multiple” arguments like f x y. Technically, this is (f x) y, i.e., a function application with one argument y, but the function is yet another function application.
Example: (\c _ -> c) True (3*4+3*4)
(\c _ -> c) True (3*4+3*4)
(\c _ -> c) True
(\_ -> True)
(\_ -> True) (3*4+3*4)
True
It is a hassle to add N-1 nesting levels for what can be casually read as 1 function and N arguments. In practice, I will shorten the evaluation steps to plugging in all N arguments at once. This shortcut is safe for asymptotic complexity because the Haskell type system limits the number of arguments below program size. So this shortcut does not mistake a Θ(n) computation for a Θ(1) computation; it only hides a constant factor. This shortcut is also taken by most compilers and interpreters.
Example: (\c _ -> c) True (3*4+3*4)
(\c _ -> c) True (3*4+3*4)
True
If the function is a name with a definition elsewhere, I have two choices: I could spend a step turning the name into the corresponding lambda term, and then plug in the arguments; or I could just jump to plugging in the arguments. So here is another shortcut: I will jump to plugging in the arguments. It is less clutter, especially when later I cover functions defined by pattern matching. It is safe for asymptotic complexity. It is also close to most compilers and interpreters.
Example: const True (3*4+3*4)
given: const c _ = c
const True (3*4+3*4)
True
A later section covers function applications in which the functions are defined by pattern matching.
The evaluation rule for comparison and arithmetic of built-in numeric types (Int, Integer, Float, Double): evaluate the operands (in operand order, usually) until they are final answers, then evaluate the expression in question until it is a final answer. For example: to evaluate e+f, evaluate e, then evaluate f, then evaluate e+f.
(For user-defined types, evaluation order and eagerness depend on implementation, since comparison and arithmetic for them are also user-defined via the type classes Eq, Ord, Enum, Num, etc. User-defined code can do anything it likes. There are well-known user-defined naturals with partly lazy comparison and fully lazy succ, for example.)
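For instance, here is a sketch of such naturals (my own illustration; the names Nat, infinity, and leN are made up):

data Nat = Zero | Succ Nat

-- Construction is fully lazy: Succ wraps without inspecting its argument.
infinity :: Nat
infinity = Succ infinity

-- Comparison is partly lazy: it stops as soon as one side runs out.
leN :: Nat -> Nat -> Bool
leN Zero     _        = True
leN (Succ _) Zero     = False
leN (Succ m) (Succ n) = leN m n

-- leN (Succ Zero) infinity evaluates to True after peeling one Succ from each side.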
Example: 3*4+3*4
3*4+3*4
3*4
12
12+3*4
3*4
12
12+12
24
Example: 0*(3+4)
0*(3+4)
3+4
7
0*7
0
The evaluation rule for most operations on Char (comparison, most Enum operations, conversion from/to Int) is similar: evaluate the operands (in operand order, usually) until they are final answers, then evaluate the expression in question until it is a final answer.
Example: succ 'g' == pred (pred 'j')
succ 'g' == pred (pred 'j')
succ 'g'
'h'
'h' == pred (pred 'j')
pred (pred 'j')
pred 'j'
'i'
pred 'i'
'h'
'h' == 'h'
True
if-then-else can be regarded as a special case of pattern matching. I cover if-then-else first—it's easy.
The evaluation rule for if b then et else ef: evaluate b until it becomes True or False, then go with et or ef, respectively.
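In fact, if b then et else ef behaves exactly like the following case expression, so this rule is an instance of the pattern-matching rule in the next section:
case b of { True -> et; False -> ef }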
Example: if 2>2 then () else ()
if 2>2 then () else ()
2>2
False
if False then () else ()
()
The evaluation rule for case e of {…}: evaluate the argument e enough to find the first pattern that matches (and enough to see that it does not match the patterns before it), then bind the variables in the chosen pattern, and go with the chosen branch; or, if no pattern matches, it is a runtime error.
There are a lot of variations in how much evaluation is enough and how much binding and sharing happens, depending on the patterns. In this section and several case studies shortly after, all patterns have at most one level, e.g., there are Just x and x:xs, but none nested like Left (Just x : (Nothing : ys)). There are also no guards. This is why the section title says “Level 1”; we shall accomplish higher levels later! So this section just focuses on level 1 patterns and sharing.
With level 1 patterns, the argument e is evaluated just until its outermost constructor is exposed, and then we know which branch to choose. This is right for types defined by data, but not quite right for those defined by newtype. I will cover newtype in the future.
Example: case const (Just a) b of { Just _ -> True; Nothing -> False }
case const (Just a) b of { Just _ -> True; Nothing -> False }
{evaluate the argument enough to check it against Just _}
const (Just a) b
Just a
case Just a of { Just _ -> True; Nothing -> False }
True
A level 1 pattern could also be a Char or numeric literal. Matching against it boils down to comparison using (==), i.e., the argument is evaluated as much as (==) specifies. Thus, for Char and built-in numbers, evaluate until the final answer; for user-defined numbers, it depends on the user-defined (==).
Example: case 3*4 of { 0 -> True; _ -> False }
case 3*4 of { 0 -> True; _ -> False }
{evaluate the argument enough to compare it with 0}
3*4
12
case 12 of { 0 -> True; _ -> False }
False
There are also level 0 patterns: just a variable or the _ wildcard. The argument is left unevaluated when matching them.
Example: case 3*4 of { x -> True }
case 3*4 of { x -> True }
True
Variables in patterns are similar to lambda-bound variables: They are holes to be filled with pointers, and when you fill them in and go with the branch, multiple occurrences point to the same node (sharing).
Example: case Just (3*4) of { Just x -> (x,x); Nothing -> (0,0) }
case Just (3*4) of { Just x -> (x,x); Nothing -> (0,0) }
(x,x)    {both occurrences of x point to the one 3*4 node}
Example: case Just (3*4) of { v@(Just x) -> (x,v); Nothing -> undefined }
case Just (3*4) of { v@(Just x) -> (x,v); Nothing -> undefined }
In the graph for this example, I draw some nodes explicitly to subtly indicate the process: We have a Just node, and it has a pointer to a 3*4 node. We give the Just node to case, and that's what v receives. Whatever our Just node points to, that's what x receives. This explains why 3*4 is shared. The following example further emphasizes that v also shares existing nodes rather than pointing to new nodes.
Example: let j = Just (3*4) in case j of { v@(Just x) -> (x, v, j) }
Here the result (x, v, j) has x pointing to the 3*4 node, while v and j point to the same Just node.
The evaluation rule for function application, where the function is defined elsewhere by pattern-matching with 1 argument: similar to a case expression, evaluate the argument enough to find the first match, then bind variables and go with the chosen RHS; or, if there is no match, abort.
Example: null (iterate id 'c')
given these definitions:
null [] = True
null (_:_) = False
iterate f x = x : iterate f (f x)
null (iterate id 'c')
{evaluate the argument enough to check it against []}
iterate id 'c'
'c' : iterate id 'c'
null ('c' : iterate id 'c')
False
(Technically I should draw a graph to show the sharing of 'c'.)
The evaluation rule for function application, where the function takes several arguments and is defined by pattern-matching, is fairly delicate to describe because the order of evaluating the arguments is delicate.
Presumably, the function definition consists of a sequence of equations. At function application, the equations are tried in the sequence order.
To try the first equation: evaluate the arguments, left to right, just enough to check them against that equation's patterns; if any pattern fails to match, jump below to try the second equation. If all patterns match, bind the variables and evaluate the guards in order; if a guard evaluates to True, go with its RHS. If all evaluate to False, jump below to try the second equation. Later equations are tried the same way. An important aspect is that arguments could be more evaluated than the chosen equation requires, because there are equations before it, and they have more specific patterns or more demanding guards, and we need that much evaluation to decide to skip them.
Example: f [] (r 'y')
given:
r x = x : r x
f [] _ = 0
f _ [] = 1
f [] (r 'y')
0
Example: f (r 'y') []
f (r 'y') []
{evaluate the first argument enough to check it against [] (for the first equation)}
r 'y'
'y' : r 'y'
f ('y' : r 'y') []
1
Example: g (r 'x') (r 'y')
given:
r x = x : r x
g (x:xs) (y:ys) | x==y = 0
g _ _ = 1
g (r 'x') (r 'y')
{evaluate the arguments and the guard enough to decide the first g equation}
r 'x'
'x' : r 'x'
r 'y'
'y' : r 'y'
'x' == 'y'
False
g ('x' : r 'x') ('y' : r 'y')
1
Function definitions that not only have patterns and guards but also where bindings can have very elaborate sharing structures. I will show them in the future when I find a good notation.
The most important purpose of knowing lazy evaluation is to explain asymptotic complexity, especially exhaustions such as stack overflow and heap leak.
Asymptotic time complexity is the number of steps in the evaluation sequence.
Asymptotic stack complexity is the depth of nesting in the evaluation sequence (how many times I zoom into sub-expressions).
Asymptotic heap complexity can be gleaned from expression sizes in the evaluation sequence, but it may take some work when there are zoom-in steps.
When there is no zooming in, every step has the complete expression graph, and so the graph size is heap size. Of course, I inline some nodes, so count both node numbers and string lengths to be sure. Richard Bird's book has another good way to count: count actual arguments. All are asymptotically accurate counts, differing in constant overheads only.
When there is zooming in, in some cases none of the steps shows the complete expression graph at the time. So you have to imagine a flattened presentation of the complete graph. Here is an example excerpted from the foldr case study later (there, f plays the role of foldr (+) 0):
f (1:2:3:[])
1 + f (2:3:[])
f (2:3:[])
2 + f (3:[])
f (3:[])
3 + f []
f []
0
At this point, the complete graph is 1 + (2 + (3 + 0))
.
The total number of actual arguments is 6 (3 binary operators).
In the foldr
case study, if you start with a list of length n,
then in the middle the complete graph has 2n actual arguments, i.e.,
Θ(n) heap space.
It is a good exercise to sum numbers (and do other tasks) with foldr, though we know it is fairly expensive.
foldr f z [] = z
foldr f z (x:xs) = f x (foldr f z xs)
foldr (+) 0 (1:2:3:[])
1 + foldr (+) 0 (2:3:[])
foldr (+) 0 (2:3:[])
2 + foldr (+) 0 (3:[])
foldr (+) 0 (3:[])
3 + foldr (+) 0 []
foldr (+) 0 []
0
3 + 0
3
2 + 3
5
1 + 5
6
This takes Θ(n) stack space; a long list will cause a stack overflow. Moreover, at the darkest moment, the whole graph for 1+(2+(3+0)) exists in the heap. This takes Θ(n) heap space.
Lest you think foldr is always a hog, here is a different example.
True || _ = True
False || x = x
foldr (||) False (True:False:False:[])
True || foldr (||) False (False:False:[])
True
If the binary operation is lazy in the 2nd argument, foldr is not a hog.
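Another standard illustration (my own supplementary example; doubleAll is a name I made up): (:) is also lazy in its 2nd argument, so foldr can produce a list incrementally, and a consumer that stops early never forces the rest.

doubleAll :: [Integer] -> [Integer]
doubleAll = foldr (\x rest -> 2*x : rest) []

-- take 2 (doubleAll [1..]) evaluates to [2,4] without touching the infinite tail.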
You have seen how foldr fails at summing numbers. Lest you think foldl is any better:
foldl f z [] = z
foldl f z (x:xs) = foldl f (f z x) xs
foldl (+) 0 (1:2:3:[])
foldl (+) (0+1) (2:3:[])
foldl (+) ((0+1)+2) (3:[])
foldl (+) (((0+1)+2)+3) []
((0+1)+2)+3
(0+1)+2
0+1
1
1+2
3
3+3
6
This uses Θ(n) stack space and Θ(n) heap space. So-called “tail call” is utterly irrelevant. After reading this example, please do me a favour: never ask about “tail call” again.
A better way to sum numbers is covered in the future, involving eagerness annotations.
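As a preview (my own sketch, using seq, which this article has not introduced yet), one such eager summing function looks like this; the next section calls a function with this behaviour goodsum:

goodsum :: Integer -> [Integer] -> Integer
goodsum acc []     = acc
goodsum acc (x:xs) = let acc' = acc + x
                     in acc' `seq` goodsum acc' xs   -- force the sum so far before recursing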
Even if we have an efficient function for summing—call it goodsum—the next surprise comes when computing averages using goodsum 0 xs / length xs.
Assume that goodsum behaves like this in general:
goodsum 5 (10:xs)
goodsum (5+10) xs
goodsum 15 xs
and these definitions for the rest:
average xs = goodsum 0 xs / length xs
genlist 0 = []
genlist n = n : genlist (n-1)
I won't need a definition for length because I won't get that far.
average (genlist 3)
goodsum 0 (genlist 3)
genlist 3
3 : genlist (3-1)
goodsum 0 (3 : genlist (3-1))
goodsum 3 (genlist (3-1))
genlist (3-1)
2 : genlist (2-1)
goodsum 3 (2 : genlist (2-1))
goodsum 5 (genlist (2-1))
genlist (2-1)
1 : genlist (1-1)
goodsum 5 (1 : genlist (1-1))
goodsum 6 (genlist (1-1))
genlist (1-1)
[]
goodsum 6 []
6
The effect is that, due to sharing, although goodsum tries to forget pieces of the list as it goes, some other part of the parent expression has a pointer to the list and keeps it. Before goodsum begins, the pointer points to just a short genlist 3; when goodsum finishes, the pointer points to a long list, which occupies Θ(n) heap space. Some people consider it a heap leak.
To show this effect more clearly, I give a flat presentation:
average (genlist 3)
{the flat presentation alternates genlist steps and goodsum steps; each step shows the complete expression graph, in which the length part of average still points to every list cell produced so far}
Thus heap size has grown to Θ(n).
Lest you think all lazy evaluation does is take up a lot of memory, the following examples show how to use lazy evaluation properly to obtain compositional, asymptotically optimal programs.
Again, assume that goodsum behaves like this in general:
goodsum 5 (10:xs)
goodsum (5+10) xs
goodsum 15 xs
and this definition:
genlist 0 = []
genlist n = n : genlist (n-1)
Evaluate goodsum 0 (genlist 3):
goodsum 0 (genlist 3)
genlist 3
3 : genlist (3-1)
goodsum 0 (3 : genlist (3-1))
goodsum 3 (genlist (3-1))
genlist (3-1)
2 : genlist (2-1)
goodsum 3 (2 : genlist (2-1))
goodsum 5 (genlist (2-1))
genlist (2-1)
1 : genlist (1-1)
goodsum 5 (1 : genlist (1-1))
goodsum 6 (genlist (1-1))
genlist (1-1)
[]
goodsum 6 []
6
For a while, it goes like the Average example. The important point this time is that, as no one else holds a pointer to the generated list, its cells are short-lived, like virtual particles. This program takes Θ(1) memory. Some compilers even optimize away the short-lived cells.
This shows an important idiom of using lazy evaluation: We have a producer, genlist 3, and we have a consumer, goodsum 0, and when put together, they behave like a pair of co-routines, passing control to the producer for one cell, then passing control to the consumer to use and lose that cell, then passing control to the producer for the next cell… We say that in Haskell we use lists as our for-loops.
The above example still does not do justice to the idiom. The idiom really shines when the consumer stops consuming early for whatever reason it feels like. The producer is not burdened with the stopping condition of the consumer. We get a cleaner separation. To be continued.
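As a small foretaste (my own illustration, reusing genlist from above; hasBig is a name I made up): the producer knows nothing about the stopping condition, yet only one cell is ever produced.

hasBig :: [Integer] -> Bool
hasBig = any (> 999999)

-- hasBig (genlist 1000000) evaluates to True after the producer yields just its
-- first cell, 1000000; the rest of the list is never generated, because any
-- short-circuits through (||).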
I have more Haskell Notes and Examples.