Goldilocks Rectangle - finding a better solution

There are well known algorithms for many problems, but these are usually presented in an imperative fashion, which can often be difficult to translate to a functional language and immutable data-structures. Here we will look at three ways to solve a classic problem, looking at what it means to implement different solutions in a function style, using Haskell, and how that affects the code we write. Like Goldilocks and her quest for the perfect porridge, I will suggest that sometimes the ideal solution is actually a compromise between extremes.

picture credit:
picture credit:

The problem

The classic problem here is finding the largest rectangular area under a histogram, e.g.:

Jkv, Public Domain,
Jkv, Public Domain,

In this histogram, which we might represent as the following sequence of bar heights [1,2,5,12,10,19,27,14,6,4,0,1], the greatest rectangular area is 50, which is the area shown below:

histogram with box
histogram with box

The reason that this is an interesting problem is that there are many ways to approach it, with greatly varying performance characteristics, ranging from O(n²) through O(n.log n) to O(n), and they involve different approaches and techniques to achieve.

For the purposes of the problem, the bars are assumed to all be 1 unit wide, and all have a non-negative height.

Preamble and imports

We will be making use of the following imports in general:

Others will be discussed where required.

Solution 1.

For the first solution, let’s just think about taking a horizontal slice across the histogram at different heights:

histogram with slice at 12 units
histogram with slice at 12 units

If we slice across the histogram at the height of the 4th bar (12 units high), we get two sections, one with 1 bar in it, and another with 3. We don’t know at this stage how far up each bar extends, but we do know that they all extend down towards 0. This means there are two rectangular regions of height 12 we can draw, one of width 1, and one of width 3:

histogram with regions of height 12
histogram with regions of height 12

While we could consider every possible bar height, including values such as 11 which no bar is precisely, we really only need to consider the heights of the bars themselves as places to make our slices. So we can proceed as follows:

For each bar:

  • Slice the histogram at that height
  • Multiply the size of each slice by the height of the bar

And then take the maximum value.

You can view an animated version of this on observablehq.

Taking a slice across the histogram could look like this:

Lets look at what we get when we slice our histogram at the height of 12:

Great, we get either voids (groups of False) or intersections (groups of True) taken as a cross section across the histogram. Now all we need to do is ignore the voids and find the largest group, and then multiply the size of that by the height:

Again, we can see what this tells us the height of 12:

And this leads us to our first definition:

There are a couple of things to observe here. One is that this approach uses several partial functions (specifically head and maximum), and care must be taken to ensure that they are safe. The use of head is safe, because is guaranteed to never produce empty groups. If there are no groups, then we get an empty set of groups instead. The use of maximum in maxRectAt is safe because we are scanning the input list itself, which by definition includes one bar at least which matches our criterion (>= h). The use of maximum in solution_1 itself would error if we passed in an empty list of bars, so we instead we ensure that the list passed to maximum will have at least one value, 0, at its head. A downside of this approach is that we need to reassure ourselves of this safety, and the compiler cannot help us here.

Secondly it is nice to see how clean and clear this solution is. It uses standard list processing functions such as fmap, filter and group, and being able to express the rectangular area calculation in a single composed expression is satisfying.

There is a further optimisation here we can make use of - since two bars of the same height will produce the same maximum, we can nub the input to reduce this duplicated work - nubOrd from containers is a great choice here:

Unfortunately, while elegant, this is the naive O(n²) solution (nubbing notwithstanding). For each bar we consider, we scan the whole list to calculate the area at that height.

A Small improvement

What about instead of scanning the whole list we just scanned the relevant bits, i.e. the sections in-front of and behind the bar itself. We only need to look in either direction until we find a bar lower than the current one. For this we can make use of two little functions from Data.List: inits and tails. These produce all the prefixes before an index and all the suffixes after the index. We can imagine that inits can be defined as [take n xs | n <- [0 .. length xs]], and tails can be defined as [drop n xs | n <- [0 .. length xs]]. This means we will generally have to scan much less of the input, and perhaps produce a more efficient solution:

This will, for most realistic input, have much better amortised performance than the naive solution, but it is technically still O(n²), since for each input value we need to scan the tail and the prefix, and in the pathological case where every value is the same, then we still need to scan every value for every input, and on top of that, we also need to need to reverse the list.

Solution 2.

Let’s look instead at a well-known O(n) solution, that only considers each bar once, and uses a stack to manage the look-ahead.

The idea here is to keep a stack of all the bars we have seen for which we have not yet seen a matching end-bar. As soon as a section is terminated (by encountering a bar that is smaller than the previous one) we can remove that from the stack. This means that the stack is kept in ascending order, and the item before the top of the stack represents the left-edge of the region.

You can see an animation of this algorithm here

This algorithm is fundamentally stateful, requiring the concept of a stack of bars (including their indices) as well as the concept of a current index, which may not be the bar we are considering, especially when we come to pop the stack. While we could use the State monad to model this, stateful loops can be expressed more simply with a tail-recursive function, which should compile down to the same tight loop you would see in a more imperative language:

Here we have a single worker function, go which is called in tail-recursive position so long as there are values in either the input or the stack. Like a lot of such functions, keeping the types straight in your head is half the difficulty, so here a signature has been given to the worker function that helps identify what all the various Ints actually mean in this context. Being able to give these meaningful names is more helpful to the poor human reader than saying the stack is of type [(Int, Int)].

Mapping thought to action

We can see immediately that there are three cases:

  1. There is a new unconsidered input value and it is at least the height of whatever is on top of the stack (which we consider true if the stack is empty)
  2. The stack still has unprocessed values in it, which means that we have reached the right-hand-edge for whatever region the bar on the top of the stack is in. If the input list is empty, then we have reached the right-hand-edge for all the values remaining in the stack.
  3. There are neither any new values nor any unprocessed values in the stack, so we have nothing to do, and we can return the largest value we have seen so far.

What immediately strikes me about this implementation is how clearly it accords with the description of the algorithm - the algorithm has two conditions, and our function has two bodies with pattern guards. We can see from a brief inspection that at any point we are either advancing and stacking, or popping and calculating, and that we alternate between the two. Also, this implementation does not include any partial functions, and by making use of lists as stacks, and pattern guards, the compiler keeps us honest about handling the various cases.

What is less desirable here is how fiddly this is to get right. While this looks tolerably clean here, translating an imperative procedure to a functional style requires rethinking how it works. It can be satisfying though, and can even be said that this is more concise and even intelligible than some of the standard imperative presentations (especially since they tend to duplicate the handling of the two conditions when we need to pop the stack).

I think it is fair to say though that this is not exactly an idiomatic solution, and apart from its efficiency, I’m not sure it can exactly said to be very nice to read. One of the problems with recursion is the cognitive burden it places on the reader, and it has been compared to goto in terms of its capacity to create cryptic code. The next solution will seek to strike a fair balance between efficiency and clean code.

Solution 3.

The final solution is rests on the observation that we can always make a rectangular area of the height of the minimum value in the list of bars, with the width equal to the length of the list of bars. This may not the largest rectangle, but if it isn’t that, then it must be taken solely from the bars to the left or to the right of the minimum, which functions like a valley dividing the two hills. And in each such hill, there is a similar rectangular area - i.e. this is a recursive algorithm where we search through successively smaller subsections of the input, considering the minimum value each time and splitting the list at that point:

histogram with three sections
histogram with three sections

In the picture you can see a step in this process - either the maximum rectangular area is the wide section at the bottom with the height of the smallest bar, or it is contained within the two boxes left and right of that bar. After calculating the height of the bar at the bottom, we can proceed to apply the same process to the two sections to either side:

divide and conquer
divide and conquer

You can view an animated version of this on observablehq

These kinds of approaches translate very naturally to functional languages:

Nicer! This certainly looks a lot cleaner. Like the stack based solution, it is clearly a translation into code of the description of the solution - with handling of the left-hand-side, minimum value and right-hand-side of the inputs. We have a partial function, but we are explicitly passing it three values at a time, so that is not a problem.

What is more of an issue is actually the fact that most of our work is actually going to be in the splitAtMin function, which needs to rescan the input each time we recurse in order to find the new minimum. On the plus side we have increasingly fewer items to scan to find the next minimum, but still, that is going to involve a lot of repeated work, since finding the very first minimum also implies finding the next ones. It would be nicer if we could perform all those comparisons once and cache the results for later reuse.

Better performance through data-structures

One way to do this is to build up a search tree once that helps us quickly locate the minimum value, while preserving the sequence. An example of such a data structure is a so-called ‘segment tree’, where each node stores the minimum value of all the nodes beneath it. Our tree will also store the size of the trees below it, so that we can have O(1) sizes. Happily, such trees are easy to define in Haskell:

The trick to making this work nicely is to consider what happens when be smush two trees together to make a new composite tree. We want that new tree to contain a summary of their minimum value and their total size, and the concept of smushing together can be understood as the (<>) method of the Semigroup type-class:

Here we discard empty trees when combining them with anything else, and from tips and branches we build new tips, that in turn contain the global minimum values and the sum of the sizes. So we need some way to get the value and the size:

And being able to turn it back into a list would be nice:

As a next step, we need a way to build such a tree from a list of values. Since trees are Monoids, we can do this very simply with foldMap Tip, which wraps each element in a Tip, and then mappends them all together. However, this happens to build very unbalanced trees. For example, if we build a tree in this way, we can see it is right-biased:

Which builds the tree with tips on the left, and a long spine on the right:

      |                 |
  Tip = 1         [min=2,len=3]
               |                 |
           Tip = 2         [min=3,len=2]
                    Tip = 3           Tip = 4

That’s the same structure as a list! If the minimum is not at the head, we are going to have to traverse a long section of the spine to find it, making it no better than a list for random access (it is better when the minimum is near the front of the list, since we still don’t have to traverse the whole list to find it).

Since a balanced tree would be much nicer, in that it would produce more consistent results, a quick way to do that is with the treeFold operation (from Jon Fairbairn), which builds up the new folded structure without being purely left or right associative:

To see what this looks like, we can play around with this a bit:

Much nicer than either foldr or foldl, and much more likely to give is the logn search time we are going to be hoping for. Tree-fold cannot change the type of its input (unlike fold(r/l)), since it doesn’t know what direction it is coming from, but that means it works well for anything that obeys the Semigroup laws.

This allows us to build up a segment-tree easily from a list of inputs:

And it produces much more balanced trees, which becomes clear as soon as we indent the printed representation.

All that is remaining here is to define a way to split the tree at its (left-most) minimum point:

The only real complexity here is deciding whether to descend into the LHS or the RHS as we split the tree, depending on where the minimum is to be found. We also need to remember to keep the other side around by associating it on the other side as we return.

All these pieces let us write a divide-and-conquer function that performs the minimum search once, and where each split takes O(logn) time:

Happily, the only thing we really need to change here is how we split the input, and exchanging length for size - if we implement Foldable we can even use Foldable.length and this becomes more generic.

And there we have it - a clear, declarative solution, that abstracts the general portions of the implementation to a reusable data-structure rather than burying it in the approach itself. You can see that the complexity is defined by building up the search tree, then performing N splits, which each take O(logn) operations. We also get to benefit from O(1) size operations when calculating the base-case, which is nice.

Summing it Up

While the stack based solution is the right approach if we absolutely must optimise for speed, it suffers in terms of maintainability and clarity. It also doesn’t provide anything that can be re-used as is elsewhere, while the tree-based solution leads us to an interesting general purpose data-structure. n.logn solutions are often more than good enough, and the benefit from clarity seems to me to make this the Goldilocks option in this case.


A generic divide and conquer

What if we want to support multiple search trees? We we could define that as a typeclass. Since we

Then we can implement this interface for lists and SegmentTrees:

This requires MultiParamTypeClasses and FlexibleInstances however, but these are fairly standard extensions.

We are also going to need to make SegmentTrees instances of Foldable:

Then we can define a generic version of divideAndConq:

Pulling your finger(-tree) out

Attentive readers (if there are any) may have noticed that the SegmentTree presented here is just a special case of an even more general purpose data-structure known as a FingerTree. Here we will reimplement a SegmentTree in terms of a FingerTree and see how that compares:

First we will need to import it from the fingertree package:

The primary concept of the FingerTree is the idea of a measure, which associates the values with a datum, which is cached at the nodes. This value must be a Monoid so that we can freely combine trees together. In our case this is a combination of the minimum value and the sum of the number of items in the tree. We need to define two newtypes, one for the measure, and one for the value:

Since the tuple of two monoids is a monoid, we can just derive all the instances we need here. Next we need to define our SegmentFingerTree, which requires associating Ms and Vs:

The measure for a value is just itself in a Min wrapper and the count of 1, and the SegmentFingerTree is just an alias for the underlying FingerTree.

We now need to implement the two fundamental operations - building the tree up and tearing it down, and a function for getting the size of a tree.

Building a tree is just a thin wrapper around fromList itself that just wraps each value in a V newtype. Splitting makes use of the fact that measure, when called on the tree retrieves the cached value of that node, and we can then use the split function to find the first place that minimum value is found. Otherwise apart from a lot of newtype wrapping/unwrapping, and the fact we don’t need to do any special bookkeeping, it is identical to the split we defined for SegmentTree.

Now we have all the pieces to define divideAndConq in terms of a finger-tree:

We can see that this is basically the same, but uses the more general data-structure of the finger-tree underneath. If we compare what we had to write to make this work, we can see we only needed to define the bits that mattered to our domain - how to split at tree at the minimum value, how to get the cached size. Building the tree itself, and all other operations such as the ability to efficiently split and search it are gained simply by specifying general operations around new-types, most of which the compiler generated for us. This freed us up to focus on just the relevant sections of the code, which is very small and clear indeed. The handwritten SegmentTree is more efficient (as the benchmarks show), but this was much less work to write.


Talking about performance without numbers is a fool’s errand. So we need to get some numbers to back up our claims here. All the code for these implementations (and a couple of other minor variations) is available [here][code], and an example benchmark run is available [here][benchmark]. The benchmarks are not super thorough, and are run on a cyclic input of either 1,000 elements, or 10,000 elements. The results are enlightening though:

Approach Complexity Small (1k) Big (10k) Big Sine
naive O(n²) 56.01 ms 6.033 s 5.646 s
naive with nub O(n²) 633.0 μs 6.768 ms 1.115 s
inits-and-tails O(n²) 5.808 ms 703.6 ms 770.8 ms
recursive - minimum with span O(n²) 465.4 μs 25.80 ms 531.1 ms
recursive - custom tree, treemap O(n.logn) 332.3 μs 7.296 ms 16.42 ms
recursive - custom tree, foldmap O(n.logn) 320.1 μs 5.944 ms 1.829 s
recursive - finger tree O(n.logn) 1.530 ms 19.00 ms 27.70 ms
stack O(n) 30.89 μs 317.0 μs 334.2 μs

There are a lot of take-aways from this set of benchmarks, and not all of it was immediately obvious - at least not to me.

  • Input matters

    The big and big sine inputs have the same number of data-points, but differ in their shape, wave-length and period. For some approaches this has very little effect (the naive approach, and the stack-based approach are not sensitive to this), but the other approaches are significantly affected, although to quite different degrees. Coping with typical and pathological/friendly input makes a big difference.

  • The basic solution can be fast enough

    All the approaches are just fine for even moderately sized inputs - sweating the small performance gains is just not going to be worth it.

  • Simple optimisation can produce massive wins

    Just nubbing the input makes a massive performance improvement for the naive approach. While this algorithm is still much slower than the others, for the speed of development, it remains pretty acceptable up to even large inputs - it is 1,000 times faster than the un-nubbed version, and completes the big data-set faster than some of the smarter approaches. This advantage disappears when faced with a less friendly data-set though.

  • Choosing the right fold can make a world of difference

    When the next minimum is close to the left-hand-side of the data-set, foldMap works fine for producing a good search tree. However, when the data-set is less friendly, treeMap really shines, avoiding the spectacular blowup faced by the foldMaped tree with the sine wave.

  • Don’t be afraid to reimplement

    Finger-trees are awesome general purpose data-structures, and are capable of implementing a wide variety of different things. They are seldom the best in class though - there are faster interval maps, faster heaps, and as you can see here, faster segment trees there to be used. Just as interesting is the small amount of code that was needed to produce something that out-performed finger-trees.

  • You just can beat linear scans

    If you need the best, you just can’t beat a low-level linear scan. But as good as it is, the divide-and-conquer approaches are more than fast enough for most people’s needs, if you cannot find a linear scan that works for you.

picture credits:

  • open clipart
  • Wikimedia commons