chap07 3 reduce and combine

Common Higher-Order Functions

`reduce()` and `combine()`

The standard Perl List::Util module provides several commonly requested functions that are not built in to Perl. These include max() and min() functions, which respectively return the largest and smallest numbers in their argument lists, maxstr() and minstr(), which are the analogous functions for strings, and sum(), which returns the sum of the numbers in a list.

If we write sample code for these five functions, we'll see the similarity immediately:

        sub max { my $max = shift;
                  for (@_) { $max = $_ > $max ? $_ : $max }
                  return $max;
                }

        sub min { my $min = shift;
                  for (@_) { $min = $_ < $min ? $_ : $min }
                  return $min;
                }

        sub maxstr { my $max = shift;
                     for (@_) { $max = $_ gt $max ? $_ : $max }
                     return $max;
                   }

        sub minstr { my $min = shift;
                     for (@_) { $min = $_ lt $min ? $_ : $min }
                     return $min;
                   }

        sub sum { my $sum = shift;
                  for (@_) { $sum = $sum + $_ }
                  return $sum;
                }

Generalizing this gives us the reduce() function that is also provided by List::Util:

        sub reduce { my $code = shift;
                     my $val = shift;
                     for (@_) { $val = $code->($val, $_) }
                     return $val;
                   }

(List::Util::reduce is actually written in C for speed, but what it does it equivalent to this Perl code.) The idea is that we're going to scan the list one element at a time, accumulating a 'total' of some sort. We provide a function ($code) which says how to compute the new 'total', given the old total (first argument) and the current element (second argument). If our goal is just to add up all the list elements, then we compute the total at each stage by adding the previous total to the current element:

        reduce(sub { $_[0] + $_[1] }, @VALUES) == sum(@VALUES)

If our goal is to find the maximum element, then the 'total' is actually the maximum so far, then we compute the total at each stage by taking whichever of the current maximum and the current element is larger:

        reduce(sub { $_[0] > $_[1] ? $_[0] : $_[1] }, @VALUES) == max(@VALUES)

The reduce() function provided by List::Util is easier to call than the one above. It places the total-so-far in $a and the current list element into $b before invoking the callback, so that one can write

        reduce(sub { $a + $b }, @VALUES)
        reduce(sub { $a > $b ? $a : $b }, @VALUES)

We saw how to make this change back in Section ???, when we arranged to have imap()'s callback invoked with the current iterator value in $_ in addition to $_[0]; this allowed it to have a more map()-line calling syntax. We can arrange reduce() similarly:

        sub reduce (&@) { 
          my $code = shift;
          my $val = shift;
          for (@_) { 
            local ($a, $b) = ($val, $_); 
            $val = $code->($val, $_) 
          }
          return $val;
        }

Here we're using the global variables $a and $b to pass the total and the current list element. Use of global variables normally causes a compile-time failure under strict 'vars', but there is a special exemption for the variables $a and $b. The exemption is there to allow usages just like this one, and in particular to support the analogous feature of Perl's built-in sort() function. The List::Util version of reduce() already has this feature built in.

If we curry the reduce() function, we can use it to manufacture functions like sum() and max():

        BEGIN {
          *reduce = curry(\&List::Util::reduce);
          *sum = reduce { $a + $b };
          *max = reduce { $a > $b ? $a : $b };
        }

This version of reduce() isn't quite as general as it could be. All the functions manufactured by reduce() have one thing in common: given an empty list of arguments, they always return undef. For max() and min() this may be appropriate, but for sum() it's wrong; the sum of an empty list should be taken to be 0. (The sum() function provided by List::Util also has this defect.) This small defect masks a larger one: when the argument list is nonempty, the reduce() above assumes that the 'total' should be initialized to the first data item. This happens to work for sum() and map(), but it isn't appropriate for all functions. reduce can be made much more general if we drop this assumption. As a trivial example, suppose we want a function to produce the length of a list. This is almost what we want:

        reduce { $a + 1 };

But it only produces the correct length when given a list whose first element is 1, since otherwise $val is incorrectly initialized. A more general version of reduce() accepts an explicit parameter to say what value should be returned for an empty list:

        sub reduce (&$@) { 
          my $code = shift;
          my $val = shift;
          for (@_) { 
            local ($a, $b) = ($val, $_); 
            $val = $code->($val, $_) 
          }
          return $val;
        }

A version with optional currying is:

Download code for reduce

        sub reduce (&;$@) { 
          my $code = shift;
          my $f = sub {
            my $base_val = shift;
            my $g = sub {
              my $val = $base_val;
              for (@_) { 
                local ($a, $b) = ($val, $_); 
                $val = $code->($val, $_);
              }
              return $val;
            };
            @_ ? $g->(@_) : $g;
          };
          @_ ? $f->(@_) : $f;
        }

The list-length function is now

        *listlength = reduce { $a + 1 } 0;

where the 0 here is the correct result for an empty list. Similarly,

        *product = reduce { $a * $b } 1;

is a function which multiplies all the elements in a list of numbers. We can even use reduce() to compute both at the same time:

        *length_and_product = reduce { [$a->[0]+1, $a->[1]*$b] } [0, 1];

This makes only one pass over the list to compute both the length and the product. For an empty list, the result is [0, 1], and for a list with one element x, the result is [1, x]. List::Util::reduce() can only manufacture functions that return undef for the empty list, and that return the first list element for a single-element list. The length_and_produce() function can't be generated by List::Util::reduce() because it doesn't have these properties.

=test reduce

             sub reduce (&;$@);
         do 'reduce';
         # we're only testing the final super-duper version right now.


        # reduce(sub { $_[0] + $_[1] }, @VALUES) == sum(@VALUES)
        is(reduce(sub { $a + $b },1,2,3),6);
        # reduce(sub { $_[0] > $_[1] ? $_[0] : $_[1] }, @VALUES) == max(@VALUES)
        my $z = reduce { $_[0] > $_[1] ? $_[0] : $_[1] } 2,3,1;
        is($z, 3);

        *listlength = reduce { $a + 1 } 0;
        is(listlength(10..20), 11, "listlength nonempty");
        is(listlength(), 0, "listlength empty");

        *product = reduce { $a * $b } 1;
        is(product(2..7), 5040, "7!");
        is(product(), 1, "0!");

=endtest reduce

A properly general version of reduce() gets an additional argument, which says that the function should return when given an empty list as its argument. In the programming literature, the properly general version of reduce() is more typically called fold():

        sub fold {
          my $f = shift;
          my $fold;       
          $fold = sub {
            my $x = shift;
            sub {
              return $x unless @_;
              my $first = shift;
              $fold->($f->($x, $first), @_)
            }
          }
        }

Eliminating the recursion yields:

Download code for fold

        sub fold {
          my $f = shift;
          sub {
            my $x = shift;
            sub {
              my $r = $x;
              while (@_) {
                $r = $f->($r, shift());
              }
              return $r;
            }
          }
        }

=test fold

        do 'fold';
        do 'gcd';

        my $gcdf = fold(\&gcd)->(0);
        is($gcdf->(6,9),3);
        is($gcdf->(7,5),1);
        is($gcdf->(9,81,15),3);
        is($gcdf->(9,81,15,2),1);
        is($gcdf->(9,81,18,27),9);

=endtest fold

Boolean operators

Back in Section ??? we saw a system that would search backwards through a log file looking for records that matched a simple query. To extend this into a useful database system, we need to be able to combine simple queries into more complex ones.

Let's suppose that $a and $b are iterators that will produce data items that match queries A and B, respectively. How can we manufacture an iterator that matches the query A or B?

One way we could do this is to interleave the elements of $a and $b:

Download code for interleave

        sub interleave {
          my ($a, $b) = @_;
          return sub {
            my $next = $a->();
            unless (defined $next) {
              $a = $b;
              $next = $a->();
             }
            ($a, $b) = ($b, $a);
            $next;
          }
        }

=test interleave

        do 'interleave';
        sub upto {
           my ($m, $n) = @_;
           return sub { return $m <= $n ? $m++ : undef;  };
        }

        my $i1 = upto(1,3);
        my $i2 = upto(4,6);
        my $i = interleave($i1,$i2);
        for (qw(1 4 2 5 3 6)) {
          is($i->(),$_);
        }

        # this should be the end of stream, but it's returning a CODEref
        # instead.  $i->()->() == undef.  Is this just a sign of "the
        # interleaved outputs including some records (the end) more than
        # once?"
         
        is($i->(),undef);

=endtest interleave

But this has the drawback that if the record sets produced by $a and $b happen to overlap, the interleaved outputs will include some records more than once.

We can do better if we suppose that the records will be produced in some sort of canonical order. This assumption isn't unreasonable. Typically, a database will have a natural order dictated by the physical layout of the information on the disk and will always produce records in this natural order, at least until the data is modified. For example, our program for searching the web log file always produces matching records in the order they appear in the file. Even DBM files, which don't appear to keep records in any particular order, have a natural order; this is the order in which the records will be generated by the each() function.

Supposing that $a and $b will produce records in the same order, we can perform an 'or' operation as follows:

Download code for Iterator_Logic.pm

        package Iterator_Logic;
        use base 'Exporter';
        @EXPORT = qw(i_or_ i_or i_and_ i_and i_without_ i_without);
        
        sub i_or_ {
          my ($cmp, $a, $b) = @_;
          my ($av, $bv) = ($a->(), $b->());
          return sub {
            if (! defined $av && ! defined $bv) { return }
            elsif (! defined $av) { $rv = $bv; $bv = $b->() }
            elsif (! defined $bv) { $rv = $av; $av = $a->() }
            else {
              my $d = $cmp->($av, $bv);
              if    ($d < 0) { $rv = $av; $av = $a->() }
              elsif ($d > 0) { $rv = $bv; $bv = $b->() }
              else           { $rv = $av; $av = $a->(); $bv = $b->() }
            }
            return $rv;
          }
        }

        use Curry;
        BEGIN { *i_or = curry(\&i_or_) }

i_or_() gets a comparator function, $cmp, which defines the canonical order, and two iterators, $a and $b. It returns a new iterator which returns the next record from either $a or $b in the canonical order. If $a and $b both produce the same record, the duplicate is discarded. It begins by kicking $a and $b to obtain the next record from each. If either is exhausted, it returns the record from the other; if both are exhaused, it returns undef to indicate that there are no more records. $rv holds the record that is to be the return value.

If both input iterators produce records, the new iterator compares the records to see which should come out first. If the comparator returns zero, it means the two records are the same, and only one of them should be emitted. $rv is assigned one of the two records, as appropriate, and then one or both of the iterators is kicked to produce new records for the next call.

The logic is very similar to the merge() function of Section ???. In fact, merge() is the stream analog of the 'or' operator.

i_or() is a curried version of i_or_(), called like this:

        BEGIN { *numeric_or = i_or { $_[0] <=> $_[1] };
                *alphabetic_or = i_or { $_[0] cmp $_[1] };
         }

        $event_times =  numeric_or($access_request_times,
                        numeric_or($report_request_times,
                                   $server_start_times));

'and' is similar:

=contlisting Iterator_Logic.pm

        sub i_and_ {
          my ($cmp, $a, $b) = @_;
          my ($av, $bv) = ($a->(), $b->());
          return sub {
            my $d;
            until (! defined $av || ! defined $bv || 
                   ($d = $cmp->($av, $bv)) == 0) {
              if ($d < 0) { $av = $a->() }
              else        { $bv = $b->() }
            }
            return unless defined $av && defined $bv;
            my $rv = $av;
            ($av, $bv) = ($a->(), $b->());
            return $rv;
          }
        }

        BEGIN { *i_and = curry \&i_and_ }

=test and-or

        use Curry;
        use Iterator_Logic;

        my @a = (2, 3, 5, 7, 11, 13, 17);
        my @b = (1, 2, 3, 4, 5, 6, 7);

        my (@and, @or);
        { my %count;
          for (@a, @b) { $count{$_}++ }
          @and = grep $count{$_}==2, sort { $a <=> $b } keys %count;
          @or  = grep $count{$_}!=0, sort { $a <=> $b } keys %count;
        }
        print "# and: @and\n";
        print "# or:  @or\n";

        sub l2i {
          my @a = @_;
          my $i = 0;
          return sub {
            $a[$i++];
          };
        }

        { 
          my $and = i_and(sub { $_[0] <=> $_[1] }, l2i(@a), l2i(@b));
          my $andf = i_and(sub { $_[0] <=> $_[1] });
          my $andc = $andf->(l2i(@a), l2i(@b));
          for (@and) {
            is($and->(),  $_, "and uncurried");
            is($andc->(), $_, "and curried");
          }
          is($and->(),  undef, "and uncurried exhausted");
          is($andc->(), undef, "and curried exhausted");
        }
        
        { 
          my $or = i_or(sub { $_[0] <=> $_[1] }, l2i(@a), l2i(@b));
          my $orf = i_or(sub { $_[0] <=> $_[1] });
          my $orc = $orf->(l2i(@a), l2i(@b));
          for (@or) {
            is($or->(),  $_, "or uncurried");
            is($orc->(), $_, "or curried");
          }
          is($or->(),  undef, "or uncurried exhausted");
          is($orc->(), undef, "or curried exhausted");
        }

=endtest and-or

TOP

reduce() and combine()

Boolean operators

`reduce()` and `combine()`