`reduce()` and `combine()`

The standard Perl `List::Util` module provides several commonly
requested functions that are not built in to Perl. These include
`max()` and `min()` functions, which respectively return the largest and
smallest numbers in their argument lists, `maxstr()` and `minstr()`,
which are the analogous functions for strings, and `sum()`, which
returns the sum of the numbers in a list.

If we write sample code for these five functions, we'll see the similarity immediately:

sub max { my $max = shift; for (@_) { $max = $_ > $max ? $_ : $max } return $max; } sub min { my $min = shift; for (@_) { $min = $_ < $min ? $_ : $min } return $min; } sub maxstr { my $max = shift; for (@_) { $max = $_ gt $max ? $_ : $max } return $max; } sub minstr { my $min = shift; for (@_) { $min = $_ lt $min ? $_ : $min } return $min; } sub sum { my $sum = shift; for (@_) { $sum = $sum + $_ } return $sum; }

Generalizing this gives us the `reduce()` function that is also
provided by `List::Util`:

sub reduce { my $code = shift; my $val = shift; for (@_) { $val = $code->($val, $_) } return $val; }

(`List::Util::reduce` is actually written in C for speed, but what it
does it equivalent to this Perl code.) The idea is that we're going
to scan the list one element at a time, accumulating a 'total' of some
sort. We provide a function (`$code`) which says how to compute the
new 'total', given the old total (first argument) and the current
element (second argument). If our goal is just to add up all the list
elements, then we compute the total at each stage by adding the
previous total to the current element:

reduce(sub { $_[0] + $_[1] }, @VALUES) == sum(@VALUES)

If our goal is to find the maximum element, then the 'total' is actually the maximum so far, then we compute the total at each stage by taking whichever of the current maximum and the current element is larger:

reduce(sub { $_[0] > $_[1] ? $_[0] : $_[1] }, @VALUES) == max(@VALUES)

The `reduce()` function provided by `List::Util` is easier to call
than the one above. It places the total-so-far in `$a` and the
current list element into `$b` before invoking the callback, so that
one can write

reduce(sub { $a + $b }, @VALUES) reduce(sub { $a > $b ? $a : $b }, @VALUES)

We saw how to make this change back in Section ???, when we
arranged to have `imap()`'s callback invoked with the current iterator
value in `$_` in addition to `$_[0]`; this allowed it to have a more
`map()`-line calling syntax. We can arrange `reduce()` similarly:

sub reduce (&@) { my $code = shift; my $val = shift; for (@_) {local ($a, $b) = ($val, $_);$val = $code->($val, $_) } return $val; }

Here we're using the global variables `$a` and `$b` to pass the
total and the current list element. Use of global variables normally
causes a compile-time failure under `strict 'vars'`, but there is
a special exemption for the variables `$a` and `$b`. The exemption
is there to allow usages just like this one, and in particular to
support the analogous feature of Perl's built-in `sort()` function.
The `List::Util` version of `reduce()` already has this feature built
in.

If we curry the `reduce()` function, we can use it to *manufacture*
functions like `sum()` and `max()`:

BEGIN { *reduce = curry(\&List::Util::reduce); *sum = reduce { $a + $b }; *max = reduce { $a > $b ? $a : $b }; }

This version of `reduce()` isn't quite as general as it could be. All
the functions manufactured by `reduce()` have one thing in common:
given an empty list of arguments, they always return undef. For
`max()` and `min()` this may be appropriate, but for `sum()` it's wrong;
the sum of an empty list should be taken to be 0. (The `sum()`
function provided by `List::Util` also has this defect.) This small
defect masks a larger one: when the argument list is nonempty, the
`reduce()` above assumes that the 'total' should be initialized to the
first data item. This happens to work for `sum()` and `map()`, but it
isn't appropriate for all functions. `reduce` can be made much more
general if we drop this assumption. As a trivial example, suppose we
want a function to produce the length of a list. This is *almost*
what we want:

reduce { $a + 1 };

But it only produces the correct length when given a list whose first
element is 1, since otherwise `$val` is incorrectly initialized. A
more general version of `reduce()` accepts an explicit parameter to say
what value should be returned for an empty list:

sub reduce (&$@) {my $code = shift; my $val = shift; for (@_) { local ($a, $b) = ($val, $_); $val = $code->($val, $_) } return $val; }

A version with optional currying is:

sub reduce (&;$@) { my $code = shift; my $f = sub { my $base_val = shift; my $g = sub { my $val = $base_val; for (@_) { local ($a, $b) = ($val, $_); $val = $code->($val, $_); } return $val; }; @_ ? $g->(@_) : $g; }; @_ ? $f->(@_) : $f; }

The list-length function is now

*listlength = reduce { $a + 1 } 0;

where the `0` here is the correct result for an empty list.
Similarly,

*product = reduce { $a * $b } 1;

is a function which multiplies all the elements in a list of numbers.
We can even use `reduce()` to compute both at the same time:

*length_and_product = reduce { [$a->[0]+1, $a->[1]*$b] } [0, 1];

This makes only one pass over the list to compute both the length and
the product. For an empty list, the result is `[0, 1]`, and for a
list with one element `x`, the result is `[1, x]`.
`List::Util::reduce()` can only manufacture functions that return undef
for the empty list, and that return the first list element for a
single-element list. The `length_and_produce()` function can't be
generated by `List::Util::reduce()` because it doesn't
have these properties.

sub reduce (&;$@); do 'reduce'; # we're only testing the final super-duper version right now. # reduce(sub { $_[0] + $_[1] }, @VALUES) == sum(@VALUES) is(reduce(sub { $a + $b },1,2,3),6); # reduce(sub { $_[0] > $_[1] ? $_[0] : $_[1] }, @VALUES) == max(@VALUES) my $z = reduce { $_[0] > $_[1] ? $_[0] : $_[1] } 2,3,1; is($z, 3); *listlength = reduce { $a + 1 } 0; is(listlength(10..20), 11, "listlength nonempty"); is(listlength(), 0, "listlength empty"); *product = reduce { $a * $b } 1; is(product(2..7), 5040, "7!"); is(product(), 1, "0!");=endtest reduce

A properly general version of `reduce()` gets an additional argument,
which says that the function should return when given an empty list as
its argument. In the programming literature, the properly general
version of `reduce()` is more typically called `fold()`:

sub fold { my $f = shift; my $fold; $fold = sub { my $x = shift; sub { return $x unless @_; my $first = shift; $fold->($f->($x, $first), @_) } } }

Eliminating the recursion yields:

sub fold { my $f = shift; sub { my $x = shift; sub { my $r = $x;=test foldwhile (@_) {$r = $f->($r, shift());}return $r;} } }

do 'fold'; do 'gcd'; my $gcdf = fold(\&gcd)->(0); is($gcdf->(6,9),3); is($gcdf->(7,5),1); is($gcdf->(9,81,15),3); is($gcdf->(9,81,15,2),1); is($gcdf->(9,81,18,27),9);=endtest fold

### Boolean operators

Back in Section ??? we saw a system that would search backwards through a log file looking for records that matched a simple query. To extend this into a useful database system, we need to be able to combine simple queries into more complex ones.

Let's suppose that `$a` and `$b` are iterators that will produce
data items that match queries `A` and `B`, respectively. How can we
manufacture an iterator that matches the query A or B?

One way we could do this is to interleave the elements of `$a` and
`$b`:

sub interleave { my ($a, $b) = @_; return sub { my $next = $a->(); unless (defined $next) { $a = $b; $next = $a->(); } ($a, $b) = ($b, $a); $next; } }=test interleave

do 'interleave'; sub upto { my ($m, $n) = @_; return sub { return $m <= $n ? $m++ : undef; }; } my $i1 = upto(1,3); my $i2 = upto(4,6); my $i = interleave($i1,$i2); for (qw(1 4 2 5 3 6)) { is($i->(),$_); } # this should be the end of stream, but it's returning a CODEref # instead. $i->()->() == undef. Is this just a sign of "the # interleaved outputs including some records (the end) more than # once?" is($i->(),undef);=endtest interleave

But this has the drawback that if the record sets produced by `$a`
and `$b` happen to overlap, the interleaved outputs will include some
records more than once.

We can do better if we suppose that the records will be produced in
some sort of canonical order. This assumption isn't unreasonable.
Typically, a database will have a natural order dictated by the
physical layout of the information on the disk and will always produce
records in this natural order, at least until the data is modified.
For example, our program for searching the web log file always
produces matching records in the order they appear in the file. Even
DBM files, which don't appear to keep records in any particular order,
have a natural order; this is the order in which the records will be
generated by the `each()` function.

Supposing that `$a` and `$b` will produce records in the same order,
we can perform an 'or' operation as follows:

Download code for `Iterator_Logic.pm`

package Iterator_Logic; use base 'Exporter'; @EXPORT = qw(i_or_ i_or i_and_ i_and i_without_ i_without); sub i_or_ { my ($cmp, $a, $b) = @_; my ($av, $bv) = ($a->(), $b->()); return sub { if (! defined $av && ! defined $bv) { return } elsif (! defined $av) { $rv = $bv; $bv = $b->() } elsif (! defined $bv) { $rv = $av; $av = $a->() } else { my $d = $cmp->($av, $bv); if ($d < 0) { $rv = $av; $av = $a->() } elsif ($d > 0) { $rv = $bv; $bv = $b->() } else { $rv = $av; $av = $a->(); $bv = $b->() } } return $rv; } } use Curry; BEGIN { *i_or = curry(\&i_or_) }

`i_or_()` gets a comparator function, `$cmp`, which defines the
canonical order, and two iterators, `$a` and `$b`. It returns a new
iterator which returns the next record from either `$a` or `$b` in
the canonical order. If `$a` and `$b` both produce the same record,
the duplicate is discarded. It begins by kicking `$a` and `$b` to
obtain the next record from each. If either is exhausted, it returns
the record from the other; if both are exhaused, it returns `undef`
to indicate that there are no more records. `$rv` holds the record
that is to be the return value.

If both input iterators produce records, the new iterator compares the
records to see which should come out first. If the comparator returns
zero, it means the two records are the same, and only one of them
should be emitted. `$rv` is assigned one of the two records, as
appropriate, and then one or both of the iterators is kicked to
produce new records for the next call.

The logic is very similar to the `merge()` function of
Section ???. In fact, `merge()` is the stream analog of
the 'or' operator.

`i_or()` is a curried version of `i_or_()`, called like this:

BEGIN { *numeric_or = i_or { $_[0] <=> $_[1] }; *alphabetic_or = i_or { $_[0] cmp $_[1] }; } $event_times = numeric_or($access_request_times, numeric_or($report_request_times, $server_start_times));

'and' is similar:

=contlisting Iterator_Logic.pmsub i_and_ { my ($cmp, $a, $b) = @_; my ($av, $bv) = ($a->(), $b->()); return sub { my $d; until (! defined $av || ! defined $bv || ($d = $cmp->($av, $bv)) == 0) { if ($d < 0) { $av = $a->() } else { $bv = $b->() } } return unless defined $av && defined $bv; my $rv = $av; ($av, $bv) = ($a->(), $b->()); return $rv; } } BEGIN { *i_and = curry \&i_and_ }=test and-or

use Curry; use Iterator_Logic; my @a = (2, 3, 5, 7, 11, 13, 17); my @b = (1, 2, 3, 4, 5, 6, 7); my (@and, @or); { my %count; for (@a, @b) { $count{$_}++ } @and = grep $count{$_}==2, sort { $a <=> $b } keys %count; @or = grep $count{$_}!=0, sort { $a <=> $b } keys %count; } print "# and: @and\n"; print "# or: @or\n"; sub l2i { my @a = @_; my $i = 0; return sub { $a[$i++]; }; } { my $and = i_and(sub { $_[0] <=> $_[1] }, l2i(@a), l2i(@b)); my $andf = i_and(sub { $_[0] <=> $_[1] }); my $andc = $andf->(l2i(@a), l2i(@b)); for (@and) { is($and->(), $_, "and uncurried"); is($andc->(), $_, "and curried"); } is($and->(), undef, "and uncurried exhausted"); is($andc->(), undef, "and curried exhausted"); } { my $or = i_or(sub { $_[0] <=> $_[1] }, l2i(@a), l2i(@b)); my $orf = i_or(sub { $_[0] <=> $_[1] }); my $orc = $orf->(l2i(@a), l2i(@b)); for (@or) { is($or->(), $_, "or uncurried"); is($orc->(), $_, "or curried"); } is($or->(), undef, "or uncurried exhausted"); is($orc->(), undef, "or curried exhausted"); }=endtest and-or

TOP