Home-brewn sets

TL;DR

Some benchmarks on different ways of implementing (little) sets.

A Set is (among other things):

an abstract data type that can store unique values, without any particular order.

(taken from the linked Wikipedia page).

In other terms, it holds a collection of items, each being “unique”, and is capable of answering questions like does this item belong to the set? and the like.

This is often very useful to have at hand, especially when you’re solving coding puzzles 🙄

My default go-to solution for implementing a set operation in Perl is by means of a hash. If we ignore the values, we basically have a useful impelementation of a set for strings:

my @items = qw< foo bar baz foo hey >;  # note: foo is duplicated!
my %set = map { $_ => 1 } @items;

say $_, ': ', $set{$_} ? 'inside' : 'outside'
    for qw< foo whatever hey you >;

The code above prints:

foo: inside
whatever: outside
hey: inside
you: outside

There can be a lot of variations, even by just using hashes. As an example, how are we going to initialize the hash with the elements? Two examples:

use constant NEVER  => 7;
use constant ALWAYS => 9;

...

sub by_hash (@input) {
   my %set = map { $_ => 1 } @input;
   $set{$input[rand @input]} or die 'fail in provided element';
   $set{+ALWAYS} or die 'fail on ALWAYS';
   $set{+NEVER} and die 'fail on NEVER';
}

sub by_hash2 (@input) {
   my %set;
   @set{@input} = (1) x @input;
   $set{$input[rand @input]} or die 'fail in provided element';
   $set{+ALWAYS} or die 'fail on ALWAYS';
   $set{+NEVER} and die 'fail on NEVER';
}

And more: is it better to use a check on the value, or use exists?

On the other hand… this flexibility might come to a cost. What if we need to check something more restricted, but a lot of times? We might remap our elements onto integers, and use arrays instead:

sub by_array (@input) {
   my @set;
   $set[$_] = 1 for @input;
   $set[$input[rand @input]] or die 'fail in provided element';
   $set[ALWAYS] or die 'fail on ALWAYS';
   $set[NEVER] and die 'fail on NEVER';
}

Is this better? What if we have so few elements that they can be mapped onto the bits of an integer?

sub by_bits (@input) {
   my $set = 0;
   $set |= 0x01 << $_ for @input;
   $set & (1 << $input[rand @input]) or die 'fail in provided element';
   $set & (1 << ALWAYS) or die 'fail on ALWAYS';
   $set & (1 << NEVER) and die 'fail on NEVER';
}

[Benchmark][] time!

           Rate   Hash2    Hash Hash SA Hash ESA  Array    Bits Array SA Bits SA
Hash2     226/s      --     -2%    -13%     -14%   -58%    -66%     -69%    -77%
Hash      231/s      2%      --    -11%     -12%   -57%    -65%     -68%    -77%
Hash SA   260/s     15%     13%      --      -1%   -52%    -61%     -64%    -74%
Hash ESA  262/s     16%     13%      1%       --   -52%    -60%     -64%    -74%
Array     544/s    141%    135%    109%     108%     --    -18%     -25%    -46%
Bits      660/s    192%    185%    153%     152%    21%      --      -9%    -34%
Array SA  724/s    220%    213%    178%     176%    33%     10%       --    -28%
Bits SA  1004/s    344%    334%    285%     283%    85%     52%      39%      --

(Full script here).

All functions are called based on the same inputs, which are generated automatically at the beginning and then reused over and over:

sub tests_iterator {
   state $vs = [ grep { $_ != NEVER } 0 .. 31 ];
   state $tests = [
      map {
         my $n = 3 + int(rand(27));
         [ shuffle(ALWAYS, map { $vs->[rand $vs->@*] } 0 .. $n) ];
      } 1 .. N_ARRANGEMENTS
   ];
   my $i = 0;
   return sub { return $i < N_ARRANGEMENTS ? $tests->[$i++]->@* : () };
}

Hash, Hash2, Array, and Bits are the functions above, wrapped into a driver function to call it multiple times with several inputs ($cb_name is the name of one of the functions described above):

sub wrap ($cb_name) {
   my $cb = __PACKAGE__->can($cb_name);
   return sub {
      my $ti = tests_iterator();
      while (my @input = $ti->()) {
         eval {
            $cb->(@input);
            1;
         } or do {
            warn "$cb_name: $EVAL_ERROR";
         }
      }
      return;
   }
}

As all these calling of the callback introduce their own overhead, we can also code standalone versions of those functions, just to see how it goes (and results show that it goes definitely better).

These are all the SA versions. As an example, the bit-based implementation and the winner in this round is the following:

sub by_bits_standalone {
   my $ti = tests_iterator();
   while (my @input = $ti->()) {
      my $set = 0;
      $set |= 0x01 << $_ for @input;
      $set & (1 << $input[rand @input]) or die 'fail in provided element';
      $set & (1 << ALWAYS) or die 'fail on ALWAYS';
      $set & (1 << NEVER) and die 'fail on NEVER';
   }
   return;
}

The Hash ESA is hash-based, with the explicit loop to avoid the callign overhead, but using exists instead of the value.

I’m not too surprised of the results, except maybe that I didn’t expect such a good performance from arrays.

           Rate   Hash2    Hash Hash SA Hash ESA  Array    Bits Array SA Bits SA
Hash2     226/s      --     -2%    -13%     -14%   -58%    -66%     -69%    -77%
Hash      231/s      2%      --    -11%     -12%   -57%    -65%     -68%    -77%
Hash SA   260/s     15%     13%      --      -1%   -52%    -61%     -64%    -74%
Hash ESA  262/s     16%     13%      1%       --   -52%    -60%     -64%    -74%
Array     544/s    141%    135%    109%     108%     --    -18%     -25%    -46%
Bits      660/s    192%    185%    153%     152%    21%      --      -9%    -34%
Array SA  724/s    220%    213%    178%     176%    33%     10%       --    -28%
Bits SA  1004/s    344%    334%    285%     283%    85%     52%      39%      --

It’s also interesting that the standalone versions underline the differences in performance. This might be explained by the fact that more efficient functions are called more times, so they suffer from the performance hit more than their counterparts; when we remove this overhead, there’s more time to do more iterations on what we’re measuring.

I this this is everything for today, stay safe folks!


Comments? Octodon, Twitter, GitHub, Reddit, or drop me a line!