TL;DR

It’s Perl Weekly Challenge time again, now on issue #081 TASK #1.

I feel like the challenges can vary a lot in difficulty. I just don’t know if it’s me (like: I complicate my life with some solutions) or they are indeed of very different difficulty.

The challenge

You are given 2 strings, $A and $B. Write a script to find out common base strings in $A and $B. A substring of a string $S is called base string if repeated concatenation of the substring results in the string.

The questions

The examples that follow the challenge make it pretty clear that the challenge requires every possible base string, not just the shortest one. So my first question would be… really? I’d ask because this makes this challenge seem like it’s two challenges. But I digress.

Other than that, I’d only be curious whether there is a specific order that the solutions should be provided back. It seems to be “by length” from one example, but it might just as well be random.

It’s fair to assume that the two strings are already in memory, so that should not be an issue. It might be worth to ask if it’s OK to create sub-copies of the strings or, for example, if they’re too long and it’s better to avoid.

The solution

The first instict is to go with the brute force. At least… most of the times, it’s surviving. Get the job done, then evaluate. Which means that this comes to mind:

sub common_bases_brute_force ($A, $B) {
  my ($lA, $lB) = (length($A), length($B));
  ($A, $B, $lA, $lB) = ($B, $A, $lB, $lA) if $lA > $lB;
  my @retval;
  CANDIDATE:
  for my $l (1 .. int($lA / 2), $lA) {
     next CANDIDATE if ($lA % $l) || ($lB % $l);
     my $base = substr $A, 0, $l;
     for my $s ($A, $B) {
        next CANDIDATE if $s ne $base x (length($s) / $l);
     }
     push @retval, $base;
  }
  return @retval;
}

The gist is: iterate over all possible base strings (i.e. all possible lengths for a base string) and check them against both strings.

Then you have a few refinements:

inputs are swapped to make $A be shorter, or at most the same length as $B (line 3). This is because it makes sense to look for bases for the shorter ones in the first place;
we rule out all lengths that are more than one half of the shortest string, except for the whole string. This is the sense of the values for the iteration in line 6, which goes through all possible base string lengths;
if the length of either string is not a multiple of the candidate base $l we don’t even bother checking (line 7);
if any of the two string is not generated by the candidate base we skip to the next candidate (line 10)

If all checks are fine, we get the base and move on. Yay!

Then, of course, the lurking monster gets into the stage. Is this really needed? Isn’t there a better, more efficient way?

The answer is… depends. If the input strings are short and the test is performed rarely, who cares? The solution above is so easy that anything else would be a waste of precious programmer cycles. It’s even so easy that it documents itself (more or less).

But, for sake of discussion… let’s say we want to complicate our lives. Then we could think something like this:

first, find the minimal base string for both;
if these two differ, don’t bother moving forward;
otherwise, find all possible repetitions of this minimal common base that would fit both strings. This can be done without string comparisons, only by playing with numbers.

Fact is that I suspect this might be more efficient, but I didn’t bother doing any benchmark! Anyway, let’s look at it.

sub proper_factors ($n) { grep { $n % $_ == 0} (2 .. int($n/2))}

sub min_period ($string) {
  my $n = length $string;

  CANDIDATE:
  for my $k (1, proper_factors($n)) {
     my $m = $n / $k; # sub-sequences we have to test
     for my $i (0 .. $k - 1) { # sub-sequence iterator
        my $char = substr $string, $i, 1;
        for my $s (1 .. $m - 1) { # sequence iterator
           next CANDIDATE if $char ne substr $string, $k * $s + $i, 1;
        }
     }
     # yay!
     return $k;
  }

  # nothing found, minimum period is the string's length
  return $n;
}

We start with a function to find the minimal base string, which corresponds to finiding the minimum period of a string. This is quite similar to the previous brute force attack:

we iterate over 1 or proper factors only (line 7);
we do a check that the length indeed yields a base string (lines 9 to 14);
we return the length of the whole string if nothing better was found in the iteration (line 20).

This function min_period can now be used to find (if any) the minimal common base:

sub min_common_base ($A, $B) {
  my $pA = min_period($A);
  my $pB = min_period($B);
  return if $pB != $pA; # they must be equal
  my $candidate = substr $A, 0, $pA;
  return $candidate if $candidate eq substr $B, 0, $pB;
  return;
}

This is just a fancy way to say: these two minimum bases must be equal or there’s no deal.

Now we’re ready for the higher level function:

sub common_bases ($A, $B) {
  defined(my $b = min_common_base($A, $B)) or return;

  my $l = length $b;
  my ($rA, $rB) = map {length($_) / $l} ($A, $B);
  ($rA, $rB) = ($rB, $rA) if $rA > $rB;

  return map { $rB % $_ ? () : $b x $_ } (1, proper_factors($rA), $rA);
}

If the two strings have no minimal common base, no need to search further (line 2).

If they do have the same minimal common string, though, we iterate over all possible arrangement that would fit the shorter one and check them against the longer one. This can be done by simply looking at the lengths of the strings, because we know that any base string must fit into the generated string an integral number of times.

So we first find the number of repetitions that fit the minimal common base into both $A and $B (line 5), then we make sure that we work on the shorter one (line 6, swaps the two repetition numbers to make sure that $rA is the lower of the two).

Line 8 is an iteration over all possible candidates according to $A/$rA (i.e. all factors of $rA, both proper and improper) and check if they would fit also $B (by a check of divisibility).

For very long strings… this hopefully goes faster than the brute force attack!

If you want to play with the whole script…

#!/usr/bin/env perl
use 5.024;
use warnings;
use experimental qw< postderef signatures >;
no warnings qw< experimental::postderef experimental::signatures >;

sub proper_factors ($n) { grep { $n % $_ == 0} (2 .. int($n/2))}

sub min_period ($string) {
   my $n = length $string;

   CANDIDATE:
   for my $k (1, proper_factors($n)) {
      my $m = $n / $k; # sub-sequences we have to test
      for my $i (0 .. $k - 1) { # sub-sequence iterator
         my $char = substr $string, $i, 1;
         for my $s (1 .. $m - 1) { # sequence iterator
            next CANDIDATE if $char ne substr $string, $k * $s + $i, 1;
         }
      }
      # yay!
      return $k;
   }

   # nothing found, minimum period is the string's length
   return $n;
}

sub min_common_base ($A, $B) {
   my $pA = min_period($A);
   my $pB = min_period($B);
   return if $pB != $pA; # they must be equal
   my $candidate = substr $A, 0, $pA;
   return $candidate if $candidate eq substr $B, 0, $pB;
   return;
}

sub common_bases ($A, $B) {
   defined(my $b = min_common_base($A, $B)) or return;

   my $l = length $b;
   my ($rA, $rB) = map {length($_) / $l} ($A, $B);
   ($rA, $rB) = ($rB, $rA) if $rA > $rB;

   return map { $rB % $_ ? () : $b x $_ } (1, proper_factors($rA), $rA);
}

sub common_bases_brute_force ($A, $B) {
   my ($lA, $lB) = (length($A), length($B));
   ($A, $B, $lA, $lB) = ($B, $A, $lB, $lA) if $lA > $lB;
   my @retval;
   CANDIDATE:
   for my $l (1 .. int($lA / 2), $lA) {
      next CANDIDATE if ($lA % $l) || ($lB % $l);
      my $base = substr $A, 0, $l;
      for my $s ($A, $B) {
         next CANDIDATE if $s ne $base x (length($s) / $l);
      }
      push @retval, $base;
   }
   return @retval;
}

for my $input (
   ['abcdabcd', 'abcdabcdabcdabcd'],
   ['aaa', 'aa'],
){
   say '(', join(', ', map {qq{"$_"}} common_bases($input->@*)), ')';
}

I guess this is it!

ETOOBUSY 🚀 minimal blogging for the impatient

PWC081 - Common Base String

The challenge

The questions

The solution