ETOOBUSY 🚀 minimal blogging for the impatient
How much is rindex used?
TL;DR
How much is rindex used?
Let’s admit it: I rarely use index. Even when I could, I often just summon a regular expression and call it a day.
On the other hand, if we’re just looking for a substring inside a string, the regular expression is overkill and a plain search with index would suffice (I touched upon this button in recent post Literal string 0 is false in Perl).
So yes, I’m aware of index and remember about it when the specific need arises… sometimes. I’ve been much less aware about rindex, the elusive sibling that looks for the substring starting from the end.
Sometimes, though, it’s just what’s needed, right? Like when thinking about
a possible implementation for an endswith()
function that checks –well–
if a string ends with another string.
My recent discover [String::Util][] currently has an implementation that’s based on index:
sub endswith {
my ($str, $substr) = @_;
if (!defined($str)) {
return undef;
}
if (!$substr) {
$substr = $str;
$str = $_;
}
my $len = length($substr);
my $start = length($str) - $len;
my $ret = index($str, $substr, $start) != -1;
return $ret;
}
So I wondered whether an alternative implementation based on rindex would make sense, not so much from a functional point of view (of course you can code something that works with index, the code above is a demonstration) or even most so-called non-functional points of view (like robustness, etc.), but rather from an efficiency point of view.
So… I decided to whip up a quick, arbitrary and unscientific Benchmark, to get the gist of it:
#!/usr/bin/env perl
use v5.24;
use warnings;
use experimental 'signatures';
use Benchmark qw< :all >;
my @stuff = (
[ qw< foobar foo > ],
[ qw< barfoo foo > ],
[ qw< barfoo foobarbaz > ],
[ 'bar' . ('foo' x 100), ('foo' x 100) ],
[ ('bar' x 100) . 'foo', 'foo' ],
);
cmpthese(
-2,
{
original => sub { endswith_original($_->@*) for @stuff },
rindex => sub { endswith_rindex($_->@*) for @stuff },
},
);
sub endswith_original {
my ($str, $substr) = @_;
if (!defined($str)) {
return undef;
}
if (!$substr) {
$substr = $str;
$str = $_;
}
my $len = length($substr);
my $start = length($str) - $len;
my $ret = index($str, $substr, $start) != -1;
return $ret;
} ## end sub endswith_original
sub endswith_rindex {
my ($str, $substr) = @_;
return unless defined $str;
($str, $substr) = ($_, $str) unless defined $substr;
my $target = length($str) - length($substr);
return ($target >= 0) && rindex($str, $substr) == $target;
}
It seems to have an edge with these test cases:
$ perl string-util.pl
Rate original rindex
original 595595/s -- -42%
rindex 1032777/s 73% --
I’m not sure why, though. The endwith_original
implementation takes care
to run the index search from the right $start
point, and the two
functions do more or less the same operations (similar test, the both
calculate the $start
/$target
position where the substring is supposed to
start from, then invoke their respective incantation).
The edge increases with longer strings, e.g. changing 100
to 1000
everywhere in the code above leads us to this:
perl string-util.pl
Rate original rindex
original 168048/s -- -72%
rindex 601510/s 258% --
My wild guess is that index might not take advantage of the length
of the substring it’s looking for, so it surely $start
s from the right
spot, but then continues to look for it starting from the following
character, then the following one, etc. up to the end of the main string.
Then, of course, I didn’t address the elephant in the room, i.e. the negative case. Let’s restrict our test to the following one:
[ ('bar' x 1000), 'foo' ],
As expected, rindex is the least efficient here, because it sweeps
the whole string (almost, I think) while index’s $start
allows
dismissing the test very quickly:
$ perl string-util.pl
Rate rindex original
rindex 990720/s -- -77%
original 4230914/s 327% --
At the end of the day, I start wondering whether index/rindex are
actually the right tools for the job when we need to do tests like
startswith
and endswith
: they’re prone to spend a lot of time looking
for something in the wrong place!
So, let’s do two things:
- settle for an arbitrary but mixed case, with both types of test;
- add another contender to the mix, based on
substr
.
# ...
my @stuff = (
[ ('bar' x 1000) . 'foo', 'foo' ],
[ ('bar' x 1000) . 'BAR', 'foo' ],
)
# ...
sub endswith_substr {
my ($str, $substr) = @_;
return unless defined $str;
($str, $substr) = ($_, $str) unless defined $substr;
my $target = length($str) - length($substr);
return ($target >= 0) && substr($str, -$target) eq $substr;
}
This new contender is surely valid, although not a clear winner:
$ perl string-util.pl
Rate rindex substr original
rindex 854434/s -- -56% -61%
substr 1934041/s 126% -- -11%
original 2178605/s 155% 13% --
My other wild guess is that substr
suffers from doing a copy of the
whole string, which takes time.
So what gives? I take two things out if it:
- index (and possibly rindex, when used in a
startswith
implementation) shows a behaviour that is suspicious to me. If my first wild guess is right, there’s probably some space for improvement (disclaimer: I used version 5.32.1, maybe it’s been already optimized!) - I would be curious to code a XS version that does the exact test we’re after!
All in all it’s been an interesting ride, stay safe!