ETOOBUSY š minimal blogging for the impatient
PWC098 - Read N-characters
TL;DR
A digression on TASK #1 from the Perl Weekly Challenge #098. Enjoy!
This post is probably longer than it should be, skip to the solution if you want to spare you a lot of ramblings.
The challenge
You are given file
$FILE
. Create subroutinereadN($FILE, $number)
returns the first n-characters and moves the pointer to the(n+1)th
character.
The questions
The first question this time was for myself: did you read it correctly? I initially thought that this challenge was a piece of cake, at least for any language with the notion of a filehand (or whatever their equivalent).
Then yes, I realized that the input across the calls is a file name, not a file handle. Which changes the game completely: it will be up to the function to track the state.
Which (at the very last!) brings me to the questions that should probably pop up in an interview:
- How to deal with errors?
- We will assume that throwing an exception is fine here.
- Is the file supposed to grow in time?
- We will assume that it might.
- What to do when we hit the end of file?
- We will return as many data as possible, possibly and empty string.
- What do we mean by character exactly? Is there any specific encoding
we should take care of?
- We will assume that characters are the same as bytesā¦ how convenient!
The solution
As already anticipated in the questions, we will have to keep track of where we are in the file for each of them. This is something that is usually done for us by Perl by means of the filehandle abstraction, which is also way more powerful as it allows for the same file to be opened (and tracked) multiple times.
But Iām digressing.
To keep state, since version 5.10 we can leverage the state
variable
declarator:
sub readN ($FILE, $number) {
state $at = {};
# ...
}
If you really want to go a lot back in time and avoid this (why should you want to do this?!?) you will of course have to ditch the function signatures and use a closed-on variable instead, like this:
my $at = {}; # it might be a full-fledged hash instead
sub readN { my ($FILE, $number) = @_;
# ...
}
This is suboptimal though, because the variable is then leaked to the rest of the program scope. Which usually means that itās better to protect the whole thing with a block:
{
my $at = {}; # it might be a full-fledged hash instead
sub readN { my ($FILE, $number) = @_;
# ...
}
}
But Iām digressing.
To track each fileās state, we might do several things, like:
- keep a filehandle for each file and use it when it is necessary. If no filehandle is present, then a new one will be opened;
- keep the character number were we are supposed to read next.
The first approach has the advantage of being conceptually simple, as
well as calling the open
function once per file. Is this of any help?
I honestly donāt know, but Iād say no. Moreover, re-opening the file
over and over does not really remove any readability, so no harm in not
using this approach.
On the other handā¦ this will allocate a file handle, which can generally be considered a scarce resource. Thereās a limit on how many you can have, although it can be set and moved. Hence, if we have a solution that can give us the same level of clarity and allow us to spare resources.
Which, arguably, might not really be the point here, because we are discussing a challenge. Unearthing these considerations might help understand the problem better in real life situation, anyway, so if a premature optimization is the root of all evil, blindly disregarding performance as a topic is equally evil in my opinion.
But Iām digressing.
Reading data from a file is usually done through the open/read/close in a normal Perl enthusiastās day. Well, actually through readline (in that normal day).
This time, anyway, itās better to go at a bit lower level, e.g. to avoid buffering of inputs. For this reason, weāll resort to the less-used sysopen/sysread/close triad to go straight to the facilities provided by the operating system.
This approach, though, has its consequences. We are assuming that
bytes are the same as characters here, because there seems to be no
indication of how to reliably infer the encoding from the file. Were we
to support e.g. UTF-8 encoding, then reading $number
characters would
be a (generally) different beast, and we would have to either figure out
how to use the facilities that Perl comes with, or re-implement the
decoding at some level.
But Iām digressing.
We decided to keep the ācurrentā location in the file as an integer offset from the start of the file. For this, we rely on another function that I rarely use, that is sysseek.
Both sysopen and sysseek require integer values to pass
additional options, e.g. the opening mode or from where to start
looking. For this reason, itās useful to use
module Fcntl and
import a couple of constants to help us refer to the right values
instead of hardwiring magic values in the code:
use Fcntl qw< O_RDONLY SEEK_SET >;
Itās a low-level interface, right? I can imagine that coders might have turned the interface to use names/symbols instead of integer values, but I guess this has been done for a couple of reasons:
- in a low-level interface, you use low-level tools and the common reference is language C;
- if we have to use a low-level interface, automating this part if probably the least of our problems.
For this reason, the Fcntl approach of providing constants at the
expense of a simple additional use
statement is fair.
But Iām digressing.
If you made it so far without hiring a hitman to kill me š , hereās my solution to this weekās challenge:
sub readN ($FILE, $number) {
state $at = {};
sysopen my $fh, $FILE, O_RDONLY or die "sysopen('$FILE'): $!\n";
sysseek $fh, $at->{$FILE} // 0, SEEK_SET;
my $retval = '';
my $n = sysread $fh, $retval, $number;
close $fh or die "close('$FILE'): $!\n";
die "sysread($FILE) \@$number: $!\n" if ! defined $n;
$at->{$FILE} += $n;
return $retval;
}
The file is opened, read and closed at every call as anticipated.
I have to admit that I added a check on the close too, which is something that I rarely do. Not much because Iām lazy (although I am indeed lazy), but because Iām always dubious about what an error in close should mean for my program. If the sysread was fineā¦ who cares? So, this line is actually more a cargo-cult bow to purity than something that I truly mean/understand in this case.
But Iām digressing. For the last time, I swear! š
The whole program, as usual. We default to the programās file as the input:
#!/usr/bin/env perl
use 5.024;
use warnings;
use experimental qw< postderef signatures >;
no warnings qw< experimental::postderef experimental::signatures >;
use Fcntl qw< O_RDONLY SEEK_SET >;
sub readN ($FILE, $number) {
state $at = {};
sysopen my $fh, $FILE, O_RDONLY or die "sysopen('$FILE'): $!\n";
sysseek $fh, $at->{$FILE} // 0, SEEK_SET;
my $retval = '';
my $n = sysread $fh, $retval, $number;
die "sysread($FILE) \@$number: $!\n" if ! defined $n;
$at->{$FILE} += $n;
return $retval;
}
my $highlight = "\e[1;97;45m"; my $reset = "\e[0m";
my $file = shift || __FILE__;
my @numbers = @ARGV ? @ARGV : qw< 4 5 2 >;
for my $n (@numbers) {
my $chunk = readN($file, $n);
say "got $n: $highlight$chunk$reset";
}
Stay safe!