ETOOBUSY 🚀 minimal blogging for the impatient
Ordeal::Model::Parser: parsing generics
TL;DR
Some generic parsing facilities, just for starters.
Before delving into transforming the grammar described in
Ordeal::Model::Parser: grammar into code, it’s useful to introduce a
few generic parsing utilities that make it easier to deal with
alternations (i.e. the | character in the grammar), optional multiple
instances (i.e. the star * operator), and so on.
The standardized parsing function interface
Functions instrumental to parsing all share the same interface:
sub parsing_function ($reference_to_text) {
   ...;
   return if $not_successful;
   return \@array_of_collected_stuff;
}
i.e. they accept a single parameter, which is a reference to the text to
be parsed. This parameter is usually abbreviated as $rtext in the
functions.
The return value is supposed to be undef if the parsing is
unsuccessful, otherwise a reference to an array with the parsed stuff
inside (whatever this means for the specific parsing stage).
The generic functions described below and in the next post will usually be factory functions that return something compliant with the interface above.
Why pass a reference to the text?
To avoid copying text around, it’s useful to always refer to the same
string everywhere, hence the reference. This gives us the added benefit
of not having to chop text off the input as we go on (which would mean
more string copying!), because each string in Perl comes with a pos()
counter that tracks where the last regular expression match ended, under
some specific conditions (that we are going to meet!).
This also means that all operations on the text will be performed on
$$rtext because - well - it’s where the text lies.
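To see the pos() counter at work before we rely on it, here is a minimal sketch. The input string and the patterns are made up for the example; the only thing taken from the parser is the habit of working on $$rtext with the \G anchor and the /gc modifiers.

```perl
use strict;
use warnings;

my $text  = 'abc def';    # hypothetical input
my $rtext = \$text;       # a reference: the string itself is not copied

# pos() is undefined until something sets it
print defined(pos($$rtext)) ? "defined\n" : "undefined\n";    # undefined

# a successful \G match with /g moves pos() past the matched text
my $ok = $$rtext =~ m{\G[a-z]+}gc;
print pos($$rtext), "\n";    # 3

# a failed match with /c leaves pos() where it was
$ok = $$rtext =~ m{\G\d+}gc;
print pos($$rtext), "\n";    # 3

# pos() can also be assigned to directly, e.g. to skip the space
pos($$rtext) = 4;
print "$1\n" if $$rtext =~ m{\G([a-z]+)}gc;    # def
```

Nothing is ever removed from the string: the position inside it is the only thing that changes.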
Verbatim, exact text
One of the needs in a grammar is to be able to match verbatim text exactly. The following factory function takes the text we want to match, and gives us a parser that does exactly this:
 1 sub __exact ($what, @retval) {
 2    my $wlen = length $what;
 3    return sub ($rtext) {
 4       my $pos = pos($$rtext) // 0;
 5       return if length($$rtext) - $pos < $wlen;
 6       return if substr($$rtext, $pos, $wlen) ne $what;
 7       pos($$rtext) = $pos + $wlen;
 8       return [@retval];
 9    };
10 }
The code is boring and self-explanatory enough: to match a text exactly,
we have to ensure that we have enough characters left to analyze (line
5) and that they actually match (line 6). If the match is successful, we
advance the pos() marker of the string (line 7) and return whatever we
wanted to emit in association with the text (line 8).
As an example, this function might be called like this:
my $exact_parser = __exact('@', 'shuffle');
This $exact_parser complies with the parsing signature described
above, and does this:
- if the next character to be parsed is @ then it returns an anonymous array with the string shuffle as its only item;
- otherwise, it returns undef.
It is important to note one thing here: the returned function not only does parsing in the sense of validating the input according to the grammar, but also returns something associated with the parsed text. In this way, it can contribute to building the Abstract Syntax Tree that is, eventually, what we are interested in.
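Here is a small sketch of the parser at work. The factory is repeated as-is so that the snippet runs on its own; the input string '@3' is made up for the example, and the feature lines at the top are an assumption about how signatures get enabled (the post does not show that part).

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# same factory as above, repeated to keep the sketch self-contained
sub __exact ($what, @retval) {
    my $wlen = length $what;
    return sub ($rtext) {
        my $pos = pos($$rtext) // 0;
        return if length($$rtext) - $pos < $wlen;
        return if substr($$rtext, $pos, $wlen) ne $what;
        pos($$rtext) = $pos + $wlen;
        return [@retval];
    };
}

my $exact_parser = __exact('@', 'shuffle');

my $text   = '@3';                       # hypothetical input
my $first  = $exact_parser->(\$text);    # '@' is there: success
my $second = $exact_parser->(\$text);    # next character is '3': failure

print "first:  @$first\n";                                       # shuffle
print 'second: ', defined($second) ? 'matched' : 'undef', "\n";  # undef
print 'pos:    ', pos($text), "\n";                              # 1
```

The second call fails without touching pos(), so a different parser can be tried from the same position.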
Regular expressions
This factory function helps us match regular expressions:
1 sub __regexper ($rx) {
2    return sub ($rtext) {
3       my (undef, $retval) = $$rtext =~ m{\G()$rx}cgmxs or return;
4       return [$retval];
5    };
6 }
It’s a factory function, so as expected it returns a function (line
2). It accepts a regular expression $rx as input, so the returned
function will be different depending on the input regular expression.
The function basically takes care to match the regular expression from
the point where we arrived with the parsing. This is indicated by the
usage of the \G anchor inside the regular expression, as well as the
use of the /cg modifiers when calling the match. It’s basic Perl
regular expression trickery!
If you’re curious about why the undef and the empty capture group in
line 3, you can take a look at Global string matching quirks.
The return value is:
- undef in case the regular expression does not match,
- whatever the input regular expression wants captured otherwise. In
this case, the returned value is wrapped inside an anonymous array, so
that it can be undef itself but still be true to the caller.
Again, this return value can be considered as a leaf in the Abstract Syntax Tree.
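The three possible outcomes can be sketched like this. The factory is repeated so that the snippet runs on its own; the two parsers $integer and $comma, as well as the input string, are hypothetical and only serve the example.

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# same factory as above, repeated to keep the sketch self-contained
sub __regexper ($rx) {
    return sub ($rtext) {
        my (undef, $retval) = $$rtext =~ m{\G()$rx}cgmxs or return;
        return [$retval];
    };
}

# hypothetical parsers, made up for this example
my $integer = __regexper(qr{(\d+)});    # captures the digits
my $comma   = __regexper(qr{,});        # matches but captures nothing

my $text = '12,x';                      # hypothetical input
my $n    = $integer->(\$text);          # success, something captured
my $sep  = $comma->(\$text);            # success, nothing captured
my $fail = $integer->(\$text);          # failure: 'x' is not a digit

print "n:    $n->[0]\n";                                                    # 12
print 'sep:  ', defined($sep->[0]) ? 'captured' : 'matched, no capture', "\n";
print 'fail: ', defined($fail) ? 'matched' : 'undef', "\n";                 # undef
print 'pos:  ', pos($text), "\n";                                           # 3
```

Note how $sep is a true value (an array reference) even though the only item inside is undef: this is what lets the caller tell "matched but captured nothing" apart from "did not match".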
White space
In our grammar, we neglected whitespace, so we have to make sure to
ignore it too. Function __ews does exactly this:
sub __ews ($rtext) { return __ewsr()->($rtext) }
sub __ewsr { state $retval = __regexper(qr{\s+}) }
This shows the difference we talked about: __ews is something that
does parsing (note the signature), while __ewsr is something that
returns something that does the parsing (again, see how the result
from __ewsr is used inside __ews).
Note that we use __regexper to match the whitespace, but the regular
expression we pass does not contain any capturing - this means that
we’re simply discarding the spaces!
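A quick sketch of what this means in practice, again repeating the functions so that the snippet runs on its own. The input string is made up, and the feature lines at the top are an assumption about how signatures and state get enabled.

```perl
use strict;
use warnings;
use feature qw(signatures state);
no warnings 'experimental::signatures';

# same functions as above, repeated to keep the sketch self-contained
sub __regexper ($rx) {
    return sub ($rtext) {
        my (undef, $retval) = $$rtext =~ m{\G()$rx}cgmxs or return;
        return [$retval];
    };
}
sub __ews ($rtext) { return __ewsr()->($rtext) }
sub __ewsr { state $retval = __regexper(qr{\s+}) }

my $text  = "   foo";          # hypothetical input, three leading spaces
my $got   = __ews(\$text);     # spaces are consumed, nothing is kept
my $again = __ews(\$text);     # no spaces left at this position

print 'got:   ', defined($got->[0]) ? 'captured' : 'matched, no capture', "\n";
print 'pos:   ', pos($text), "\n";                              # 3
print 'again: ', defined($again) ? 'matched' : 'undef', "\n";   # undef
```

With the regular expression exactly as shown, at least one whitespace character is needed for a match, so calling __ews where there is no whitespace yields undef and leaves pos() alone.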
Stay tuned!
In the next post, we will look into the first generic facilities that combine parsers together, e.g. the alternation and the star operator. Stay tuned!
The whole series
Want to look at the other parts of this series? Here’s a list of them:
- Global string matching quirks is not strictly in the series, but it’s our first quest in Ordeal::Model::Parser and it’s possibly the only one giving a single piece of useful advice!
- A parser for Ordeal::Model is where the series starts, introducing the motivations for the parser package.
- Ordeal::Model::Parser: grammar introduces the grammar.
- Ordeal::Model::Parser: entry point discusses the package’s main entry point PARSE, which acts as a thin wrapper around _expression.
- Ordeal::Model::Parser: parsing generics deals with the starting generic helpers to build parsing functions.
- Ordeal::Model::Parser: parsing generics, 2 describes the hard stuff like sequences, alternations, and the star operator.
- Ordeal::Model::Parser: parsing gives an overview of the actual implementation of the grammar for Ordeal::Model.