REX - Shallow XML parsing

TL;DR

Shallow parsing of XML with a regular expression.

Navingating around I stumbled upon REX, an interesting article about parsing XML with regular expressions. (By the way, it’s used in XML::Parser::REX).

The first thing that was quite interesting to me was that this is an article from 1998. It’s still online. This is, in itself, amazing.

The second thing that struck me was this:

The syntax of Extensible Markup Language (XML) is simple enough that it is possible to contemplate lightweight XML parsers based entirely on regular expression technology. This is no accident; the ease of writing programs to process XML documents was an important design goal for XML […].

Up to now, I only knew the mantra don’t use regular expressions to parse HTML/XML. So it’s true that many times, after having learned a rule… you have to unlearn it!

Well, of course the kind of caveat above was about trying to do too much with a regular expression, which REX does not… but still!

I took the Perl version and adapted it to work as an iterator:

use 5.024;
use warnings;
use experimental qw< signatures >;
no warnings qw< experimental::signatures >;

# REX/Perl 1.0 
# Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
# Technical Report TR 1998-17, School of Computing Science, Simon Fraser 
# University, November, 1998.
# Copyright (c) 1998, Robert D. Cameron. 
# The following code may be freely used and distributed provided that
# this copyright and citation notice remains intact and that modifications
# or additions are clearly identified.

sub xml_it ($input) {
   state $TextSE = "[^<]+";
   state $UntilHyphen = "[^-]*-";
   state $Until2Hyphens = "$UntilHyphen(?:[^-]$UntilHyphen)*-";
   state $CommentCE = "$Until2Hyphens>?";
   state $UntilRSBs = "[^\\]]*](?:[^\\]]+])*]+";
   state $CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>";
   state $S = "[ \\n\\t\\r]+";
   state $NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
   state $NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
   state $Name = "(?:$NameStrt)(?:$NameChar)*";
   state $QuoteSE = "\"[^\"]*\"|'[^']*'";
   state $DT_IdentSE = "$S$Name(?:$S(?:$Name|$QuoteSE))*";
   state $MarkupDeclCE = "(?:[^\\]\"'><]+|$QuoteSE)*>";
   state $S1 = "[\\n\\r\\t ]";
   state $UntilQMs = "[^?]*\\?+";
   state $PI_Tail = "\\?>|$S1$UntilQMs(?:[^>?]$UntilQMs)*>";
   state $DT_ItemSE = "<(?:!(?:--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name(?:$PI_Tail))|%$Name;|$S";
   state $DocTypeCE = "$DT_IdentSE(?:$S)?(?:\\[(?:$DT_ItemSE)*](?:$S)?)?>?";
   state $DeclCE = "--(?:$CommentCE)?|\\[CDATA\\[(?:$CDATA_CE)?|DOCTYPE(?:$DocTypeCE)?";
   state $PI_CE = "$Name(?:$PI_Tail)?";
   state $EndTagCE = "$Name(?:$S)?>?";
   state $AttValSE = "\"[^<\"]*\"|'[^<']*'";
   state $ElemTagCE = "$Name(?:$S$Name(?:$S)?=(?:$S)?(?:$AttValSE))*(?:$S)?/?>?";
   state $MarkupSPE = "<(?:!(?:$DeclCE)?|\\?(?:$PI_CE)?|/(?:$EndTagCE)?|(?:$ElemTagCE)?)";
   state $XML_SPE = "$TextSE|$MarkupSPE";

   return sub {
      $input =~ m{\G($XML_SPE)}cg or return;
      return $1;
   };
}

All the parts of the regular expression are declared as state variable to avoid re-defining them over and over. The function returns an iterator that returns the next portion at each call, so it can be used like this:

my $xml = <<'END';
<?xml version="1.0" encoding="UTF-8"?>
<SoftwareEngineer>
<empl id="01">
<name>
<projectname> Man-router</projectname>
<Workingdomain> machine learning</Workingdomain>
</name>
<Enddate>
<entities><![CDATA[
This is the local project with the fibre optics.
All the statistical manipulation is performed. Example. '"&<> and submission date 12/12/2020
]]></entities>
</Enddate>
</empl>
<whatever/>
</SoftwareEngineer>
END

my $it = xml_it($xml);
my $n = 0;
while (defined(my $element = $it->())) {
   next unless $element =~ m{\S}mxs;
   $element =~ s{^}{   | }gmxs;
   substr $element, 0, 3, sprintf '%3d', ++$n;
   say $element;
}

This gives us the following (note that we’re skipping all spaces-only captures):

  1| <?xml version="1.0" encoding="UTF-8"?>
  2| <SoftwareEngineer>
  3| <empl id="01">
  4| <name>
  5| <projectname>
  6|  Man-router
  7| </projectname>
  8| <Workingdomain>
  9|  machine learning
 10| </Workingdomain>
 11| </name>
 12| <Enddate>
 13| <entities>
 14| <![CDATA[
   | This is the local project with the fibre optics.
   | All the statistical manipulation is performed. Example. '"&<> and submission date 12/12/2020
   | ]]>
 15| </entities>
 16| </Enddate>
 17| </empl>
 18| <whatever/>
 19| </SoftwareEngineer>

This explains why it’s shallow: you “only” get the sequence of elements, it will be up to you to make sense of it afterwards (e.g. to build a tree out of it).

Interesting anyway!


Comments? Octodon, , GitHub, Reddit, or drop me a line!