TL;DR

Another look at parsing XML… this time with more meat inside.

In yesterday’s post REX - Shallow XML parsing we saw an interesting regular expression that enables the creation of an XML parser.

Of course it’s due to ask ourselves: why on earth?!?

One advantage is of course the avoidance of any external library, which might be tricky in some environments (luckly less and less so, thanks to the spread of containers). But still.

Anyway, had I to do some fatpack-able stuff including an XML parser without pretense of validation, my go-to module would most probably be Mojo::DOM, from Mojolicious (or even its standalone counterpart Mojo::DOM58, had I only to do XML parsing).

Parsing is quite straightforward, let’s plagiarizAHEMreuse the SYNOPSIS:

use Mojo::DOM;
 
# Parse
my $dom = Mojo::DOM->new('<div><p id="a">Test</p><p id="b">123</p></div>');
 
# Find
say $dom->at('#b')->text;
say $dom->find('p')->map('text')->join("\n");
say $dom->find('[id]')->map(attr => 'id')->join("\n");
 
# Iterate
$dom->find('p[id]')->reverse->each(sub { say $_->{id} });
 
# Loop
for my $e ($dom->find('p[id]')->each) {
  say $e->{id}, ':', $e->text;
}
 
# Modify
$dom->find('div p')->last->append('<p id="c">456</p>');
$dom->at('#c')->prepend($dom->new_tag('p', id => 'd', '789'));
$dom->find(':not(p)')->map('strip');
 
# Render
say "$dom";

Not only you get a proper structure out of the XML text, you also get all the handy facilities of DOM visiting. How convenient!

Let’s try it to the other post’s XML fragment:

use 5.024;
use warnings;
use Mojo::DOM;

my $xml = <<'END';
<?xml version="1.0" encoding="UTF-8"?>
<SoftwareEngineer>
<empl id="01">
<name>
<projectname> Man-router</projectname>
<Workingdomain> machine learning</Workingdomain>
</name>
<Enddate>
<entities><![CDATA[
This is the local project with the fibre optics.
All the statistical manipulation is performed. Example. '"&<> and submission date 12/12/2020
]]></entities>
</Enddate>
</empl>
<whatever/>
</SoftwareEngineer>
END

my $dom = Mojo::DOM->new->xml(1)->parse($xml);
say '<', $dom->find('projectname')->[0]->all_text, '>';

It prints:

< Man-router>

Well… it works!