ETOOBUSY đ minimal blogging for the impatient
Resuming Data::Tubes
TL;DR
I (re)used Data::Tubes to address an issue.
Iâm helping a friend deal with an issue regarding some not-so-complicated data handling and one of the files that we had to parse has a 80-ish format with fixed fields in specific lines.
The entire format is according to an Italian specification about the so-called CBI format. The bottom line is that there areâŚ
- ⌠records that are formed by multiple lines, each with its own type and data
- ⌠lines have a fixed 120-characters-long format, where characters means bytes.
Itâs been an occasion for me to resume Data::Tubes and see if past me has been gentle with present me. As always, docs can be enhanced but overall I had a good experience with getting up to speed again after forgetting almost everything (apart from the underlying generic philosophy).
The main pipeline I came up with is the following:
my $p = pipeline(
'Source::open_file',
['Reader::by_line', emit_eof => 1],
line_tracker(),
\&parse_cbi_line,
cbi_aggregator(),
[
'Plumbing::dispatch',
key => 'type',
factory => sub { return \&nop }, # ignore stuff by default
handlers => {
disposizione => \&aggregate_disposition,
},
],
sub ($r) { $r->{info} }, # keep info field only for output
{tap => 'array'},
);
This pipeline expects to receive a file name as an input and start processing it:
- input files are opened;
- each file is read by the line - this âexplodesâ each input record (an opened file) into multiple output records (one per line);
- the
line_tracker
only serves the purpose of tracking the line number, should it be necessary (e.g. if nobody asked for it) - the
cbi_aggregator
takes multiple input records - corrisponding to the different pieces of information for a record - and coalesces them into a single record per operation (comprising multiple input parts) - the plumbing part is to make sure that we only keep what weâre really interested into (full records pertaining a specific operation, not every possible one) while ignoring the rest (the factory by default ignores the record)
- the last tube gets rid of most of the âhousekeeping stuffâ and retains
the interesting part of the record, that is whatever has been
collected into
info
. - Last, weask the pipeline to return an array instead of throwing everything down the sink.
There are a few rough edges here and there in the docs - e.g. it would be good if I documented a bit better what comes in and out of some functions, and add more examples about their intended usage. All in all, anyway, I have to thank past me for making it easy for me to use Data::Tubes đ
Stay safe and have fun, folks!