ETOOBUSY 🚀 minimal blogging for the impatient
Variables, loops, and redirections
TL;DR
Sometimes variables in a shell can bite you when used in loops with redirections.
My colleague A. uses the shell to solve actual problems (as opposed to using it as an excuse to write blog posts 🙄) and sometimes comes up with interesting questions.
In the last few days, he was bitten by the following classic issue:
generate_data() { printf 'yadda\nyadda\nyadda\n' ; }
count=0
generate_data | while read input ; do
    count=$((count + 1))
    # do something with $input...
done
printf 'count = <%d>\n' "$count"
In many shells, the printf in the last line will print:
count = <0>
How come it’s not 3? I.e. how come it was not incremented in the while loop?
A choice of multiple processes
The key to understanding what’s going on is that the output of generate_data is fed into the while loop using a pipe operator (i.e. the | character):
generate_data | while read input ; do
In general, this can be expressed as:
left_command | right_command
To implement this, the shell spawns sub-shells, so that the two sides of the pipe can run at the same time.
At this point, it’s up to the actual implementation of the shell to decide whether one of the two parts is kept in the current shell. In bash (with its default settings) and dash, the right_command side always ends up in a sub-shell, so in our example the while loop is executed inside a sub-shell. Shells like ksh and zsh keep the last command of the pipe in the current shell instead, which is why the result depends on the shell.
As a consequence, the count variable that is initialized before the pipe of commands is copied into the sub-shell of the while, but after this copy there are two count variables and they are not connected any more.
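The same disconnection can be observed without any pipe, by asking for a sub-shell explicitly with parentheses. This is only a minimal sketch to illustrate the copy; the value 42 is arbitrary:

```shell
count=0

# The parentheses run the enclosed commands in a sub-shell, which starts
# with a copy of count and changes only that copy.
( count=42 ; printf 'inside  = <%d>\n' "$count" )

# Back in the main shell, the original variable is untouched.
printf 'outside = <%d>\n' "$count"
```

This prints 42 for the inside line and 0 for the outside one.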
When the pipe ends… the count variable inside the sub-shell executing the while loop is lost for good.
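To check how the shell at hand behaves, a small probe can help. This is a sketch under the assumption that the shell either runs the right side of the pipe in a sub-shell or in the current shell; the variable name probe is made up for the example:

```shell
probe=before

# read is the right side of the pipe: if it runs in a sub-shell, the
# assignment it performs is lost when the pipe ends.
printf 'after\n' | read -r probe

printf 'probe = <%s>\n' "$probe"
```

A shell that runs the right side in a sub-shell (like bash with default settings, or dash) prints before; a shell that keeps it in the current shell prints after.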
So… we have to look for alternatives.
Move count closer to the loop
One way to address this issue is to keep control over the count variable, making sure that the one we initialize remains the same as the one we increment in the loop and then print in output. Curly braces can help us keep all these things together:
generate_data | {
    count=0
    while read i ; do
        count=$((count + 1))
        # ...
    done
    printf 'count = <%d>\n' "$count"
}
This works fine if we can delimit the scope where we need to use the count variable, i.e. if we don’t need it later for some other reason.
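The limitation can be made visible with a small sketch that also keeps a count variable outside of the braces. The inside/outside labels are only there for the demonstration:

```shell
generate_data() { printf 'yadda\nyadda\nyadda\n' ; }

count=0
generate_data | {
    # Everything inside the braces shares the same count variable.
    while read -r line ; do
        count=$((count + 1))
    done
    printf 'inside  = <%d>\n' "$count"
}

# In shells that run the braces in a sub-shell, this is still the
# count initialized before the pipe.
printf 'outside = <%d>\n' "$count"
```

In bash (default settings) and dash the inside line reports 3 while the outside line still reports 0.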
A variant of this approach would be to put all the instructions inside a separate shell function; this has the added advantage of letting us be very descriptive as to what the expected scope of the count variable should be, by means of local:
process_data() {
    local count=0
    while read i ; do
        count="$((count + 1))"
        # ...
    done
    printf 'count = <%d>\n' "$count"
}
generate_data | process_data
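If the final value is needed back in the main shell, one possible workaround (not one of the techniques discussed in this post, just a sketch) is to let the function print the count and capture it with a command substitution. The function name count_lines is hypothetical:

```shell
generate_data() { printf 'yadda\nyadda\nyadda\n' ; }

# Hypothetical helper: counts the lines on standard input and prints
# the total on standard output.
count_lines() {
    n=0
    while read -r line ; do
        n=$((n + 1))
    done
    printf '%d\n' "$n"
}

# The whole pipe runs inside $( ... ), and its output lands in count.
count=$(generate_data | count_lines)
printf 'count = <%d>\n' "$count"
```

This prints count = <3>, at the cost of reserving the standard output of the function for the result.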
Move the loop closer to count
If the usage of count spans over multiple lines of code, possibly with other data taken as input, the technique in the previous section might not be helpful or easy to use.
So… if we can’t put the count variable in the sub-process, we might manipulate code to extract the while out of the sub-process, right?
One way to do this is to avoid the pipe completely and find a different way to feed the input of the while loop with the output from generate_data.
Here-documents can help us with this:
count=0
while read i ; do
    count=$((count + 1))
    # ...
done <<END
$(generate_data)
END
printf 'count = <%d>\n' "$count"
The idea is simple: redirecting the input of a command does not trigger running the command in a sub-shell. So this keeps the while in the same scope as the variable count that is initialized before it and printed after it.
The generate_data function, though, is called in a sub-shell this time: inside the here-doc there is a call with $( ... ) which does exactly this. Its output is expanded in the here-doc and then fed as input to the while loop. Job done!
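One corner case is worth keeping in mind with this approach: the here-doc always contains at least one line, even when the command inside $( ... ) prints nothing. The sketch below uses a hypothetical generate_nothing function to show it:

```shell
# Hypothetical producer that emits no output at all.
generate_nothing() { : ; }

count=0
while read -r line ; do
    count=$((count + 1))
done <<END
$(generate_nothing)
END

# The here-doc body is a single empty line, so the loop runs once.
printf 'count = <%d>\n' "$count"
```

This prints count = <1>, so an empty-input check inside the loop might be needed when the producer can legitimately be silent.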
Here-strings (bashisms)
If you’re using the bash shell, you can trim a few characters off the here-doc solution in the previous section by means of a here-string:
count=0
while read i ; do
    count=$((count + 1))
    # ...
done <<<"$(generate_data)"
printf 'count = <%d>\n' "$count"
It results in shorter and, in my opinion, easier to read code.
But if you really have bash, why not use…
Process substitution
In decently recent Linux releases (and many other operating systems, I guess!) it’s possible to leverage process substitution, which is probably a cleaner way to pass data than using here-documents or here-strings:
count=0
while read i ; do
    count=$((count + 1))
    # ...
done < <(generate_data)
printf 'count = <%d>\n' "$count"
Note that there is a space character between the two < characters, because the first one tells the shell to get the standard input for the while loop from a file (i.e. it is a plain, boring redirection operator) and the second one implements process substitution, materializing that input file. Neat!
One very interesting characteristic of process substitution is that it lets us turn many sub-commands into files at the same time, and feed all of them as input to a single command. As an example, this will work as one might expect:
diff <(command_1) <(command_2)
The diff command is fed two files that it will be able to open and read from.
Curious about these files? Let’s take a look at them:
$ print_args() { printf '1<%s>\n2<%s>\n' "$1" "$2"; }
$ print_args <(date) <(ls / | grep a)
1</dev/fd/63>
2</dev/fd/62>
So many ways to choose from
At this point, I guess I’m out of alternatives. There are surely a lot more - like… saving the output in a file and then consuming it afterwards with standard input redirection - but I guess that the alternatives above should fit in almost all situations.
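For completeness, the temporary file alternative might look like the sketch below. It assumes that the mktemp utility is available (it usually is, although it is not part of POSIX); the variable name tmpfile is made up for the example:

```shell
generate_data() { printf 'yadda\nyadda\nyadda\n' ; }

# Assumption: mktemp exists and prints the name of a new empty file.
tmpfile=$(mktemp) || exit 1
generate_data > "$tmpfile"

count=0
# Plain input redirection: the while loop stays in the current shell.
while read -r line ; do
    count=$((count + 1))
done < "$tmpfile"
rm -f "$tmpfile"

printf 'count = <%d>\n' "$count"
```

This prints count = <3>, but the data is fully written to disk before the loop starts consuming it.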
Stay safe and have a good day!