TL;DR

Somewhat niche, but Text::Gitignore hits the nail right on the head when you need that kind of functionality.

In dibspack-basic (the companion to dibs) I wanted to include a utility function to copy files from a build phase, preparing them for the bundling phase.

If you don’t know what I’m talking about, a little recap: dibs is a utility to streamline producing Docker images, and my process usually takes two steps:

  • in the build step, I work in an image where I install all tools that support the build process, e.g. a compiler, development versions of libraries, ancillary tools, etc., without worrying too much about bloat;

  • in the bundle step, I strive to create the tightest image possible, only including artifacts that are strictly necessary at runtime.

Communication across these two steps happens through a shared cache directory where the build process saves compiled artifacts, and the bundle process takes them to the final destination.

If I start from a distribution e.g. on GitHub, I might not want to include everything in the distribution inside the final Docker image. As an example, the .git directory used by git to track files is not necessary; for Perl programs, a cpanfile/cpanfile.snapshot pair of files only makes sense at build time, not at runtime. Hence, for dibspack-basic I needed a mechanism that allowed me to exclude unwanted files/directories.

I was about to code something when it hit me: there must already be something on CPAN! And sure enough there is: Text::Gitignore. To be fair, it’s not the only one, but I tried it and was pretty happy with it, so I didn’t feel the need to evaluate anything else.

The module provides a basic but helpful function to create a matcher anonymous sub, i.e. a sub that accepts a file path and tells you whether or not it matches some patterns written according to the rules of a .gitignore file. This format is particularly attractive because it’s what git uses, so people should have no surprises when told to adopt the exact same approach.
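As a quick taste of the matcher, here is a minimal sketch. The patterns and the paths are made up for illustration, they are not taken from dibspack-basic; the only thing borrowed from the module is build_gitignore_matcher, called with a reference to an array of patterns.

```perl
use strict;
use warnings;
use Text::Gitignore 'build_gitignore_matcher';

# Illustrative patterns only; they are passed as an ARRAY REFERENCE.
my $matcher = build_gitignore_matcher(['.git', 'cpanfile*', '*.o']);

# The matcher is a plain sub: give it a path, get back a true/false value.
for my $path (qw( .git cpanfile cpanfile.snapshot main.o lib/App.pm )) {
   printf "%-18s %s\n", $path, $matcher->($path) ? 'ignored' : 'kept';
}
```

The matcher knows nothing about the filesystem: it only looks at the string it receives, so walking a tree is left to the caller.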

Example usage: traversing a directory tree

When you want to use it in a filesystem tree, you have to do some coding of your own, something like this (we’re relying upon Path::Tiny because it’s soooo useful):

 1| use Path::Tiny 'path';
 2| use Text::Gitignore 'build_gitignore_matcher';
 3| use constant IGNOREFILE => '.ignore';
 4| #...
 5| sub find_files {
 6|    my $root = path(shift);
 7|    my @matchers = @{shift || []};
 8| 
 9|    # if this directory has a .ignore file, load it and prepare to match
10|    # stuff
11|    my $ignore = $root->child(IGNOREFILE);
12|    if ($ignore->exists) {
13|       my $matcher = build_gitignore_matcher([$ignore->lines({chomp => 1})]);
14|       push @matchers, [$root, $matcher];
15|    }
16| 
17|    # now traverse the directory
18|    my @output;
19| CHILD:
20|    for my $child ($root->children) {
21| 
22|       # skip if matched by any matcher, either in the current $root
23|       # directory or any applicable parent traversed so far
24|       for my $pair (@matchers) {
25|          my ($matcher_root, $matcher) = @$pair;
26|          my $relative_path = $child->relative($matcher_root);
27|          next CHILD if $matcher->($relative_path);
28|       }
29|       push @output, $child;
30| 
31|       # recurse, if it's a directory
32|       push @output, find_files($child, \@matchers) if $child->is_dir;
33|    } ## end CHILD: for my $child ($root->children)
34|    return @output;
35| } ## end sub find_files

This recursive function takes care to load all .ignore files it finds in the tree (or whatever you put in constant IGNOREFILE, anyway), sticking to the git convention of honoring all .gitignore files in the project tree, even those in sub-directories.

For this reason, at each directory it has a list of @matchers (line 7) inherited from ancestor directories, which take care of checking files on the way. The list grows only when a .ignore file is present in the current directory (lines 11-15).

The list of children in the directory is scanned (line 20), first checking for possible skips (lines 24-28). In this phase, all matchers are tried, although the path each one receives must be made relative to the directory where that matcher was created in the first place (line 26).

If the file is allowed to pass, it’s added to @output (line 29) and, if a directory, the whole process is recursed within it (line 32).

One caveat about build_gitignore_matcher: if you want to pass a list of lines, pass it as a reference to an array. Otherwise, only the first line will be considered, for a good 10 minutes of confusion and puzzling!
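To make the caveat concrete, here is a small sketch of the two calling styles side by side. The patterns are made up, and the comment on the second call simply restates the behavior described above rather than anything I checked in the module's documentation.

```perl
use strict;
use warnings;
use Text::Gitignore 'build_gitignore_matcher';

my @lines = ('*.o', 'cpanfile*');    # illustrative patterns

# Right: a reference to the array, so every line becomes a pattern.
my $right = build_gitignore_matcher(\@lines);

# Wrong: a flat list; per the caveat, only the first line ('*.o') is used.
my $wrong = build_gitignore_matcher(@lines);

for my $path (qw( main.o cpanfile )) {
   printf "%-10s right: %-3s wrong: %s\n", $path,
      ($right->($path) ? 'yes' : 'no'),
      ($wrong->($path) ? 'yes' : 'no');
}
```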

What might be missing

The behavior shown in the previous section is not 100% the same as what you would have with .gitignore files. Consider the following situation:

.gitignore          --> [cia*]
ciao
subdir1/ciao
subdir2/.gitignore  --> [!ciao]
subdir2/ciao

In this case, the top-level .gitignore would make git ignore all ciao files, except that subdir2/.gitignore explicitly instructs git not to ignore the ciao file there. So, by git's standards, the output list would be the following:

.gitignore
subdir2/.gitignore
subdir2/ciao

but our program would happily ignore the subdir2/ciao file.

This seems unavoidable at this point: the matcher function currently only tells whether something should be excluded, not whether it should be explicitly included instead. In other words, it would need to return one of three values: exclude, indifferent, and include; in the indifferent case, the verdict from the parent directories would apply.
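Just to fix ideas, here is a rough sketch of what such a three-valued matcher might look like. Everything in it is hypothetical: build_tristate_matcher and is_ignored are names I made up, they are not part of Text::Gitignore, and only a tiny subset of the .gitignore syntax is handled (the * and ? wildcards, a leading ! for negation, matching on the last path component only).

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::Gitignore. The returned sub
# yields 'exclude', 'include' or undef (meaning "indifferent").
sub build_tristate_matcher {
   my ($patterns) = @_;
   my %glob = ('*' => '[^/]*', '?' => '[^/]');
   my @rules;
   for my $line (@$patterns) {
      next if $line =~ m{\A\s*(?:\#.*)?\z};    # skip blanks and comments
      my $pattern = $line;
      my $verdict = $pattern =~ s{\A!}{} ? 'include' : 'exclude';
      my $regex = join '',
         map { $glob{$_} // quotemeta($_) } split //, $pattern;
      push @rules, [qr{\A$regex\z}, $verdict];
   }
   return sub {
      # simplification: only the last path component is checked
      my ($basename) = "$_[0]" =~ m{([^/]+)/?\z};
      $basename //= '';
      my $result;    # stays undef when no rule has an opinion
      for my $rule (@rules) {    # the last matching rule wins
         my ($regex, $verdict) = @$rule;
         $result = $verdict if $basename =~ $regex;
      }
      return $result;
   };
}

# Matchers are expected from the outermost directory to the innermost:
# the deepest one with an opinion wins, indifferent ones defer upwards.
sub is_ignored {
   my ($path, @matchers) = @_;
   my $verdict = 'include';    # nothing is ignored by default
   for my $matcher (@matchers) {
      $verdict = $matcher->($path) // $verdict;
   }
   return $verdict eq 'exclude';
}

my $top     = build_tristate_matcher(['cia*']);     # top-level file
my $subdir2 = build_tristate_matcher(['!ciao']);    # subdir2's file

for my $case (
   ['ciao',         $top],
   ['subdir1/ciao', $top],
   ['subdir2/ciao', $top, $subdir2],
) {
   my ($path, @matchers) = @$case;
   printf "%-14s %s\n", $path,
      is_ignored($path, @matchers) ? 'ignored' : 'kept';
}
```

With the example tree above, ciao and subdir1/ciao come out as ignored while subdir2/ciao is kept, because the subdir2 matcher has an opinion that overrides the top-level one. A real patch would have to deal with the full pattern syntax and with relative paths, which this sketch happily skips.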

Maybe a patch is due… Let me know your thinking!