Command Line Ride
When your hard drive is running out of space, chances are good that there are files you can safely delete to free some of it up. Archive files you’ve already decompressed, like ZIP and RAR files, are a safe bet.
I went on a little ride to put together a command that shows me how much space a certain type of file takes up. The first thing I did was find those files…
find / -type f -name '*.rar'
What does find do here? It looks up all the files on the hard drive that are regular files (-type f) rather than directories and whose names end in .rar. That got me pretty far, but still not far enough: sometimes RAR archives are split up into several files called archive.rar, archive.r01, archive.r02 and so on. They should be listed as well:
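find / -type f \( -regex '.*\.r[0-9][0-9]' -o -name '*.rar' \)
That’s better! Here find lists the files whose path matches the provided regex or (-o) whose name ends in .rar. The \. makes the regex match a literal dot before the r. The escaped parentheses group the two alternatives so that -type f applies to both; without them it would only apply to the regex, and a directory named something.rar would be listed too. Running this command in a directory containing such a set of RAR files outputs the following:
$ find . -type f \( -regex '.*\.r[0-9][0-9]' -o -name '*.rar' \)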
./big_archive.r01
./big_archive.r02
./big_archive.r03
./big_archive.r04
./big_archive.r05
./big_archive.rar
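One detail worth checking in a sandbox: -o binds more loosely than the implicit "and" between find's tests, so -type f only guards the branch it is written in unless the alternatives are grouped with \( and \). The sketch below uses made-up names (demo.rar, demo.r01, notes.txt and a directory called backup.rar) and assumes a find that supports -regex, which GNU and BSD find both do.

```shell
#!/usr/bin/env bash
# Sandbox with made-up names: two archive parts, one unrelated file,
# and a directory whose name happens to end in .rar.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
touch demo.rar demo.r01 notes.txt
mkdir backup.rar

# Without grouping, -type f only guards the -regex branch,
# so the directory backup.rar is matched by -name as well.
ungrouped=$(find . -type f -regex '.*\.r[0-9][0-9]' -o -name '*.rar' | LC_ALL=C sort)

# With \( ... \), -type f applies to both alternatives.
grouped=$(find . -type f \( -regex '.*\.r[0-9][0-9]' -o -name '*.rar' \) | LC_ALL=C sort)

printf 'ungrouped:\n%s\n\ngrouped:\n%s\n' "$ungrouped" "$grouped"
```

The ungrouped run lists ./backup.rar alongside the two files, while the grouped run lists only the files.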
But I still didn’t know how big those files are. So the first thought was: use ls -l on every file, since that gives me the size of each one. But ls takes filenames as command line arguments and doesn’t read them from the standard input stream. So I couldn’t just pipe the list to ls, since a Unix pipe connects the standard output stream of one program to the standard input of another program. Just try it:
find . -name '*.txt' | ls -l
That won’t give you the desired output. What happens here is that ls doesn’t get an argument and so lists the contents of the current directory. So how does one call ls -l on every file in the list above? Pipe the list of files to xargs. xargs’s job is to construct argument lists for the provided command. It does so by splitting up the data it receives on standard input and using each chunk as an argument. By default xargs splits the incoming data on newlines and blanks, which is normally fine but leads to problems when find outputs a filename containing whitespace. In that case, be sure to read man find and man xargs: you can specify a delimiter other than blank or newline. So far the output should look like this:
$ find . -type f \( -regex '.*\.r[0-9][0-9]' -o -name '*.rar' \) | xargs ls -l
-rw-r--r-- 1 mrnugget staff 1024000 Oct 18 13:58 ./big_archive.r01
-rw-r--r-- 1 mrnugget staff 1024000 Oct 18 13:58 ./big_archive.r02
-rw-r--r-- 1 mrnugget staff 1024000 Oct 18 13:58 ./big_archive.r03
-rw-r--r-- 1 mrnugget staff 1024000 Oct 18 13:58 ./big_archive.r04
-rw-r--r-- 1 mrnugget staff 1024000 Oct 18 13:58 ./big_archive.r05
-rw-r--r-- 1 mrnugget staff 1024000 Oct 18 13:57 ./big_archive.rar
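The whitespace caveat is easy to reproduce. The sketch below is a sandbox with a made-up file name, "my archive.rar", and counts how many arguments xargs hands to the command. It relies on find's -print0 and xargs's -0, which are extensions rather than part of older POSIX standards, though GNU and BSD versions of both tools support them.

```shell
#!/usr/bin/env bash
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
printf 'x' > 'my archive.rar'

# Default splitting: xargs sees "./my" and "archive.rar" as two arguments.
# The inline sh script just prints how many arguments it received.
naive_count=$(find . -type f -name '*.rar' | xargs sh -c 'echo "$#"' sh)

# NUL-delimited: find ends each name with a NUL byte and xargs splits
# only on NUL, so the file name arrives as a single argument.
safe_count=$(find . -type f -name '*.rar' -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "default splitting: $naive_count argument(s)"
echo "NUL-delimited:     $safe_count argument(s)"
```

Applied to the pipeline above, that would mean ending the find expression with -print0 and calling xargs -0 ls -l.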
Great, I thought, now I just need to get all the different filesizes, add them together and print the total! Shouldn’t be too hard, right? Well, it isn’t if you’ve got awk. awk has far too many capabilities to explain in this blog post, so let me make it short: awk reads its input either from STDIN or from files passed in as arguments and then performs actions on matching lines. To make it even shorter: awk is awesome. There is a lot of free information about awk available on the internet, but a single man awk goes a long way.
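To make "performs actions on matching lines" a bit more concrete, here is a minimal sketch of awk's pattern { action } model. The three input lines are made up, shaped like ls -l output; nothing here touches the file system.

```shell
#!/usr/bin/env bash
# Three made-up lines in the shape of `ls -l` output.
listing='-rw-r--r-- 1 user staff 1024000 Oct 18 13:58 ./demo.r01
-rw-r--r-- 1 user staff 2048 Oct 18 13:58 ./notes.txt
-rw-r--r-- 1 user staff 1024000 Oct 18 13:57 ./demo.rar'

# pattern { action }: only for lines ending in .rar,
# print field 9 (the name) and field 5 (the size).
rar_only=$(printf '%s\n' "$listing" | awk '/\.rar$/ { print $9, $5 }')

# No pattern means the action runs on every line; NF is the number of fields.
field_counts=$(printf '%s\n' "$listing" | awk '{ print NF }')

echo "$rar_only"
echo "$field_counts"
```

Fields are split on blanks by default, which is why the size ends up in $5 and the name in $9 for lines like these.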
Using awk to output the sum of the filesizes looks like this:
$ find . -type f \( -regex '.*\.r[0-9][0-9]' -o -name '*.rar' \) | xargs ls -l | awk '{sum = sum + $5} END {print sum}'
6144000
Here awk takes the fifth field (fields are separated by blanks by default) and adds it to the variable sum. At the end of the awk program (after awk has run it over every line) it prints sum, which gives us the total of the filesizes. But that’s not really readable, since the output of ls -l contains filesizes in bytes, and I think it’s safe to say that megabytes would be far more handy in this case. So I had to divide the sum by 1024 to get kilobytes and then again by 1024 to get megabytes, and I did this with the help of xargs and bc:
$ find . -type f \( -regex '.*\.r[0-9][0-9]' -o -name '*.rar' \) | xargs ls -l | awk '{sum = sum + $5} END {print sum}' | xargs -I sum echo sum/1024/1024 | bc -l
5.85937500000000000000
That looks great! So what happens here? With -I sum, xargs replaces the placeholder sum in the echo command with the data it reads from standard input, so echo outputs the calculation that needs to be fed to bc. Without bc, the command above would just output 6144000/1024/1024. bc then takes this as input and gives us the result. Be sure to do this: man bc. This example doesn’t even scratch the surface of what bc is capable of.
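As an aside, the conversion step can be tried in isolation. The sketch below starts from the byte total shown above and runs the xargs and bc route next to an alternative that is not part of the original pipeline: letting awk do the division and rounding itself. It assumes bc is installed for the first variant.

```shell
#!/usr/bin/env bash
# Byte total taken from the awk output above.
total_bytes=6144000

# The route taken above: build the expression as text, let bc evaluate it.
via_bc=$(echo "$total_bytes" | xargs -I sum echo sum/1024/1024 | bc -l)

# Alternative sketch: awk divides and rounds to two decimals on its own.
via_awk=$(echo "$total_bytes" | awk '{ printf "%.2f\n", $1 / 1024 / 1024 }')

echo "bc:  $via_bc"
echo "awk: $via_awk"
```

Folding the division into the END block of the summing awk program would make both xargs -I and bc unnecessary, at the price of a slightly longer awk one-liner.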
So now the job is done, right? The command line above outputs the total size of all the RAR files on the hard drive or in the current directory. Well, technically yes, it’s done. But as you can see, that was pretty heavy lifting: nobody will remember that command, and nobody looking at it for the first time will know exactly what it does.
And here’s the kicker: it’s useless. That command above is obsolete. As soon as I finished hacking up that command line I remembered a tool that I use basically every day but totally forgot about while hacking together the right find -regex and looking up awk formulas and how bc works. There is one tool that does exactly what that long line above does, and it’s called du. du is built for the job. It’s a simple tool that does one thing very well, and that is (I quote the man page here) to “display disk usage statistics”. I can’t for the life of me explain how I forgot it. With du in hand, the line shrinks down to this:
$ find . -type f \( -regex '.*\.r[0-9][0-9]' -o -name '*.rar' \) | xargs du -ch
1000K ./example_dir/big_archive.r01
1000K ./example_dir/big_archive.r02
1000K ./example_dir/big_archive.r03
1000K ./example_dir/big_archive.r04
1000K ./example_dir/big_archive.r05
1000K ./example_dir/big_archive.rar
5.9M total
That looks a lot better than that monster I hacked together. And it’s easier to understand too: 1) find all the files matching a certain pattern, 2) then pass them to du to display how much space they take up. After remembering du I thought: “Well, maybe I can shrink this down even further”. And I did.
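The find plus du combination can be tried without touching real archives. The sketch below creates two made-up 1000 KiB files in a temporary directory. Keep in mind that du reports allocated disk blocks rather than byte lengths, so its figures can differ slightly from what ls -l shows, depending on the file system.

```shell
#!/usr/bin/env bash
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

# Two made-up 1000 KiB files, the same size as in the listing above.
for name in demo.rar demo.r01; do
  dd if=/dev/zero of="$name" bs=1024 count=1000 2>/dev/null
done

# -c appends a grand total line, -h prints sizes in human-readable units.
report=$(find . -type f \( -regex '.*\.r[0-9][0-9]' -o -name '*.rar' \) | xargs du -ch)
total_line=$(printf '%s\n' "$report" | tail -n 1)

echo "$report"
```

The last line of the report is the grand total that -c adds, which is the number the long awk and bc pipeline was built to compute.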
When you use find and the -regex option you’re in for a challenge. find allows you to use many different types of regular expressions, and the differences sometimes make it really difficult and frustrating to get the regex you want to work. Just have a look here.
Most of the time it’s probably easier to use the globbing functionality of your shell rather than find and regex. ZSH in particular is pretty good at globbing, and bash also does its job very well. Since I’m using ZSH I tried to get rid of find and use my shell’s built-in globbing functionality. And what I came up with is so much better than that long line above:
$ ls **/*.r(ar|<-99>) | xargs du -ch
1000K example_dir/big_archive.r01
1000K example_dir/big_archive.r02
1000K example_dir/big_archive.r03
1000K example_dir/big_archive.r04
1000K example_dir/big_archive.r05
1000K example_dir/big_archive.rar
5.9M total
That line recursively lists all files ending in either rar or r01 up to r99 and then passes them over to du. Easy to read, easy to understand and, more importantly, easy to reuse.
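For readers on bash rather than ZSH, a rough equivalent can be sketched with the globstar option. This is a sketch under a few assumptions: it needs bash 4 or newer, bash has no counterpart to ZSH's <-99> numeric range so [0-9][0-9] stands in for it, and the file names are made up.

```shell
#!/usr/bin/env bash
# globstar makes ** match across directory levels (bash 4+);
# nullglob drops patterns that match nothing instead of passing them on literally.
shopt -s globstar nullglob

sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
mkdir -p example_dir
touch top.rar example_dir/demo.rar example_dir/demo.r01 example_dir/notes.txt

# Expand the patterns into an array so file names with spaces stay intact.
archives=( **/*.rar **/*.r[0-9][0-9] )

printf '%s\n' "${archives[@]}"
du -ch -- "${archives[@]}"
```

Because the shell expands the patterns itself, the file names reach du directly as arguments and no xargs is needed in this variant.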
I learned a lot on this ride, and the most important part was seeing the Unix philosophy in action:
“This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” - Doug McIlroy
All of those programs work very well together: they are combinable, they are reusable and they do one thing well. And in this case du was the program that did the one thing I wanted to achieve very well and could be used to replace another complex “program”, if you want to call that line above a program.
That is not to say that complex command lines are always wrong to use. Sometimes you need that many programs to work together in order to get the desired output. And when that happens it’s great to see how good every tool is at doing its own job and which problems can be solved by combining them. And Unix pipes make it dead easy to combine them by offering a clean and easy-to-understand interface.
Seeing that philosophy in action shows extremely well how much code and programs can profit from complying with it.