Bash

pi-_
Can anyone help me understand the syntax used in the answer here: http://stackoverflow.com/questions/24341618/bash-script-to-find-files-that-match-by-content ?
pi-_

    find . -name '*.wav' -exec cksum '{}' + | awk 'NR == FNR {ck[$1] = $3; next} {if ($1 in ck) print ck[$1], $3}' cksum.txt -

pi-_
It looks like it is doing: find . -name '*.wav' -exec foo
waldner
so it reads in all of cksum.txt
waldner
well, columns 1 and 3
pi-_
ah ok
pi-_
So a sensible debug step from me would be to look at the output of: find . -name '*.wav' -exec cksum '{}'
waldner
then the output coming from the find command is processed, and for lines whose first field also appears in cksum.txt, the file name stored for that checksum and the third field are printed
pi-_
It is starting to look a bit less scary now…
waldner
the output of that is used as "second file" input to awk
waldner
the "-" stands for "read stdin"
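A minimal sketch of the "-" argument, assuming a scratch directory from mktemp and a made-up file called first.txt. awk reads the named file first and then whatever arrives on standard input.

```shell
tmp=$(mktemp -d)                      # scratch directory for the demo
printf 'from-file\n' > "$tmp/first.txt"

# awk reads first.txt, then "-" (standard input, fed by printf here)
dash_out=$(printf 'from-stdin\n' | awk '{ print }' "$tmp/first.txt" -)
printf '%s\n' "$dash_out"
rm -rf "$tmp"
```

The line from the file should come out before the line from the pipe, because the arguments are read in order.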
pi-_
What is '{}' doing?
waldner
it's replaced by the file name found by find
pi-_
And why is there a + before | ?
waldner
basically find is outputting a list of checksum and file names
waldner
it's the syntax of find
waldner
it can be a ; or a +
waldner
when using ;, the command (cksum in this case) is invoked once per file
pi-_
I thought ; was used to separate multiple commands on the same line…
waldner
that too, but not with find
waldner
or better, yes, in fact when you use it with find you have to escape it
waldner
…. -exec command {} \;
waldner
otherwise it's taken as command separator by the shell
waldner
if you use + instead of ;, the command (cksum) is run with multiple arguments instead, as many as fit on the command line
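To see the difference between the two terminators, here is a small sketch that uses echo in place of cksum and three empty, made-up .wav files in a scratch directory. Counting the output lines shows how many times the command was run.

```shell
tmp=$(mktemp -d)
touch "$tmp/a.wav" "$tmp/b.wav" "$tmp/c.wav"

# With \; echo runs once per file, so there is one output line per file.
per_file=$(find "$tmp" -name '*.wav' -exec echo '{}' \; | awk 'END { print NR }')

# With + echo receives all the names in a single invocation: one output line.
batched=$(find "$tmp" -name '*.wav' -exec echo '{}' + | awk 'END { print NR }')

printf 'lines with ;: %s, lines with +: %s\n' "$per_file" "$batched"
rm -rf "$tmp"
```

With only three short names everything fits on one command line, so the + form runs echo exactly once here.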
waldner
so find is outputting a list of checksum and file names
waldner
then awk is used to print those among them whose checksum also appears in the cksum.txt file
waldner
(presumably a list built from a previous run)
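A sketch of the whole round trip, assuming two made-up directories (old and new) with tiny text files standing in for .wav files. The first find builds cksum.txt as the "previous run"; the second one is the command from the question.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/old" "$tmp/new"
printf 'one\n' > "$tmp/old/a.wav"
printf 'one\n' > "$tmp/new/copy.wav"    # same content as old/a.wav
printf 'two\n' > "$tmp/new/other.wav"   # different content

# "Previous run": record checksum, size and name of the old files.
(cd "$tmp/old" && find . -name '*.wav' -exec cksum '{}' +) > "$tmp/cksum.txt"

# Current run: print the old name and the new name for every matching checksum.
matches=$(cd "$tmp/new" && find . -name '*.wav' -exec cksum '{}' + |
  awk 'NR == FNR {ck[$1] = $3; next} {if ($1 in ck) print ck[$1], $3}' ../cksum.txt -)
printf '%s\n' "$matches"
rm -rf "$tmp"
```

Only copy.wav shares its content with a file from the old directory, so it should be the only one reported, paired with ./a.wav.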
pi-_
yes!
waldner
so if find outputs
waldner
a b file1
waldner
c d file2
waldner
and in cksum.txt we have (for example)
waldner
x y zzzz
waldner
a b nnnnnn
waldner
the "a" matches, so awk will print
waldner
nnnnnn file1
waldner
where "a" is an actual checksum
waldner
but the idea is the same
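The placeholder data can be fed through the original awk program directly. This is only a sketch: the letters stand in for real checksums and sizes, and cksum.txt is written to a scratch directory.

```shell
tmp=$(mktemp -d)
# Stand-in for cksum.txt, using the placeholder values from the example.
printf 'x y zzzz\na b nnnnnn\n' > "$tmp/cksum.txt"

# Stand-in for the find output, fed to awk on standard input.
sample_out=$(printf 'a b file1\nc d file2\n' |
  awk 'NR == FNR {ck[$1] = $3; next} {if ($1 in ck) print ck[$1], $3}' "$tmp/cksum.txt" -)
printf '%s\n' "$sample_out"
rm -rf "$tmp"
```

The program prints ck[$1] (the name stored from cksum.txt) followed by $3 (the name coming from find) for the one line whose first field matches.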
pi-_
find . -name '*.wav' just produces a list of filenames, one per line.
waldner
yes, but then they are passed to cksum
pi-_
If I run that with: find . -name '*.wav' -exec cksum '{}' +
waldner
by the -exec {} + call
waldner
so again
pi-_
Now it is giving three columns
waldner
yes
waldner
that's standard cksum output
waldner
checksum, size and file name
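A quick way to check the three columns, assuming a made-up six-byte file in a scratch directory whose path contains no spaces. The checksum itself is not shown here, only the field count and the size column.

```shell
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/demo.wav"     # 6 bytes; not a real WAV file

# Count the fields and pull out the size column (the second field).
cksum_fields=$(cksum "$tmp/demo.wav" | awk '{ print NF }')
cksum_size=$(cksum "$tmp/demo.wav" | awk '{ print $2 }')
printf 'fields: %s, size: %s\n' "$cksum_fields" "$cksum_size"
rm -rf "$tmp"
```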
pi-_
aah ok
waldner
so if find finds file1, file2, file3 etc.
waldner
the -exec actually runs: cksum file1 file2 file3 …
waldner
and *that* is the final output of find
waldner
which is piped to awk and used as second file
pi-_
ok I see, so the '+' is concatenating the entire output of 'find'
waldner
not exactly
waldner
it's just taking as many filenames as possible, and running "cksum" with that list of files as arguments
waldner
if it's not possible to pass all of them at once, then "cksum" is run multiple times, until it has run over all the files
waldner
find takes care of doing all that with -exec …. {} +
waldner
it basically does a similar thing to xargs, if you know it
waldner
(not exactly the same)
pi-_
ok, now I can understand everything to the left of the pipe
waldner
so awk is being invoked with two args
waldner
both are file names, although one is special
pi-_
ah, 'cksum.txt' and '-' ?
waldner
the NR==FNR part is a condition that is only true while awk is reading the first file
waldner
yes
waldner
so NR==FNR{…something …}, that "something" is only run while awk is reading cksum.txt
waldner
that something is: ck[$1] = $3
waldner
so it creates an element in the associative array "ck", whose key is the checksum and the value is the file name
waldner
(there's a flaw there, but it's not important now)
waldner
so if you had a line in the file like this:
waldner
12345 44 file1.wav
waldner
awk is doing: ck["12345"] = "file1.wav"
waldner
and so on for each line of cksum.txt
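As a small sketch of that array being built, the following uses a made-up two-line cksum.txt in a scratch directory and looks up one key once the file has been read.

```shell
tmp=$(mktemp -d)
printf '12345 44 file1.wav\n67890 10 file2.wav\n' > "$tmp/cksum.txt"

# Build ck[checksum] = file name, then look one key up at the end.
lookup=$(awk '{ ck[$1] = $3 } END { print ck["12345"] }' "$tmp/cksum.txt")
printf '%s\n' "$lookup"
rm -rf "$tmp"
```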
pi-_
I get it! Yes, I've come across Dictionary objects like this in Python and C#
waldner
so after awk has read all of cksum.txt, the ck array will have multiple elements
waldner
yes, same thing
pi-_
So it is creating this dictionary on-the-fly
waldner
yes
waldner
at that point, it starts reading the "-" file, which means, stdin
waldner
ie, the output from find
pi-_
Until cksum.txt is exhausted…
waldner
which, not coincidentally, has the same format
waldner
so if $1 exists as a key in ck …. (the "if ($1 in ck)" part)
pi-_
That's really slick
waldner
it prints ck[$1], the file name saved from cksum.txt, and $3
waldner
that is, the name of the file from find that has the same checksum
pi-_
Fantastic, thanks very much for that introduction to awk!
pi-_
You said there was a flaw, is there a better way to do this operation?
waldner
of course, since awk splits fields on whitespace, there's a problem if some file has a space in its name
waldner
ie, if a file is "file foo.wav"
waldner
then running cksum on it we get:
waldner
12345 5432 file foo.wav
waldner
in this case, $3 is not the whole file name, instead it's only "file"
waldner
whereas we want "file foo.wav"
pi-_
Right, so $3 needs to be replaced with '$3 to the end of the line…' somehow
waldner
so instead of taking $3, one should just take whatever comes after the second space until the end of the line
waldner
yes
waldner
of course, if your file names have no spaces, $3 works fine
waldner
but it's not 100% safe
pi-_
How to do that?
pi-_
Is it fiddly?
waldner
with awk, you can just use sub() to remove everything that comes before
waldner
eg sub(/^[0-9]+ [0-9]+ /, "")
waldner
what's left must be the file name
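A sketch of the problem and the fix side by side, using the sample line from above rather than real cksum output.

```shell
line='12345 5432 file foo.wav'

# $3 stops at the first space inside the name.
third_field=$(printf '%s\n' "$line" | awk '{ print $3 }')

# Deleting the two leading numeric columns leaves the whole name in $0.
whole_name=$(printf '%s\n' "$line" | awk '{ sub(/^[0-9]+ [0-9]+ /, ""); print $0 }')

printf '$3: %s\nafter sub(): %s\n' "$third_field" "$whole_name"
```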
waldner
so the code becomes:
waldner
find . -name '*.wav' -exec cksum '{}' + | awk 'NR == FNR {ck[$1] = $3; next} {if ($1 in ck) { sub(/^[0-9]+ [0-9]+ /, ""); print ck[$1], $0}}' cksum.txt -
waldner
more things:
waldner
if you only want the checksum and the matching file name from find, you don't actually need to save the $3 from cksum.txt
waldner
so you can simply do
waldner
find . -name '*.wav' -exec cksum '{}' + | awk 'NR == FNR {ck[$1]; next} {if ($1 in ck) { sub(/^[0-9]+ [0-9]+ /, ""); print ck[$1], $0}}' cksum.txt -
waldner
the "ck[$1]" here already creates the key in the dictionary, so you don't need to assign a value to it
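A sketch of that behaviour in isolation, with made-up keys. A bare reference creates the element with an empty value, while the "in" test on its own does not create anything.

```shell
key_demo=$(awk 'BEGIN {
  ck["12345"]                                # bare reference: creates the key
  if ("12345" in ck) print "key exists"
  if (ck["12345"] == "") print "value is empty"
  if (!("99999" in ck)) print "99999 is absent"   # "in" does not create keys
}')
printf '%s\n' "$key_demo"
```

Because the stored value is empty, printing ck[key] for such an element prints an empty string; the key itself is what carries the checksum.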
waldner
then
waldner
awk programs have (or should have) a " condition { action } " structure
waldner
so the if ($1 in ck) can be rewritten more idiomatically to be a condition:
waldner
find . -name '*.wav' -exec cksum '{}' + | awk 'NR == FNR {ck[$1]; next} $1 in ck { sub(/^[0-9]+ [0-9]+ /, ""); print ck[$1], $0}' cksum.txt -
waldner
ah, I have introduced an error, wait
waldner

find . -name '*.wav' -exec cksum '{}' + | awk 'NR == FNR {ck[$1]; next} $1 in ck { cksum = $1; sub(/^[0-9]+ [0-9]+ /, ""); print cksum, $0}' cksum.txt -

waldner
otherwise $1 is lost after the sub()
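A sketch of why the checksum has to be saved first, again on the sample line. The variable name sum is just an illustration.

```shell
line='12345 5432 file foo.wav'

# After sub() rewrites $0, awk re-splits the record, so $1 changes.
lost=$(printf '%s\n' "$line" | awk '{ sub(/^[0-9]+ [0-9]+ /, ""); print $1 }')

# Saving the checksum first keeps it available afterwards.
kept=$(printf '%s\n' "$line" | awk '{ sum = $1; sub(/^[0-9]+ [0-9]+ /, ""); print sum, $0 }')

printf 'lost: %s\nkept: %s\n' "$lost" "$kept"
```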
waldner
finally, it fails if your file names contain newlines (in which case you deserve that not only this code, but many others fail as well anyway)
pi-_
haha I've never come across a newline in a file
waldner
in a file hopefully yes; I'm talking about filenames
pi-_
^name woops
pi-_
Do NR and FNR stand for something? I guess F is for "First"
waldner
NR is the total number of records (lines) awk has read, from the beginning
waldner
FNR is the number of records that awk has read from the current file it's processing
waldner
it's reset every time awk finishes reading a file and goes on to the next
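The two counters can be watched directly. This sketch assumes two made-up two-line files in a scratch directory and prints NR, FNR and the line itself for every record.

```shell
tmp=$(mktemp -d)
printf 'a1\na2\n' > "$tmp/first.txt"
printf 'b1\nb2\n' > "$tmp/second.txt"

counters=$(awk '{ print NR, FNR, $0 }' "$tmp/first.txt" "$tmp/second.txt")
printf '%s\n' "$counters"
rm -rf "$tmp"
```

NR keeps counting across both files, while FNR goes back to 1 on the first line of second.txt.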
pi-_
oh so that's actually rather cunning way of ensuring we're on the first file
waldner
so NR == FNR is true only while awk is reading the first file
waldner
yes
waldner
you could say eg FILENAME == "cksum.txt" to get the same result
waldner
but NR == FNR is the typical way to do that
waldner
you don't depend on file names etc
waldner
FILENAME == ARGV[1] is another way, that doesn't depend on file names, but I prefer NR == FNR
waldner
it's the most used idiom
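One case where the two tests differ is an empty first file. The sketch below assumes made-up files empty.txt and second.txt in a scratch directory and runs the same two-rule program with each condition.

```shell
tmp=$(mktemp -d)
: > "$tmp/empty.txt"                       # first file has no records
printf 'only line\n' > "$tmp/second.txt"

# NR == FNR is also true for the second file here, because NR is still 0
# when awk starts reading it.
by_counter=$(awk 'NR == FNR { print "first:", $0; next } { print "second:", $0 }' \
  "$tmp/empty.txt" "$tmp/second.txt")

# Comparing FILENAME with ARGV[1] is not fooled by the empty file.
by_name=$(awk 'FILENAME == ARGV[1] { print "first:", $0; next } { print "second:", $0 }' \
  "$tmp/empty.txt" "$tmp/second.txt")

printf '%s\n%s\n' "$by_counter" "$by_name"
rm -rf "$tmp"
```

As long as the first file is known to have at least one line, both conditions behave the same.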
pi-_
What is going on with that ';next' ? I suppose it spins within the {} until the outside condition fails…?
waldner
next tells awk to immediately go to the next input record (line)
waldner
so basically it's a way to ensure that what comes after is not executed
waldner
in this case, that the "$1 in ck { …." part is not executed
waldner
since that has to run while awk reads the second file, not the first
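A small sketch of next on its own, with two made-up input lines. The second rule has no condition, so it would run for every line if the first rule did not skip ahead.

```shell
next_demo=$(printf 'keep 1\nskip 2\n' |
  awk '$1 == "skip" { print "first rule:", $0; next } { print "second rule:", $0 }')
printf '%s\n' "$next_demo"
```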
pi-_
oh so it's CONDITION {execute this} {otherwise execute this}
waldner
well, awk just evaluates the code from left to right
waldner
if there's a condition, it's evaluated and if it's true, the associated code is executed
waldner
if there's no condition, the code in {} is executed unconditionally
waldner
here both code blocks have conditions
waldner
one has NR == FNR, the other has $1 in ck
pi-_
ah ok, I was looking back at the original
pi-_
I get it
waldner
so if we don't say otherwise, awk just goes on and evaluates all of them
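Without next, every rule whose condition holds is run in order for the same line. A sketch with one made-up input line and three rules:

```shell
rules_demo=$(printf '5\n' |
  awk '$1 > 1 { print "rule A" } $1 > 2 { print "rule B" } { print "rule C" }')
printf '%s\n' "$rules_demo"
```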
waldner
in the original there's an if() inside that does the same thing
pi-_
I'm looking at: sub(/^[0-9]+ [0-9]+ /, "")
waldner
ok, so after the find and awk tutorials, let's move on to the regex one
pi-_
I take it this is doing "wild card followed by space followed by wild card followed by space followed by FOO" is getting replaced by "FOO"
waldner
no
waldner
whatever matches the pattern inside /…/, is replaced by "", that is, nothing
waldner
ie, it's deleted
waldner
the expression says: ^ (beginning of line)
waldner
[0-9]+
waldner
one or more digits
waldner
(a space)
pi-_
oic
waldner
then again one or more digits, then another space
pi-_
gah I forgot the first two entries were numerical
waldner
so if we have 1234 12 xxxxxxxxxx
waldner
the regex matches the "1234 12 " part
pi-_
yes, I get it!
waldner
which is deleted
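To see exactly which part the regex covers, this sketch uses awk's match() function instead of sub(). match() sets RSTART and RLENGTH, so the matched span and the remainder can be printed separately; the brackets are only there to make the trailing space visible.

```shell
regex_demo=$(printf '1234 12 xxxxxxxxxx\n' | awk '{
  if (match($0, /^[0-9]+ [0-9]+ /)) {
    # RSTART and RLENGTH describe the matched span.
    printf "matched [%s]\n", substr($0, RSTART, RLENGTH)
    printf "rest [%s]\n", substr($0, RSTART + RLENGTH)
  }
}')
printf '%s\n' "$regex_demo"
```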
pi-_
That's amazing, thank you!
waldner
np
pi-_
You've picked apart every symbol in that whole expression
pi-_
Can I paste a transcript on that stack overflow page?
waldner
it's all pretty basic stuff actually
waldner
there are zillions of documents on the internet that explain all that in an even better way
waldner
just google
pi-_
sure, but that would've taken hours of googling
pi-_
And it's quite difficult to google syntax
waldner
that's how one learns
pi-_
yep
waldner
find tutorial: http://mywiki.wooledge.org/UsingFind
waldner
regex tutorial: http://rexegg.com/
waldner
(you only need the second page to understand the regex used here)
waldner
for awk, see the channel topic for tons of good docs

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License