help-bash
Re: [Help-bash] Using PE to specify an array


From: Greg Wooledge
Subject: Re: [Help-bash] Using PE to specify an array
Date: Mon, 17 Sep 2018 12:19:41 -0400
User-agent: NeoMutt/20170113 (1.7.2)

On Mon, Sep 17, 2018 at 11:36:25AM -0400, Bruce Hohl wrote:
> @Greg, it is an interesting happen-stance that you replied as my question
> arose from my pass at completing your duplicate file finder "exercise" at
> mywiki.wooledge.org/BashProgramming/04:  "If you want to "fix" this
> "problem", you might suppress all the printing until the end, and then
> iterate over the whole array and print only those values that contain a
> newline. (This is left as an exercise.)"  So with your suggestion to use
> nameref vars the following seems to work:
> 
> === Duplicate file finder exercise === (NO comments)
> #!/bin/bash
> while read -r md5_hash file; do
>   var_hash=md5_$md5_hash
>   declare -n ind_var_hash=$var_hash
>   [[ ${#ind_var_hash[@]} -eq 1 ]] && declare -a dup_array+="($var_hash)"
>   declare -a ${!ind_var_hash}+="('$file')"
> done < <(find "${1:-.}" -name $'*\n*' -prune -o -type f -exec md5sum {} +)
> 
> declare -n e
> for e in ${dup_array[@]}; do
>   echo ${!e}
>   for f in ${e[@]}; do echo "  $f"; done
> done

So your approach was to experiment with bash commands until you found
something that would approximate giving you the ability to have a hash
of lists (associative array of indexed arrays).

And what you came up with was using the entire bash variable namespace
as your hash, and storing each list as a separate indexed array within
that namespace.

That's... definitely not how I would have done it. ;-)

You're also missing some quotes.
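
As a minimal illustration of why the quotes matter (the array contents
below are made up for the example, not taken from the script above), an
unquoted array expansion is word-split, so a filename containing a space
turns into two loop iterations:

```shell
#!/bin/bash
# Hypothetical sample data: one filename contains a space.
files=('a b.txt' 'c.txt')

unquoted=0
for f in ${files[@]}; do      # unquoted: 'a b.txt' is split into 'a' and 'b.txt'
  unquoted=$((unquoted + 1))
done

quoted=0
for f in "${files[@]}"; do    # quoted: exactly one iteration per element
  quoted=$((quoted + 1))
done

printf 'unquoted: %d, quoted: %d\n' "$unquoted" "$quoted"
```

This assumes the default IFS.  Unquoted expansions are also subject to
filename expansion, which this sample data does not trigger.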

Anyway, here is the solution that I had in mind for that:

=====================================================
#!/bin/bash
declare -A seen
while read -r md5 file; do
  if [[ ${seen[$md5]} ]]; then
    seen[$md5]+=$'\n'$file
  else
    seen[$md5]=$file
  fi
done < <(find "${1:-.}" -name $'*\n*' -prune -o -type f -exec md5sum {} +)

for i in "${!seen[@]}"; do
  if [[ ${seen[$i]} = *$'\n'* ]]; then
    printf 'Matching MD5:\n%s\n\n' "${seen[$i]}"
  fi
done
=====================================================

The stuff I wrote in the text was really quite literal: "store multiple
filenames for each MD5 value (in a newline-delimited pseudo-list)" and
"iterate over the whole array and print only those values that contain
a newline".  That's what I'm doing here.

This is also a hack, using newlines to store multiple elements of a list
in a string variable, and this only works because we're already excluding
filenames that have a newline in them.  This frees up the newline character
to act as a list delimiter.
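
If the individual filenames are needed later, the pseudo-list can be
turned back into a real indexed array.  A minimal sketch, assuming bash 4
or later for mapfile; the variable names and the sample value are
hypothetical and only mimic one entry of the "seen" array:

```shell
#!/bin/bash
# Hypothetical value, shaped like one entry of the "seen" array.
entry=$'dir1/a.txt\ndir2/copy of a.txt'

# mapfile splits on newlines only, so spaces inside names survive;
# -t strips the trailing newline from each element.
mapfile -t names <<< "$entry"

printf 'count: %d\n' "${#names[@]}"
for name in "${names[@]}"; do
  printf '  %s\n' "$name"
done
```

This only works for the same reason the hack itself works: no stored
filename contains a newline.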

In the absence of that opening, I would simply have written the program
in a different language -- one that allows you to create a hash of lists
without needing special hacks and tricks.

For example, a relatively straight conversion to Tcl:

=====================================================
#!/usr/bin/env tclsh
if {[llength $argv]} {set start [lindex $argv 0]} else {set start .}
foreach line [split \
      [exec find $start -name "*\n*" -prune -o -type f -exec md5sum "{}" +] \
      \n] {
  set md5 [string range $line 0 31]
  set file [string range $line 34 end]
  lappend seen($md5) $file
}

foreach i [array names seen] {
  if {[llength $seen($i)] < 2} continue
  puts [format "Matching MD5: %s" [join $seen($i) { }]]
}
=====================================================

The output format is slightly different, but of course that can
be adjusted.  The elements of "seen" are simply lists of filenames,
as this language supports this directly.  I'm sure a similar solution
could be written in Python (which I don't know well enough to write in).

The only reason this solution excludes filenames with newlines is the
md5sum command's output format.
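
For reference, that format is one record per line: 32 hex digits, two
separator characters, then the filename.  A minimal bash sketch of the
same fixed-offset split the Tcl version uses, assuming GNU md5sum output
for an ordinary filename; the sample line and the variable names are made
up for illustration:

```shell
#!/bin/bash
# Hypothetical sample line in md5sum's default output format.
line='d41d8cd98f00b204e9800998ecf8427e  my file.txt'

md5=${line:0:32}    # the 32 hex digits of the hash
file=${line:34}     # everything after the two separator characters

printf 'md5=%s\nfile=%s\n' "$md5" "$file"
```

Because each record ends at a newline, a filename that itself contains a
newline cannot be read back with a simple line-oriented loop.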
