help-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Most used words in current buffer


From: Bob Proulx
Subject: Re: Most used words in current buffer
Date: Thu, 19 Jul 2018 01:04:38 -0600
User-agent: Mutt/1.10.0 (2018-05-17)

Udyant Wig wrote:
> Bob Proulx wrote:
> > Not wanting to be too annoying but I see no hashing in the awk
> > solution.  It is using an awk associative array to store the words.
> > Perl and Pything call those "hashes" but they are just associative
> > arrays.
> 
> I understand that associative arrays in awk are built upon hashing.
> Kernighan and Pike say

Hmm...  I think looking behind the abstract data type at how it might
or might not be implemented is a stretch to say the least.  The entire
reason it is an abstract data type is to hide those types of
implementation details.  :-)

> However, on the previous page, in introducing the language construct,
> they do take the name _associative array_.

The naming of things usually says more about the person that named
the thing than the thing itself.  Associative arrays is a naming that
reflects the concepts involved in what it does.  This is the same as
when someone else names it a map table, or a dictionary.  Those are
all the same thing.  Just using different names because people took
different paths to get there.

>   The implementation of associative memory uses a hashing scheme to
>   ensure that access to any element takes about the same time as to any
>   other, and that (at least for moderate array sizes) the time doesn't
>   depend on how many elements are in the array.

Hash tables are attractive because in their perfect form they are O(1)
order of the function being a fixed amount for any lookup.  Being a
fixed amount sounds very good and therefore authors often implement
using hash tables.  However achieving that perfect O(1) performance
requires knowing the size of the needed hash table at creation time so
that it can be sized correctly from the start and that is not the case
here.  Therefore when implemented as a hash table the table will need
to be grown at various intervals as the number of entries contained is
increased.  That growth is lumped and consumes time and resources for
the growth at the time the table is grown.

For such things I generally prefer balanced tree structures because
work is amortized instead of lumped.  But the important point here is
that for every algorithm + data structure there is a trade-off of some
sort between one thing and another thing.

It was in the Perl community that these became known as "hashes" due
to the implementation.  But I think it is better to keep the ADT
(abstract data type) abstract.

> >   ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }'
> 
> Thank you for the portable pipeline.  It is interesting to compare it
> with the pipeline given in the book:
> 
>   $ wordfreq ch4.* | sort +1 -nr | sed 20q | 4
> 
> where
> 
>   wordfreq is the awk script proper,
>   4 is a shell script that prints its input in 4 columns,
>   and sed 20q does the equivalent of head -20.
> 
> On the last point, they say that given the ease of typing a sed command,
> they felt no need to write the program head.

I am in total agreement over using sed instead of head if you want to
do that.  Seeing 'sed 20q' should roll off the keyboard as print lines
until line 20 and then quit.  Very simple and to the point.  There is
definitely no need for a separate head command.  Other than for
symmetry with tail which is not as simple in sed.

I didn't go look up the reference in the book before posting my
first reply.  I probably should have refreshed my knowledge of it.
Then I would have realized that we weren't talking about literal
hashing of tokens (like what is done in SpamAssassin's Bayes engine
for one example) but rather use of "hash" as a table lookup method.
That caused me to say that first.  Sorry.

As to the old style command line options for head and sort I was just
wanting to keep in people's minds that since a decade ago or so the
new format of the command line options is now the preferred one.  That
old style is certainly how we wrote those things back in the day
though.

Bob



reply via email to

[Prev in Thread] Current Thread [Next in Thread]