help-gnu-emacs

Re: Most used words in current buffer


From: Bob Proulx
Subject: Re: Most used words in current buffer
Date: Wed, 18 Jul 2018 18:45:36 -0600
User-agent: Mutt/1.10.0 (2018-05-17)

Ben Bacarisse wrote:
> Udyant Wig writes:
> > they were left behind by this old Awk solution (also using hashing) I

Not wanting to be too annoying but I see no hashing in the awk
solution.  It is using an awk associative array to store the words.
Perl calls those "hashes" and Python calls them "dictionaries" but
they are just associative arrays.

> > found in the classic /The Unix Programming Environment/ by Kernighan and
> > Pike:
> >...
> > awk '    { for (i = 1; i <= NF; i++) num[$i]++ }
> > END      { for (word in num) print word, num[word] }
> > ' $* | sort +1 -nr | head -10 | awk '{ print $1 }'
> >
> > I appended the last awk pipeline to only give the words without the
> > counts.
>
> The Unix command cut does this task.  Nothing wrong with using another
> awk, but I often feel sorry for poor old cut.  It's been around for
> decades, and yet is so very often overlooked!  Mind you, it uses TABs to
> delimit fields by default, so maybe it only has itself to blame.

I will continue to be contrary here and say that awk does a much
better job of cutting by whitespace separated fields than does cut.
Both are standard and should be available everywhere.  And here
because awk is already in use I expect it to be somewhat more
efficient to use awk again in the pipeline than to use a different
program.
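
To illustrate the point about whitespace separated fields, here is a
small sketch.  The sample line and the variable names are made up for
the example.  It assumes the fields are separated by a run of several
spaces, which is where the two tools differ.

```shell
# A line whose two fields are separated by a run of three spaces.
line='the   42'

# cut treats every single space as a delimiter, so field 2 is the
# empty string between the first and second space.
cut_field=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

# awk splits on runs of blanks by default, so field 2 is the count.
awk_field=$(printf '%s\n' "$line" | awk '{ print $2 }')

printf 'cut: [%s]\nawk: [%s]\n' "$cut_field" "$awk_field"
```

With single-space separated output, as the awk program in this thread
produces, cut -d ' ' -f 1 would give the same result as the second awk.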

I also wish to improve the command line somewhat.  Using $* by itself
does not sufficiently quote program arguments with whitespace.  One
should use "$@" for that purpose.  Also, the old forms of sort and head
(sort +1 and head -10) would be better left behind in favor of the
portable option syntax.  Let me suggest:

  ' "$@" | sort -k2,2nr | head -n10 | awk '{ print $1 }'
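
Put together with the awk program quoted above, the whole thing might
look like the sketch below.  The function name top_words is my own
invention for the example and not from the book, and head -n 10 is
written with a space, which is the form POSIX documents.

```shell
# top_words: hypothetical wrapper name.  Prints the ten most frequent
# whitespace separated words in the named files, most frequent first.
top_words() {
    awk '    { for (i = 1; i <= NF; i++) num[$i]++ }
    END      { for (word in num) print word, num[word] }
    ' "$@" | sort -k2,2nr | head -n 10 | awk '{ print $1 }'
}
```

Called as top_words "my notes.txt" this passes the one file name
through intact.  With an unquoted $* the shell would split it into
"my" and "notes.txt" and awk would look for two files.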

Bob