Re: Most used words in current buffer

From: Udyant Wig
Subject: Re: Most used words in current buffer
Date: Wed, 18 Jul 2018 15:06:56 +0530
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

On 07/18/2018 12:11 AM, Emanuel Berg wrote:
> Do it!
> But if you can let go of the Elisp requirement here are some examples
> how to do it with everyday GNU/Unix tools:

I went ahead and did it.  I obtained many solutions, in fact.  Only
today did I check the link above.

First, of the solutions in Emacs Lisp, this one came out as the
fastest:

(require 'cl-lib)

(defun buffer-most-used-words-1 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
        (words (split-string (buffer-string)))
        sorted-counts)
    ;; Count each word case-insensitively.
    (dolist (word words)
      (let ((count (gethash (downcase word) counts 0)))
        (puthash (downcase word) (1+ count) counts)))
    ;; Collect (WORD COUNT) pairs, then sort by count, largest first.
    (cl-loop for word being the hash-keys of counts
       using (hash-values count)
       do
         (push (list word count) sorted-counts)
       finally (setf sorted-counts (cl-sort sorted-counts #'>
                                            :key #'cl-second)))
    (mapcar #'cl-first (cl-subseq sorted-counts 0 n))))

Briefly, it obtains a list of the strings in the buffer, hashes them,
puts the words and their counts in a list, sorts it, and lists the first
N words.  (I had also written solutions (1) using alists; (2) using the
handy AVL tree library I found among the Emacs Lisp files in the Emacs
distribution; and (3) reading the words directly and hashing them.  None
beat the above.)
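
For readers who do not use Emacs Lisp, the same steps (split on
whitespace, hash the lowercased words, sort by count, keep the first N)
can be sketched in Python.  This is only an illustration of the
approach, not the author's Python program; the name most_used_words and
the sample text are hypothetical.

```python
def most_used_words(text, n):
    """Return the n most frequent whitespace-separated words in text."""
    counts = {}
    # Hash each lowercased word to its running count.
    for word in text.split():
        key = word.lower()
        counts[key] = counts.get(key, 0) + 1
    # Sort (word, count) pairs by count, largest first.
    # sorted() is stable, so tied words keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    # Slicing simply returns fewer words when there are fewer than n.
    return [word for word, _ in ranked[:n]]


if __name__ == "__main__":
    sample = "the cat and the hat and the bat"
    print(most_used_words(sample, 2))
```

Unlike cl-subseq, which signals an error when N exceeds the length of
the list, the slice in this sketch quietly returns a shorter list.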

The function is suffixed with '-1' because it is the core of another,
interactive function, which takes the list generated above and displays
it nicely in another buffer.

I was curious about possible solutions in other languages.  I wrote
programs in both Common Lisp and Python, based on the essential hash
table approach.  While a lot faster than the Emacs Lisp solution above,
they were left behind by this old Awk solution (also using hashing) I
found in the classic /The Unix Programming Environment/ by Kernighan and
Pike:


awk '    { for (i = 1; i <= NF; i++) num[$i]++ }
END      { for (word in num) print word, num[word] }
' "$@" | sort -k2 -nr | head -10 | awk '{ print $1 }'

I appended the final awk stage so that the pipeline prints only the
words, without the counts.  I wrapped it up in an Emacs command to
display the words in another buffer, just like my original Emacs Lisp
solution above.

Udyant Wig
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch
