Re: Most used words in current buffer

help-gnu-emacs

From:	Udyant Wig
Subject:	Re: Most used words in current buffer
Date:	Wed, 18 Jul 2018 15:06:56 +0530
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

On 07/18/2018 12:11 AM, Emanuel Berg wrote:
> Do it!
>
> But if you can let go of the Elisp requirement here are some examples
> how to do it with everyday GNU/Unix tools:
>
> https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file

I went ahead and did it.  I obtained many solutions, in fact.  Only
today did I check the link above.

First, of the solutions in Emacs Lisp, this one came out as the
quickest:

---
(require 'cl-lib)

(defun buffer-most-used-words-1 (n)
  "Return a list of the N most used words in the current buffer."
  (let ((counts (make-hash-table :test #'equal))
        (words (split-string (buffer-string)))
        sorted-counts)
    ;; Count each word case-insensitively.
    (dolist (word words)
      (let ((key (downcase word)))
        (puthash key (1+ (gethash key counts 0)) counts)))
    ;; Collect (WORD COUNT) pairs, then sort by descending count.
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do (push (list word count) sorted-counts))
    (setq sorted-counts (cl-sort sorted-counts #'> :key #'cl-second))
    ;; Clamp N so a buffer with fewer than N distinct words does not error.
    (mapcar #'cl-first
            (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))
---

Briefly, it obtains a list of the strings in the buffer, hashes them,
puts the words and their counts in a list, sorts it, and lists the first
N words.  (I had also written solutions (1) using alists; (2) using the
handy AVL tree library I found among the Emacs Lisp files in the Emacs
distribution; and (3) reading the words directly and hashing them.  None
beat the above.)

The function is suffixed with '-1' because it is the core of another,
interactive function, which takes the list generated above and displays
it nicely in another buffer.

I was curious about possible solutions in other languages.  I wrote
programs in both Common Lisp and Python, based on the essential hash
table approach.  While a lot faster than the Emacs Lisp solution above,
they were left behind by this old Awk solution (also using hashing) I
found in the classic /The Unix Programming Environment/ by Kernighan and
Pike:

---
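#!/bin/sh

# Count every whitespace-separated field, then list the 10 most frequent.
# "sort -k2,2nr" replaces the book's obsolete "sort +1 -nr" key syntax.
awk '    { for (i = 1; i <= NF; i++) num[$i]++ }
END      { for (word in num) print word, num[word] }
' "$@" | sort -k2,2nr | head -10 | awk '{ print $1 }'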
---

I appended the final awk stage so that only the words are printed,
without their counts.  I wrapped the script in an Emacs command that
displays the words in another buffer, just like my original Emacs Lisp
solution above.
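
The Python program mentioned earlier is not included in this message.
For comparison, here is a minimal sketch of what a hash-table version
might look like.  The function name and the default of 10 are
assumptions made for illustration, not the original program; it uses
collections.Counter (a dict subclass) from the standard library and
lowercases the words as the Emacs Lisp function does.

```python
from collections import Counter


def most_used_words(text, n=10):
    """Return up to n of the most frequent whitespace-separated words.

    Hypothetical helper: words are lowercased before counting, and
    punctuation is left attached, as with split-string in the Lisp code.
    """
    counts = Counter(word.lower() for word in text.split())
    # most_common sorts by descending count and tolerates n > len(counts).
    return [word for word, _ in counts.most_common(n)]
```

Calling most_used_words(open(path).read()) on a file would give roughly
what the shell script prints, apart from the case folding and the order
of words with equal counts.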

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch
