Re: Most used words in current buffer

help-gnu-emacs

From:	Udyant Wig
Subject:	Re: Most used words in current buffer
Date:	Sun, 22 Jul 2018 23:49:01 +0530
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

On 07/22/2018 09:27 AM, Eric Abrahamsen wrote:
> As Stefan said, going character by character is going to be
> slow... But my example with `forward-word' collects a lot of cruft. So
> I would suggest doing what `forward-word' does internally and move by
> syntax.  This also opens up the possibility of tweaking the behavior
> of your function (ie, what constitutes a word) by setting temporary
> syntax tables. Here's a word scanner that only picks up actual words
> (according to the default syntax table):
>
> (defun test-buffer (&optional f)
>   (let ((file (or f "/home/eric/org/hollowmountain.org"))
>         pnt lst)
>     (with-temp-buffer
>       (insert-file-contents file)
>       (goto-char (point-min))
>       (skip-syntax-forward "^w")
>       (setq pnt (point))
>       (while (and (null (eobp)) (skip-syntax-forward "w"))
>         (push (buffer-substring pnt (point)) lst)
>         (skip-syntax-forward "^w")
>         (setq pnt (point))))
>     (nreverse lst)))

Thank you for the idea!  It did wonders for the running time.  My
adaptation of your approach follows, with a sample timing after the
code.

---
(require 'cl-lib)

(defun buffer-most-used-words-4 (n)
  "Return a list of the N most used words in the current buffer."
  (let ((counts (make-hash-table :test #'equal))
        sorted-counts
        start
        end)
    (save-excursion
      (goto-char (point-min))
      ;; Skip any leading non-word characters.
      (skip-syntax-forward "^w")
      (setf start (point))
      (cl-loop until (eobp)
               do
               ;; Move over one word and count it.
               (skip-syntax-forward "w")
               (setf end (point))
               (cl-incf (gethash (buffer-substring-no-properties start end)
                                 counts 0))
               ;; Move to the start of the next word.
               (skip-syntax-forward "^w")
               (setf start (point))))
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do
             (push (list word count) sorted-counts)
             finally (setf sorted-counts (cl-sort sorted-counts #'>
                                                  :key #'cl-second)))
    ;; Clamp N so that a buffer with fewer than N distinct words
    ;; does not signal an error.
    (mapcar #'cl-first
            (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))
---

Compiled, this version takes about half the time that the previous
version, which went character by character, took to process a 4.5 MB
text file.

Average timing over ten runs on that file: 2.75 seconds.
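
For anyone wanting to repeat such a measurement, here is a minimal
sketch of one way to average wall-clock time over several runs.  The
helper name `my-average-seconds' is made up for illustration and is not
how the figure above was necessarily obtained; the result includes any
time spent in garbage collection.

```elisp
;; Hypothetical helper, not part of the original post.
(defun my-average-seconds (fn runs)
  "Return the average wall-clock seconds one call of FN takes.
FN is called with no arguments RUNS times in a row."
  (let ((start (float-time)))
    (dotimes (_ runs)
      (funcall fn))
    ;; `float-time' returns a float, so this is float division.
    (/ (- (float-time) start) runs)))
```

With the function from this message loaded and the text file visited
in the current buffer, a call might look like
(my-average-seconds (lambda () (buffer-most-used-words-4 10)) 10).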


As for syntax tables: the ability to decide what counts as a word or
other construct in a buffer could be very handy indeed.  One
application beyond prose that comes to mind is counting the most used
variable or function names in a file of source code.  There may be
others, of course.
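
As a rough sketch of that source-code idea, the same scanning loop can
run under a temporary syntax table in which `-' and `_' are word
constituents, so that an identifier such as foo-bar counts as one
word.  The function name `my-count-identifiers' and the choice of
characters are assumptions made for illustration; a real language
would need its own table.

```elisp
(require 'cl-lib)

;; Hypothetical helper, not part of the original post.
(defun my-count-identifiers (text)
  "Return an alist of (IDENTIFIER . COUNT) for TEXT, most frequent first.
Assumes identifiers are runs of word characters plus `-' and `_'."
  (let ((table (copy-syntax-table (standard-syntax-table)))
        (counts (make-hash-table :test #'equal))
        result)
    ;; Treat `-' and `_' as word constituents in the temporary table.
    (modify-syntax-entry ?- "w" table)
    (modify-syntax-entry ?_ "w" table)
    (with-temp-buffer
      (insert text)
      (with-syntax-table table
        (goto-char (point-min))
        (skip-syntax-forward "^w")
        (while (not (eobp))
          (let ((start (point)))
            (skip-syntax-forward "w")
            (cl-incf (gethash (buffer-substring-no-properties start (point))
                              counts 0)))
          (skip-syntax-forward "^w"))))
    ;; Collect the table into an alist and sort by descending count.
    (maphash (lambda (word count) (push (cons word count) result)) counts)
    (sort result (lambda (a b) (> (cdr a) (cdr b))))))
```

For example, (my-count-identifiers "(foo-bar 1) (foo-bar baz_qux)")
counts foo-bar twice and baz_qux once.  Note that digits are also word
constituents, so the literal 1 is counted as well.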

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch
