Re: Most used words in current buffer

help-gnu-emacs

From:	Udyant Wig
Subject:	Re: Most used words in current buffer
Date:	Sun, 22 Jul 2018 01:09:25 +0530
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

On 07/21/2018 11:52 PM, Stefan Monnier wrote:
>> (defun buffer-most-used-words-2 (n)
>>   "Make a list of the N most used words in buffer."
>>   (let ((counts (avl-tree-create (lambda (wc1 wc2)
>>                                 (string< (first wc1) (first wc2)))))
>>      (words (split-string (buffer-string)))
>
> If you want to go fast, don't use split-string+buffer-string.  Scan
> through the buffer and extract each word with buffer-substring
> directly.
>
>>       (let ((element (avl-tree-member counts (list (downcase word) 0))))
>
> I'd use a hash-table (implemented in C) rather than an avl-tree
> (implemented in Elisp).

After spending (too) many hours on this, I believe that I have a better
solution.

---
(require 'cl-lib)

;; Can this hack be made better?
(defun whitespace-p (char)
  "Return non-nil if CHAR is a tab, newline, carriage return, or space."
  (or (eq char 9) (eq char 10) (eq char 13) (eq char 32)))

(defun buffer-most-used-words-3 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
        sorted-counts)
    (save-excursion
      (goto-char (point-min))
      (cl-loop with word = nil
               with start = 0
               with end = 0
               with state = 'space
               with char = nil
               until (eobp)
               do
               (setf char (char-after))
               (cond ((eq state 'space)
                      (when (not (whitespace-p char))
                        (setf start (point)
                              state 'word)))
                     ((eq state 'word)
                      (when (whitespace-p char)
                        (setf end (point)
                              state 'space
                              word (buffer-substring start end))
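                        (cl-incf (gethash word counts 0)))))
               (forward-char)
               ;; Count a final word that runs up to the end of the buffer.
               finally
               (when (eq state 'word)
                 (cl-incf (gethash (buffer-substring start (point)) counts 0)))))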
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do
             (push (list word count) sorted-counts)
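             finally (setf sorted-counts (cl-sort sorted-counts #'>
                                                  :key #'cl-second)))
    ;; Clamp N so that asking for more words than exist is not an error.
    (mapcar #'cl-first
            (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))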
---
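
As a possible answer to the "can this hack be made better?" comment, here is a minimal sketch of a variant that lets `skip-chars-forward` do the scanning instead of testing one character at a time.  The name `my-buffer-most-used-words-skip` is hypothetical, the same four whitespace characters are assumed to separate words, and this variant has not been timed.

```elisp
(require 'cl-lib)

(defun my-buffer-most-used-words-skip (n)
  "Return a list of the N most used words in the current buffer.
Hypothetical variant: a word is a maximal run of characters other
than space, tab, newline and carriage return."
  (let ((counts (make-hash-table :test #'equal))
        pairs)
    (save-excursion
      (goto-char (point-min))
      ;; Skip any whitespace before the first word.
      (skip-chars-forward " \t\n\r")
      (while (not (eobp))
        (let ((start (point)))
          ;; Move over one word, count it, then skip the whitespace after it.
          (skip-chars-forward "^ \t\n\r")
          (cl-incf (gethash (buffer-substring-no-properties start (point))
                            counts 0))
          (skip-chars-forward " \t\n\r"))))
    ;; Collect (WORD . COUNT) pairs and sort them by descending count.
    (maphash (lambda (word count) (push (cons word count) pairs)) counts)
    (setq pairs (sort pairs (lambda (a b) (> (cdr a) (cdr b)))))
    ;; Clamp N so a short buffer does not signal an error.
    (mapcar #'car (cl-subseq pairs 0 (min n (length pairs))))))
```

Because the loop ends only at the end of the buffer, a last word without trailing whitespace is counted as well.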

In regard to performance, BUFFER-MOST-USED-WORDS-3 is slightly faster
than BUFFER-MOST-USED-WORDS-1, which used SPLIT-STRING on BUFFER-STRING
together with a hash-table.  Here are timings over ten runs each on a
4.5 MB text file:

buffer-most-used-words-1:    4.7362510517 seconds
buffer-most-used-words-3:    4.4849896529 seconds
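
For anyone who wants to repeat such a comparison, the following is one way it might be set up with `benchmark-elapse`.  It is only a sketch: the helper name `my-time-word-counter` and the argument 10 passed to the function under test are arbitrary choices, and it returns the total over all runs, which is not necessarily how the figures above were obtained.

```elisp
(require 'benchmark)

(defun my-time-word-counter (fn file runs)
  "Return the total seconds taken by RUNS calls of FN on the text of FILE.
FN is called with the (arbitrary) argument 10 in a temporary
buffer holding the contents of FILE."
  (with-temp-buffer
    (insert-file-contents file)
    (let ((total 0.0))
      (dotimes (_ runs)
        ;; `benchmark-elapse' returns the elapsed time of its body in seconds.
        (setq total (+ total (benchmark-elapse (funcall fn 10)))))
      total)))
```

A call would then look like (my-time-word-counter #'buffer-most-used-words-3 "/path/to/file.txt" 10), where the path is a placeholder.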

>         Stefan

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch
