Re: Most used words in current buffer
From: Udyant Wig
Subject: Re: Most used words in current buffer
Date: Sun, 22 Jul 2018 01:09:25 +0530
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
On 07/21/2018 11:52 PM, Stefan Monnier wrote:
>> (defun buffer-most-used-words-2 (n)
>> "Make a list of the N most used words in buffer."
>> (let ((counts (avl-tree-create (lambda (wc1 wc2)
>> (string< (first wc1) (first wc2)))))
>> (words (split-string (buffer-string)))
>
> If you want to go fast, don't use split-string+buffer-string. Scan
> through the buffer and extract each word with buffer-substring
> directly.
>
>> (let ((element (avl-tree-member counts (list (downcase word) 0))))
>
> I'd use a hash-table (implemented in C) rather than an avl-tree
> (implemented in Elisp).
After spending (too) many hours on this, I believe that I have a better
solution.
---
(require 'cl-lib)

;; Can this hack be made better?
(defun whitespace-p (char)
  "Return non-nil if CHAR is a tab, newline, carriage return, or space."
  (or (eq char 9) (eq char 10) (eq char 13) (eq char 32)))

(defun buffer-most-used-words-3 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (make-hash-table :test #'equal))
        sorted-counts)
    (save-excursion
      (goto-char (point-min))
      (cl-loop with word = nil
               with start = 0
               with end = 0
               with state = 'space
               with char = nil
               until (eobp)
               do
               (setf char (char-after))
               (cond ((eq state 'space)
                      (when (not (whitespace-p char))
                        (setf start (point)
                              state 'word)))
                     ((eq state 'word)
                      (when (whitespace-p char)
                        (setf end (point)
                              state 'space
                              word (buffer-substring start end))
                        (cl-incf (gethash word counts 0)))))
               (forward-char)
               finally
               ;; Count a final word that runs up to the end of the buffer.
               (when (eq state 'word)
                 (cl-incf (gethash (buffer-substring start (point)) counts 0)))))
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do
             (push (list word count) sorted-counts)
             finally (setf sorted-counts (cl-sort sorted-counts #'>
                                                  :key #'cl-second)))
    ;; Clamp N so that a buffer with fewer than N distinct words is fine.
    (mapcar #'cl-first (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))
---
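One possible answer to the question in the comment above, offered only as a sketch: the per-character state machine can be replaced by `skip-chars-forward`, which does the scanning in C. The names `my-buffer-word-counts` and `my-buffer-most-used-words` are hypothetical and not part of the code above, and this variant has not been timed against the functions in this thread. It assumes the same definition of whitespace (tab, newline, carriage return, space).

```elisp
(require 'cl-lib)

(defun my-buffer-word-counts ()
  "Return a hash table mapping each whitespace-delimited word to its count.
Hypothetical helper, for illustration only."
  (let ((counts (make-hash-table :test #'equal)))
    (save-excursion
      (goto-char (point-min))
      ;; Skip leading whitespace, then alternate between words and gaps.
      (skip-chars-forward " \t\n\r")
      (while (not (eobp))
        (let ((start (point)))
          ;; Move over one run of non-whitespace characters.
          (skip-chars-forward "^ \t\n\r")
          (cl-incf (gethash (buffer-substring-no-properties start (point))
                            counts 0)))
        (skip-chars-forward " \t\n\r")))
    counts))

(defun my-buffer-most-used-words (n)
  "Return a list of at most N words, most frequent first."
  (let (pairs)
    ;; Collect (WORD . COUNT) pairs from the hash table.
    (maphash (lambda (word count) (push (cons word count) pairs))
             (my-buffer-word-counts))
    (setq pairs (sort pairs (lambda (a b) (> (cdr a) (cdr b)))))
    (mapcar #'car (cl-subseq pairs 0 (min n (length pairs))))))
```

Because the loop ends at `eobp` only after a complete word has been consumed, a word that runs up to the end of the buffer is counted without a special case.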
In regard to performance, it is slightly better than
BUFFER-MOST-USED-WORDS-1, which used a combination of SPLIT-STRING on
BUFFER-STRING along with a hash-table. Here are timings over ten runs
each for a 4.5 MB text file:
buffer-most-used-words-1: 4.7362510517 seconds
buffer-most-used-words-3: 4.4849896529 seconds
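
The message does not say how these timings were taken, nor whether they are totals or averages. A minimal sketch of one way to time ten runs, using the standard `benchmark-run` macro, might look like this. The helper name `my-time-ten-runs` is hypothetical; it would be called from the buffer holding the text file.

```elisp
(require 'benchmark)

(defun my-time-ten-runs (fn n)
  "Return the total elapsed seconds for ten calls of FN with argument N.
Hypothetical helper; FN would be e.g. `buffer-most-used-words-3'."
  ;; `benchmark-run' returns a list (ELAPSED GC-COUNT GC-ELAPSED);
  ;; only the elapsed wall-clock time is kept here.
  (car (benchmark-run 10 (funcall fn n))))
```

For instance, `(my-time-ten-runs #'buffer-most-used-words-3 10)` evaluated in the buffer of interest returns a single float. Note that the elapsed time includes garbage collection, which can matter for the SPLIT-STRING variant since it allocates the whole buffer text as a string.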
> Stefan
Udyant Wig
--
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
-- Arthur Quiller-Couch