help-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Most used words in current buffer


From: Udyant Wig
Subject: Re: Most used words in current buffer
Date: Thu, 19 Jul 2018 18:56:09 +0530
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

On 07/19/2018 12:34 PM, Bob Proulx wrote:
> Hmm...  I think looking behind the abstract data type at how it might
> or might not be implemented is a stretch to say the least.  The entire
> reason it is an abstract data type is to hide those types of
> implementation details.  :-)

Indeed.  The principle generalizes to abstract functionality as well,
doesn't it?  E.g. qsort in the C library may or may not be an
implementation of quick sort; or closer to the topic of the newsgroup,
one ought not to care about the actual algorithm of the SORT function in
Emacs.

Or, to illustrate the point in another way, I like that I can operate on
sequences in Emacs Lisp, rather than have to commit to either lists or
vectors right at the start.

> The naming of things usually says more about the person that named the
> thing than the thing itself.  Associative arrays is a naming that
> reflects the concepts involved in what it does.  This is the same as
> when someone else names it a map table, or a dictionary.  Those are
> all the same thing.  Just using different names because people took
> different paths to get there.

Yes.  Just as, arguably, vectors are a special case of the general
concept of arrays, though the terms are commonly used to name the same
thing.

> Hash tables are attractive because in their perfect form they are O(1)
> order of the function being a fixed amount for any lookup.  Being a
> fixed amount sounds very good and therefore authors often implement
> using hash tables.  However achieving that perfect O(1) performance
> requires knowing the size of the needed hash table at creation time so
> that it can be sized correctly from the start and that is not the case
> here.  Therefore when implemented as a hash table the table will need
> to be grown at various intervals as the number of entries contained is
> increased.  That growth is lumped and consumes time and resources for
> the growth at the time the table is grown.
>
> For such things I generally prefer balanced tree structures because
> work is amortized instead of lumped.  But the important point here is
> that for every algorithm + data structure there is a trade-off of some
> sort between one thing and another thing.

Hmm.  I had written a tree version of the word counter I had mentioned
before.  I had stumbled upon the AVL tree package in Emacs and thought I
might try using it.  This tree-based attempt turned out to be slower
than my straightforward hashing solution.

I have no doubts this code could be written better by someone more
experienced than I.

---
(require 'cl-lib)
(require 'avl-tree)

(defun buffer-most-used-words-2 (n)
  "Make a list of the N most used words in buffer."
  (let ((counts (avl-tree-create (lambda (wc1 wc2)
                                   (string< (first wc1) (first wc2)))))
        (words (split-string (buffer-string)))
        sorted-counts)
    (dolist (word words)
      (let ((element (avl-tree-member counts (list (downcase word) 0))))
        (if element
            (progn
              (avl-tree-delete counts element)
              (avl-tree-enter counts (list (downcase word)
                                           (1+ (second element)))))
            (avl-tree-enter counts (list (downcase word) 1)))))
    (setf sorted-counts (cl-sort (avl-tree-flatten counts) #'>
                                 :key #'second))
    (mapcar #'first (cl-subseq sorted-counts 0 n))))
---

> I am in total agreement over using sed instead of head if you want to
> do that.  Seeing 'sed 20q' should roll off the keyboard as print lines
> until line 20 and then quit.  Very simple and to the point.  There is
> definitely no need for a separate head command.  Other than for
> symmetry with tail which is not as simple in sed.

I see that.  You could implement head on top of sed if you wanted to.  I
myself have been using head for long enough for its stated purpose that
grasping a sed equivalent was not immediately obvious.

> I didn't go look up the reference in the book before posting my first
> reply.  I probably should have refreshed my knowledge of it.  Then I
> would have realized that we weren't talking about literal hashing of
> tokens (like what is done in SpamAssassin's Bayes engine for one
> example) but rather use of "hash" as a table lookup method.  That
> caused me to say that first.  Sorry.

No, I made the original mistake.  I should have made it clear that I was
speaking of hash tables.  The general word 'hashing' was misleading.
Mea culpa.

> As to the old style command line options for head and sort I was just
> wanting to keep in people's minds that since a decade ago or so the
> new format of the command line options is now the preferred one.  That
> old style is certainly how we wrote those things back in the day
> though.

These things do take time to gain currency, don't they?  Under Linux,
for example, the ip set of commands has been named the successor to
ifconfig, and it too is taking time to diffuse into general knowledge.

(And, although there have been a number of revisions of Standard C since
1989/1990, a lot of projects still write to that now legacy standard.
But there may be other issues to consider here.)

> Bob

Udyant Wig
-- 
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
                                -- Arthur Quiller-Couch



reply via email to

[Prev in Thread] Current Thread [Next in Thread]