help-gnu-emacs


Re: Any faster way to find frequency of words?


From: Jean Louis
Subject: Re: Any faster way to find frequency of words?
Date: Mon, 10 May 2021 10:14:04 +0300
User-agent: Mutt/2.0.6 (2021-03-06)

* Eric Abrahamsen <eric@ericabrahamsen.net> [2021-05-10 06:38]:
> > It is also useful to generate tags for particular text, that helps me
> > to curate WWW pages.
> 
> Right, but what I meant was, is there anything wrong with the
> implementation you posted?

Thank you. It gives me practically the result I wanted, though I have
not tested it well enough to say whether something is technically
wrong. I use it on smaller chunks of text, where it appears pretty
fast; it would be very slow if I used it on a huge number of
documents.

On a document of 246000 bytes it takes a few seconds. But that is not
a problem: I do not have many such documents and I am not iterating
over them. It is for generation of tags.

I think this is the full set of functions:

(defun hash-to-list (hash)
  "Convert hash HASH to a list of (KEY VALUE) lists."
  (let (list)
    ;; Push and reverse once; `append' in the loop copies the list each time.
    (maphash (lambda (key value) (push (list key value) list)) hash)
    (nreverse list)))

(defun text-alphabetic-only (text)
  "Return alphabetic characters from TEXT."
  (replace-regexp-in-string "[^[:alpha:]]" " " text))

(defun rcd-word-frequency (text &optional length)
  "Return word frequency as hash from TEXT.

Words of LENGTH characters or fewer are discarded from counting.
LENGTH defaults to 3."
  (let* ((hash (make-hash-table :test 'equal))
         (text (text-alphabetic-only text))
         (length (or length 3))
         (words (split-string text " " t " "))
         (words (mapcar 'downcase words))
         ;; Keep only words strictly longer than LENGTH.
         (words (mapcar (lambda (word) (when (> (length word) length) word))
                        words))
         (words (delq nil words)))
    (mapc (lambda (word)
            (puthash word (1+ (gethash word hash 0)) hash))
          words)
    hash))

(defun rcd-word-frequency-list (text &optional length)
  "Return the word frequency list of pairs, sorted by descending count.

First item of the pair is the word, second the word count.

It will analyze TEXT, with minimum word LENGTH."
  (let* ((words (rcd-word-frequency text length))
         (words (hash-to-list words))
         (frequent (seq-sort (lambda (a b)
                               (> (cadr a) (cadr b)))
                             words)))
    frequent))

(defun rcd-word-frequency-string (text &optional length how-many)
  "Return string with most frequent words in TEXT.

Use LENGTH to designate minimum length of words to analyze.

Return HOW-MANY words, 20 by default."
  (let ((frequent (rcd-word-frequency-list text length))
        ;; Without a default, a nil HOW-MANY would signal an error in `-'.
        (how-many (or how-many 20)))
    (mapconcat #'car
               (butlast frequent (- (length frequent) how-many))
               " ")))
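
As a side note, taking the first HOW-MANY entries can also be written
with `seq-take' from seq.el instead of `butlast' with a computed
count. This is only a sketch: `my-top-words' is a hypothetical name,
and it assumes the input is already sorted in the form
`rcd-word-frequency-list' returns.

```elisp
(require 'seq)

(defun my-top-words (pairs how-many)
  "Return the first HOW-MANY words of PAIRS joined by spaces.
PAIRS is a list of (WORD COUNT) lists, assumed to be already sorted.
Hypothetical helper, not one of the functions posted in this thread."
  ;; `seq-take' returns the whole list when it has fewer than
  ;; HOW-MANY elements, so no length arithmetic is needed.
  (mapconcat #'car (seq-take pairs how-many) " "))

;; Usage:
;; (my-top-words '(("word" 44) ("words" 35) ("text" 28)) 2)
```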

(defun rcd-word-frequency-buffer (&optional how-many)
  "Report the HOW-MANY most frequent words in the current buffer.

When HOW-MANY is nil, prompt for it.  Return the report string."
  (interactive)
  (let* ((how-many (or how-many
                       (read-number
                        "How many most frequent words do you wish to see? ")))
         (text (buffer-string))
         (frequent (rcd-word-frequency-list text))
         (report (mapconcat (lambda (a) (format "%s:%s " (car a) (cadr a)))
                            (butlast frequent (- (length frequent) how-many))
                            " ")))
    ;; Pass the report as an argument, not as the format string.
    (message "%s" report)
    report))

(rcd-word-frequency-buffer 10) ⇒ "word:44  words:35  text:28  hash:28  length:25  list:17  frequency:16  frequent:14  many:11  lambda:11 "

> >> I guess I'd suggest using Emacs syntax parsing functions, ie
> >> `forward-word' and `buffer-substring'. Then you can fine tune the
> >> definition of words using the local syntax table.
> >
> > That is also interesting approach, it could just go over the words and
> > enter them into list.
> 
> Yes, and it can help you skip garbage characters that shouldn't count as
> words. Things like `(skip-syntax-forward "^w")` (meaning "skip a run of
> characters that aren't word constituents") can be very useful.

For now I just skip words by their length and count only those made
of alphabetic characters. The purpose is just to generate tags for
HTML pages.
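
A minimal sketch of the syntax-based approach quoted above, for
comparison. It is not one of the functions I posted:
`my-word-frequency-by-syntax' and its MIN-LENGTH argument are
hypothetical names, and it assumes the default syntax table of a
temporary buffer decides what counts as a word.

```elisp
(defun my-word-frequency-by-syntax (text &optional min-length)
  "Return a hash table of word counts for TEXT using syntax scanning.
Words shorter than MIN-LENGTH (default 4) are ignored.
Hypothetical helper, not one of the functions posted in this thread."
  (let ((hash (make-hash-table :test 'equal))
        (min-length (or min-length 4)))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      ;; Skip a run of characters that are not word constituents.
      (skip-syntax-forward "^w")
      (while (not (eobp))
        (let ((start (point)))
          ;; Move over one run of word-constituent characters.
          (skip-syntax-forward "w")
          (let ((word (downcase
                       (buffer-substring-no-properties start (point)))))
            (when (>= (length word) min-length)
              (puthash word (1+ (gethash word hash 0)) hash))))
        (skip-syntax-forward "^w")))
    hash))

;; Usage:
;; (gethash "emacs" (my-word-frequency-by-syntax "Emacs emacs, EMACS!"))
```

Running the same scan in the original buffer, with that buffer's own
syntax table, would let the major mode decide what a word is.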

Once tags have been generated, I can use a PostgreSQL database to find
documents with the most frequent tags.

Generation of tags is human curated, not automatic. Thus such a
function is invoked on specific documents only. It suggests tags for
me to edit; it does not create tags without my supervision.

For example, "https" is not a very useful tag if the article does not
discuss it, so I have to delete such tags.

> > Words smaller than LENGTH are discarded from counting."
> >   (let* ((hash (make-hash-table :test 'equal))
> >      (text (text-alphabetic-only text))
> >      (length (or length 3))
> >      (words (split-string text " " t " "))
> >      (words (mapcar 'downcase words))
> >      (words (mapcar (lambda (word) (when (> (length word) length) word)) 
> > words))
> >      (words (delq nil words)))
> >     (mapc (lambda (word)
> >         (puthash word (1+ (gethash word hash 0)) hash))
> 
> I totally forgot that `gethash' has a default argument! So the line
> above can just be:
> 
> (cl-incf (gethash word hash 0))

You like cl-incf and I use 1+; I am not sure whether this macro would
slow it down, which is why I tend to skip macros. And if I wish to
make a package for word frequencies, it would not need to require the
cl-lib library.

(defmacro cl-incf (place &optional x)
  "Increment PLACE by X (1 by default).
PLACE may be a symbol, or any generalized variable allowed by `setf'.
The return value is the incremented value of PLACE."
  (declare (debug (place &optional form)))
  (if (symbolp place)
      (list 'setq place (if x (list '+ place x) (list '1+ place)))
    (list 'cl-callf '+ place (or x 1))))
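
For comparison, here is a small sketch of both counting forms side by
side. The two function names are hypothetical and exist only for this
illustration; both should produce the same counts. In byte-compiled
code a macro is expanded at compile time, so any speed difference
would have to be measured rather than assumed.

```elisp
(require 'cl-lib)

(defun my-count-with-puthash (words)
  "Count WORDS into a new hash table using `puthash' and `1+'."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (word words hash)
      (puthash word (1+ (gethash word hash 0)) hash))))

(defun my-count-with-incf (words)
  "Count WORDS into a new hash table using `cl-incf'."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (word words hash)
      ;; `gethash' with a default works as a generalized variable here.
      (cl-incf (gethash word hash 0)))))
```

Timing can be compared with something like
(benchmark-run 100 (my-count-with-puthash words)) and the same call
for the other function, on a word list of realistic size.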

> > (defun rcd-word-frequency-string (text &optional length how-many-words)
> >   (let* ((words (rcd-word-frequency text length))
> >      (words (hash-to-list words))
> >      (number (or how-many-words 20))
> >      (frequent (seq-sort (lambda (a b)
> >                            (> (cadr a) (cadr b)))
> >                          words)))
> >     (mapconcat (lambda (a) (car a)) (butlast frequent (- (length frequent) 
> > number)) " ")))
> 
> I don't have a `hash-to-list' function, but once you've built your table
> it seems like the rest of it is fairly straightforward.

I use those functions below.

;;;; ━━━━━━━━━━━━━━━━━━
;;;;   HASH FUNCTIONS
;;;; ━━━━━━━━━━━━━━━━━━

(defun hash-to-plist (hash)
  "Convert hash HASH to plist."
  (let (plist)
    (maphash (lambda (key value) (push key plist) (push value plist)) hash)
    (reverse plist)))

(defun hash-to-alist (hash)
  "Convert hash HASH to alist."
  (let (alist)
    (maphash (lambda (key value) (push (cons key value) alist)) hash)
    alist))

(defun hash-to-list (hash)
  "Convert hash HASH to a list of (KEY VALUE) lists."
  (let (list)
    ;; Push and reverse once; `append' in the loop copies the list each time.
    (maphash (lambda (key value) (push (list key value) list)) hash)
    (nreverse list)))

(defun hash-append (h1 &rest hashes)
  "Return H1 hash appended with HASHES."
  (mapc 
   (lambda (hash)
     (maphash 
      (lambda (key value) (puthash key value h1)) hash))
   hashes)
  h1)



-- 
Jean

Take action in Free Software Foundation campaigns:
https://www.fsf.org/campaigns

Sign an open letter in support of Richard M. Stallman
https://stallmansupport.org/
https://rms-support-letter.github.io/



