help-gnu-emacs

Re: How to generate a wordlist for a document


From: Richard Fieldsend
Subject: Re: How to generate a wordlist for a document
Date: Tue, 16 Aug 2011 14:33:56 +0100 (BST)

Hi Thorsten,
you haven't mentioned which OS you are running, or whether you want to include 
LaTeX commands.  Assuming that you are only interested in the text of the 
document I would recommend the following steps:

1) For each of the files in your multi-file document run 'detex' to remove all 
of the TeX and LaTeX formatting.
2) Collect the detex'd versions of the files into a single file using cat, 
repeating the command for each file:

cat file1 >> completefile

3) Split that file into one word per line, sort it, and reduce each word to a 
single occurrence:

grep -o -E '\w+' completefile | sort | uniq > output
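
A minimal sketch of steps 1 to 3 wrapped in one shell function, assuming detex 
writes the stripped text to standard output. The function name make_wordlist 
and the DETEX variable are hypothetical, introduced only for illustration; the 
intermediate completefile is replaced by a pipe.

```shell
# Hypothetical helper, not part of any package.
# Usage: make_wordlist output chapter1.tex chapter2.tex
# DETEX is an assumed override hook; it defaults to the detex program.
make_wordlist() {
    out=$1
    shift
    for f in "$@"; do
        "${DETEX:-detex}" "$f"      # step 1: strip TeX/LaTeX markup
    done |                          # step 2: the combined output stands in for completefile
        grep -o -E '\w+' |          # step 3: one word per line
        sort | uniq > "$out"        #         sorted, each word listed once
}
```

Setting DETEX to another program (cat, for instance) lets you try the function 
on plain text files without detex installed.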

If you need word frequency information, uniq can prepend the number of 
occurrences to each word (its -c option).
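
A possible sketch of the frequency variant, assuming the input files have 
already been through detex. The function name wordfreq is hypothetical; the 
final sort -rn, which puts the most frequent words first, is an addition of 
this sketch rather than something required.

```shell
# Hypothetical helper: count how often each word occurs.
# Usage: wordfreq completefile > frequencies
wordfreq() {
    cat "$@" |
        grep -o -E '\w+' |   # one word per line
        sort |               # group identical words together
        uniq -c |            # prefix each word with its count
        sort -rn             # most frequent words first
}
```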

For the record, this doesn't lowercase anything, so the same word is likely to 
appear more than once with different capitalisation (e.g. "The" and "the").
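
If that matters, one way around it is to lowercase the text with tr before 
splitting it into words. This is only a sketch: the function name 
wordlist_lower is hypothetical, and the input is again assumed to be detex'd 
text.

```shell
# Hypothetical helper: case-insensitive list of unique words.
# Usage: wordlist_lower completefile > output
wordlist_lower() {
    cat "$@" |
        tr '[:upper:]' '[:lower:]' |   # fold everything to lower case
        grep -o -E '\w+' |             # one word per line
        sort | uniq                    # sorted, each word listed once
}
```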

HTH

Richard

----- Original Message -----
From: Thorsten <quintfall@googlemail.com>
To: help-gnu-emacs@gnu.org
Cc: 
Sent: Monday, 15 August 2011, 22:20
Subject: How to generate a wordlist for a document

Hi list,
how do I generate a word list for a document in Emacs (in my case a
multi-file LaTeX document)?
(By wordlist I mean a list of all unique words in the document)
Thanks for any hints
Thorsten


