Re: How to generate a wordlist for a document
From: Richard Fieldsend
Subject: Re: How to generate a wordlist for a document
Date: Tue, 16 Aug 2011 14:33:56 +0100 (BST)
Hi Thorsten,
You haven't mentioned which OS you are running, or whether you want to include
LaTeX commands. Assuming that you are only interested in the text of the
document, I would recommend the following steps:
1) For each of the files in your multi-file document run 'detex' to remove all
of the TeX and LaTeX formatting.
2) Combine the detex'd versions of the files into a single file using cat,
repeating this for each file:
cat file1 >> completefile
3) You can then put one word per line, sort the result, and make each
word appear just once by doing the following:
grep -o -E '\w+' completefile | sort | uniq > output
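Steps 1 to 3 can also be chained without the intermediate file. This is only a rough sketch: the function name unique_words and the file names in the usage comment are made up for illustration, and it assumes GNU grep, where \w matches letters, digits and underscores.

```shell
# unique_words: read plain text on stdin, print each distinct word once, sorted.
# (hypothetical helper name, not part of any tool)
unique_words() {
    # -o prints every match on its own line; -E enables extended regexes
    grep -o -E '\w+' | sort | uniq
}

# Possible usage with detex (file names are placeholders):
#   for f in chapter1.tex chapter2.tex; do detex "$f"; done | unique_words > output
```

Because the function reads standard input, it can be tried on any text file before running it on the detex'd document.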
If you need word frequency information then you can make uniq prepend the
number of occurrences by using 'uniq -c' instead of plain 'uniq'.
For the record, this doesn't lowercase anything, so the same word is likely to
appear more than once with different capitalisation (e.g. "The" and "the").
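If the capitalisation variants are unwanted, one possible approach is to lowercase the words with tr before sorting. The sketch below also counts occurrences and lists the most frequent words first. The function name word_freq is hypothetical, and GNU grep is assumed again.

```shell
# word_freq: read plain text on stdin, print "count word" lines,
# most frequent first, treating "The" and "the" as the same word.
# (hypothetical helper name)
word_freq() {
    grep -o -E '\w+' |
        tr '[:upper:]' '[:lower:]' |   # fold everything to lowercase
        sort |
        uniq -c |                      # prepend the number of occurrences
        sort -rn                       # highest count first
}

# Possible usage:
#   word_freq < completefile > output
```

Note that uniq -c pads the counts with leading spaces, so the output may need trimming if it is fed to another program.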
HTH
Richard
----- Original Message -----
From: Thorsten <quintfall@googlemail.com>
To: help-gnu-emacs@gnu.org
Cc:
Sent: Monday, 15 August 2011, 22:20
Subject: How to generate a wordlist for a document
Hi list,
How do I generate a word list for a document in Emacs (in my case a
multi-file LaTeX document)?
(By "wordlist" I mean a list of all unique words in the document.)
Thanks for any hints
Thorsten