Re: Language identification

emacs-devel


From:	Juri Linkov
Subject:	Re: Language identification
Date:	Fri, 28 Aug 2009 22:08:28 +0300
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (x86_64-pc-linux-gnu)

>>> In `auto-mode-alist' you can see that with the exception of
>>> `archive-mode', `doc-view-mode' and `image-mode', all remaining
>>> modes are programming text modes.  It would be more useful
>>> to identify file types for these modes that libmagic can't identify.
>>> Do you know a library that identifies programming languages?
>>> Such a library might be implemented using a Bayesian classifier
>>> trained on a sufficiently large corpus of different programming
>>> languages.
>>
>> N-gram algorithms could be used to identify languages; they are simpler
>> than Bayes and require a smaller database.
>
> Sorry, I missed that this was about programming languages, not natural
> languages.

It would be interesting to try N-gram algorithms on programming
languages and see how well they perform.  For example, the most
frequently used bigram "/*" would point to C, the most frequently used
trigram ";;;" would point to Lisp, etc.
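
To make the idea concrete, here is a rough sketch in Python (not Emacs
Lisp, and not an existing library).  It builds a character bigram and
trigram profile per language and picks the profile most similar to the
input by cosine similarity.  The names `ngram_profile', `cosine',
`guess_language' and the two tiny training snippets are all made up for
illustration; a real classifier would need a much larger corpus per
language.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, sizes=(2, 3)):
    """Count every character n-gram of the given sizes in text."""
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (sqrt(sum(c * c for c in a.values()))
            * sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


# Toy training snippets, invented for this sketch only (an assumption,
# not a real corpus).
SAMPLES = {
    "c": "/* add two ints */\nint add(int a, int b)\n{\n"
         "    return a + b;  /* sum */\n}\n",
    "lisp": ";;; add two numbers\n(defun add (a b)\n"
            "  \"Return the sum.\"\n  (+ a b))\n;;; end\n",
}
PROFILES = {lang: ngram_profile(src) for lang, src in SAMPLES.items()}


def guess_language(text):
    """Return the language whose profile is most similar to text."""
    target = ngram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(target, PROFILES[lang]))
```

Calling `guess_language' on the first few hundred characters of a
buffer would then yield a key of SAMPLES.  Cosine similarity is just
one simple way to compare profiles; a rank-based distance over the most
frequent N-grams would be another option to try.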

-- 
Juri Linkov
http://www.jurta.org/emacs/
