Re: Language identification
From: Juri Linkov
Subject: Re: Language identification
Date: Fri, 28 Aug 2009 22:08:28 +0300
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (x86_64-pc-linux-gnu)

>>> In `auto-mode-alist' you can see that with the exception of
>>> `archive-mode', `doc-view-mode' and `image-mode', all remaining
>>> modes are programming text modes. It would be more useful
>>> to identify file types for these modes that libmagic can't do.
>>> Do you know a library that identifies programming languages?
>>> Such a library might be implemented using a Bayesian classifier
>>> trained on a sufficiently large corpus of different programming
>>> languages.
>>
>> N-gram algorithms could be used to identify languages; they are simpler
>> than Bayes and require a smaller database.
>
> Sorry, I missed that this was about programming languages, not natural
> languages.

It would be interesting to try using N-gram algorithms for programming
languages and see how well they perform.  For example, the most frequently
used bigram "/*" belongs to C, the most frequently used trigram ";;;"
belongs to Lisp, etc.
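
A minimal sketch of how such an experiment might look, in Python.  It
assumes ranked character n-gram profiles compared with an "out-of-place"
rank distance, as commonly used for natural-language identification;
nothing in this thread prescribes that method.  The names `ngram_profile`,
`out_of_place`, `identify`, the profile size and the two toy samples are
all invented for illustration, and a real test would need a large corpus
per language.

```python
from collections import Counter

TOP = 300  # profile size; an arbitrary choice for this sketch


def ngram_profile(text, sizes=(2, 3), top=TOP):
    """Rank the character n-grams of `text`, most frequent first."""
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if not gram.isspace():  # skip pure-whitespace n-grams
                counts[gram] += 1
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(profile, reference, penalty=TOP):
    """Sum of rank differences; n-grams missing from `reference` cost `penalty`."""
    rank = {gram: i for i, gram in enumerate(reference)}
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(profile))


def identify(text, references):
    """Return the key of the reference profile closest to `text`."""
    profile = ngram_profile(text)
    return min(references,
               key=lambda lang: out_of_place(profile, references[lang]))


# Toy training corpora, invented for this example (assumption: far too
# small for real use).
C_SAMPLE = """/* add two ints */
int add(int a, int b)
{
    /* sum */
    return a + b;
}
"""

LISP_SAMPLE = """;;; add two numbers
(defun add (a b)
  ;; sum
  (+ a b))
"""

REFERENCES = {
    "c": ngram_profile(C_SAMPLE),
    "lisp": ngram_profile(LISP_SAMPLE),
}

print(identify("/* x */", REFERENCES))
print(identify(";;; y", REFERENCES))
```

With samples this small the comment markers dominate the profiles, which
is the effect described above; whether it holds up on real source files
with few comments is exactly what would have to be measured.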
--
Juri Linkov
http://www.jurta.org/emacs/